Idea I've had for years but never got around to testing due to lack of compute:
Use a microphone array and LIDAR for ground truth, and train a diffusion model to "imagine" what the world looks like conditioned on some signal transformations of the microphone data only.
Could be used by autonomous vehicles to "see" pedestrians through bushes, detect oncoming emergency vehicles early, hear bicyclists before they are visible, and lots of other good things.
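Concretely, the "signal transformations" I'm picturing are something like per-mic spectrograms plus inter-mic phase. A rough, untested sketch of just the conditioning side (the feature choices and shapes are guesses, and the diffusion model itself is omitted):

    # Sketch of conditioning features for the mic array; not a tested pipeline.
    # Assumes a diffusion model (not shown) trained to denoise LIDAR occupancy
    # grids conditioned on these features.
    import numpy as np

    def stft_frames(x, n_fft=512, hop=256):
        """Windowed FFT frames of a 1-D signal; shape (n_frames, n_fft//2+1)."""
        win = np.hanning(n_fft)
        starts = range(0, len(x) - n_fft, hop)
        return np.stack([np.fft.rfft(win * x[s:s + n_fft]) for s in starts])

    def conditioning_features(audio):
        """Per-mic log-magnitude spectrograms plus inter-mic phase differences.

        The phase differences carry the direction-of-arrival information,
        which is what a model would need to 'place' sound sources.
        """
        specs = np.stack([stft_frames(ch) for ch in audio])  # (mics, frames, bins)
        log_mag = np.log1p(np.abs(specs))
        # Phase of each mic relative to mic 0.
        rel_phase = np.angle(specs[1:] * np.conj(specs[:1]))
        return np.concatenate([log_mag, rel_phase], axis=0)  # conditioning tensor

    # Stand-in for a 1-second, 4-mic capture at 48 kHz:
    audio = np.random.default_rng(0).standard_normal((4, 48_000))
    cond = conditioning_features(audio)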
This already exists; it's the domain of inverse problems. An inverse problem starts from a forward problem (in this case, wave propagation) that depends on some physical parameters or domain geometry, and deduces those parameters or that geometry from observations.
Conceptually, it's quite simple: you derive the gradient of the output error with respect to the sought information, then use it to minimize that error (called the "loss function" or "objective", depending on the field's terminology), just as you do with neural networks.
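A toy example of that recipe, with a made-up forward model (travel times from a point source to known receivers) and a hand-derived gradient; real acoustic inversion is far harder:

    # Toy inverse problem: recover a sound-source position from arrival
    # times at known receivers, by gradient descent on the data misfit.
    import numpy as np

    c = 343.0                                   # speed of sound, m/s
    receivers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
    true_source = np.array([3.0, 7.0])

    def forward(p):
        """Forward problem: travel time from source p to each receiver."""
        return np.linalg.norm(receivers - p, axis=1) / c

    observed = forward(true_source)             # pretend these were measured

    p = np.array([5.0, 5.0])                    # initial guess
    for _ in range(500):
        d = np.linalg.norm(receivers - p, axis=1)
        residual = d / c - observed             # misfit in the data space
        # Gradient of 0.5 * sum(residual**2) w.r.t. p, via the chain rule:
        grad = ((residual / (c * d))[:, None] * (p - receivers)).sum(axis=0)
        p -= 2e4 * grad                         # step size tuned by hand

    print(p)   # converges toward true_source, here approx. [3. 7.]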
In many cases, unfortunately, the solution is not unique. The choice of emitter and receiver locations is crucial in the case you're interested in.
There's already a lot of literature on this topic; try "acoustic inverse problem" on Google Scholar.
So basically a kind of passive echolocation?
I like it. I think you'd need to be in known motion around the area to build up a picture -- I don't think it would work with a microphone just sitting in place.
Sort of!
If you close your eyes and hear footsteps to your right, you have a good idea of exactly what you're hearing -- you can probably infer whether it's a child or an adult, possibly whether the shoes are masculine or feminine, you can even infer formal vs. informal attire from the sound of the shoes (sneakers, sandals, and dress shoes all sound different), and you can locate their angle and distance pretty well. And that's with just your organic 2-microphone array.
I imagine multiple microphones and phase info could do a lot better at accurately placing the objects they hear.
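The classic way to exploit that phase information with a mic pair is GCC-PHAT. A rough sketch with synthetic signals (the array geometry and all numbers are invented for illustration):

    # GCC-PHAT: estimate the time difference of arrival between two mics,
    # which gives a bearing to the source.
    import numpy as np

    def gcc_phat(sig, ref, fs, max_tau):
        """Time delay of `sig` relative to `ref` via phase correlation."""
        n = len(sig) + len(ref)
        SIG = np.fft.rfft(sig, n=n)
        REF = np.fft.rfft(ref, n=n)
        cross = SIG * np.conj(REF)
        cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
        cc = np.fft.irfft(cross, n=n)
        max_shift = int(fs * max_tau)
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        return (np.argmax(np.abs(cc)) - max_shift) / fs

    # Hypothetical 2-mic setup, 0.2 m apart, 48 kHz, broadband source:
    fs, d, c = 48_000, 0.2, 343.0
    src = np.random.default_rng(0).standard_normal(fs)
    delay = 12                                  # source arrives 12 samples later at the left mic
    mic_right = src
    mic_left = np.concatenate([np.zeros(delay), src[:-delay]])

    tau = gcc_phat(mic_right, mic_left, fs, max_tau=d / c)
    # Bearing relative to broadside; sign convention depends on geometry.
    angle = np.degrees(np.arcsin(np.clip(tau * c / d, -1, 1)))
    print(tau, angle)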
It doesn't need to build an accurate picture of everything; it just needs to be good at imagining the stuff that actually matters, e.g. pedestrians and emergency vehicles. Where the model decides to place a few rustling leaves, or what color it imagines a person to be wearing, is less relevant than the fact that it decided there is likely a person in some general direction, even if they are not visible.
I just think diffusion models are relatively good at coming up with something plausible and explanatory for a given condition, when trained on some distribution of data.
Like "oh I hear this, that, and that -- what reality could explain those observations from the distribution of realities that I have seen?"
Sounds like passive radar moved to the acoustic domain. It's a neat thing, and there is some open-source work around it. However, it's also a good way to run afoul of ITAR; passive radar is still one of those secret-sauce, software-is-a-munition type things.
I have a passive radar. It's also a direction-finding radio. I didn't have to jump through any hoops.
On recent devices with an on-device NPU, this could be combined with RF imaging of nearby activity and structure via WiFi 7 Sensing Doppler radar.