Sort of!
If you close your eyes and hear footsteps to your right, you have a good idea of exactly what you're hearing -- you can probably infer whether it's a child or an adult, possibly whether the shoes are masculine or feminine, and even formal vs. informal attire from the sound alone (sneakers, sandals, and dress shoes all sound different), and you can locate their angle and distance pretty well. And that's with just your organic 2-microphone array.
I imagine multiple microphones and phase info could do a lot better at accurately placing the objects they hear.
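A toy sketch of the idea, even with just two mics: cross-correlate the channels to get the time difference of arrival (which is what inter-mic phase differences encode), then convert that to a bearing. The mic spacing, sample rate, and synthetic test signal here are all made-up illustrative values, not anything from a real system.

```python
import numpy as np

FS = 48_000              # sample rate in Hz (assumed)
MIC_SPACING = 0.2        # metres between the two mics (assumed)
SPEED_OF_SOUND = 343.0   # m/s at room temperature

def estimate_bearing(left: np.ndarray, right: np.ndarray) -> float:
    """Estimate source angle in degrees (0 = straight ahead, +90 = hard right)."""
    # Cross-correlate the channels; the lag of the peak is the TDOA in samples.
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)
    tdoa = lag / FS
    # Far-field approximation: path-length difference d*sin(theta) = c * tdoa.
    sin_theta = np.clip(SPEED_OF_SOUND * tdoa / MIC_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))

# Synthetic check: delay a noise burst by the lag a source at 30 degrees
# to the right would produce (it reaches the right mic first).
rng = np.random.default_rng(0)
sig = rng.standard_normal(4096)
true_angle = 30.0
delay = int(round(MIC_SPACING * np.sin(np.radians(true_angle)) / SPEED_OF_SOUND * FS))
right = np.concatenate([sig, np.zeros(delay)])
left = np.concatenate([np.zeros(delay), sig])   # left channel lags behind
print(round(estimate_bearing(left, right), 1))
```

With more than two mics you get multiple pairwise TDOAs, which over-determines the direction and starts to give you range as well -- that's essentially what beamforming arrays do.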
It doesn't need to build an accurate picture of everything; it just needs to be good at imagining the stuff that actually matters, e.g. pedestrians and emergency vehicles. Where the model decides to place a few rustling leaves, or what color it imagines a person to be wearing, is less relevant than the fact that it decided there is likely a person in some general direction, even if they are not visible.
I just think diffusion models are relatively good at coming up with something explainable and plausible for a given condition, when trained on some distribution of data.
Like "oh I hear this, that, and that -- what reality could explain those observations from the distribution of realities that I have seen?"
Sounds like passive radar moved to the acoustic domain. It's a neat thing and there is some open-source work around it. However, it's also a good way to run afoul of ITAR; passive radar is still one of those secret-sauce, software-is-a-munition type things.
I have a passive radar. It's also a direction-finding radio. I didn't have to jump through any hoops.