In general, the positions of the microphones in space must be known precisely for the phase-shifting math to work well, and the clocks on the phones would need to be in sync at high precision, like 10x the highest frequency sound you're picking up. In other words within 10s of thousands of a second. Also, if the mic locations in the array aren't a simple straight line, circle, or other simple geometry, the computer code (i.e. the math) to milk out an improved signal becomes very difficult.
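For anyone wondering what the "phase shifting math" looks like in practice, here is a minimal delay-and-sum sketch. It is only an illustration under assumptions that are not from the comment above: a single point source at a known position, whole-sample delays, a speed of sound of 343 m/s, and made-up helper names (`delays_in_samples`, `delay_and_sum`). It mainly shows why both the mic positions and a shared clock matter: the delays come straight from the geometry, and any clock offset between devices adds directly to them.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for room-temperature air


def delays_in_samples(mic_positions, source_position, sample_rate):
    """Arrival delay at each mic relative to the earliest mic, in whole samples."""
    times = [math.dist(m, source_position) / SPEED_OF_SOUND for m in mic_positions]
    earliest = min(times)
    return [round((t - earliest) * sample_rate) for t in times]


def delay_and_sum(signals, delays):
    """Skip each recording's delay so the channels line up, then average them."""
    length = min(len(s) - d for s, d in zip(signals, delays))
    return [
        sum(s[d + i] for s, d in zip(signals, delays)) / len(signals)
        for i in range(length)
    ]


# Two mics 1 m apart, source 1 m to the left of the first mic, 48 kHz sampling.
example_delays = delays_in_samples([(0.0, 0.0), (1.0, 0.0)], (-1.0, 0.0), 48000)
```

In this toy setup the second mic hears the source about 140 samples after the first, so an unknown clock offset of even a couple of milliseconds between the two phones (roughly 100 samples at 48 kHz) would be about as large as the delay being corrected for.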
> 10s of thousands of a second
10ms? That's a very long time. Phone clocks are much more accurate than that because they're synced to the atomic clocks in cell towers and GPS satellites.
Hell even NTP can do 1ms over the internet. AFAIK the only modern devices with >10ms inaccurate clocks by default are Windows desktops. I complained about that before because it screwed up my one-way latency measurements: https://github.com/microsoft/WSL/issues/6310
I solved that problem by RTFM and toggling some settings until I got the same accuracy as Linux: https://learn.microsoft.com/en-us/windows-server/networking/...
Anyway, I dunno why the math would be too complicated. GPUs are great at this kind of signal processing.
What I meant by that millisecond order of magnitude was that the clocks on the phones would need to be synchronized with each other to high precision, which would require pre-planning and special effort.
In 10ms sound can travel about 3.4 meters (at roughly 343 m/s in air), which is on the order of magnitude of a room, and represents the range of time offsets we're talking about. This has nothing to do with the actual frequencies of the sound itself, or the rate of PCM-type sampling you need to record quality sound. That's a different issue, and doesn't have to do with synchronization of different devices.
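The arithmetic behind that is just distance = speed of sound × clock offset. A quick sketch, assuming 343 m/s for the speed of sound (the function name `offset_to_distance` is only for illustration):

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for room-temperature air


def offset_to_distance(clock_offset_s):
    """Distance sound travels during a given clock offset between two devices."""
    return SPEED_OF_SOUND * clock_offset_s


# Print the equivalent position error for a few clock offsets.
for offset in (10e-3, 1e-3, 100e-6):
    print(f"{offset * 1e3:g} ms offset ~ {offset_to_distance(offset):.3f} m of travel")
```

So a 10 ms disagreement between two phones looks like the source moved by about 3.4 m, while 100 µs looks like a few centimeters.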
Regarding the math: A circular array is better than a grid (or random placement) because a single formula is used to compare any mic to any other mic. With a grid array the number of unique formulas involved goes up as the square of the size of the array. And the mics at the 'center' of a grid are basically worthless, and offer no added value.
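One rough way to put numbers on this is to count how many geometrically distinct mic pairs each layout has. The sketch below is just one possible reading of "unique formulas", not an established metric: on an evenly spaced ring, pairs are treated as the same if they match up to rotation (so only the chord length matters), and on a square grid, pairs are the same if they share an offset `(dx, dy)` up to sign. The function names are made up for the example.

```python
import math


def circle_pair_classes(num_mics, radius=1.0):
    """Distinct mic-to-mic chord lengths on an evenly spaced ring of mics."""
    spacings = set()
    for k in range(1, num_mics):
        chord = 2 * radius * math.sin(math.pi * k / num_mics)
        spacings.add(round(chord, 9))  # rounding merges k and num_mics - k
    return len(spacings)


def grid_pair_classes(side):
    """Distinct mic-to-mic offsets on a side x side grid, with v and -v equal."""
    points = [(x, y) for x in range(side) for y in range(side)]
    offsets = set()
    for i, (x1, y1) in enumerate(points):
        for x2, y2 in points[i + 1:]:
            dx, dy = x2 - x1, y2 - y1
            if dx < 0 or (dx == 0 and dy < 0):
                dx, dy = -dx, -dy  # normalize so opposite offsets collapse
            offsets.add((dx, dy))
    return len(offsets)


ring_16 = circle_pair_classes(16)  # 16 mics on a ring
grid_16 = grid_pair_classes(4)     # 16 mics on a 4 x 4 grid
```

With 16 mics this gives 8 distinct spacings for the ring versus 24 distinct offsets for the 4×4 grid. Under this way of counting, the grid figure works out to `((2 * side - 1) ** 2 - 1) / 2`, so it grows with the square of the grid's side length.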