You can watch network traffic for data leaving the device. Trust but verify.
For something as compressible as voice, I do not know how you would feel confident that data was not slipping through. Edge transcription models (e.g. Whisper) keep getting better, so malware could transcribe on the device and send just a single bit when a user says a trigger word.
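To put rough numbers on that, here is a back-of-the-envelope sketch. The 16 kHz, 16-bit mono audio format, the 150 words per minute speaking rate, and the 6 bytes per word are all assumptions picked for illustration, not measurements.

```python
# Rough size comparison for one minute of speech.
# Every constant below is an illustrative assumption, not a measurement.
SAMPLE_RATE_HZ = 16_000    # assumed capture rate
BYTES_PER_SAMPLE = 2       # assumed 16-bit mono PCM
SECONDS = 60
WORDS_PER_MINUTE = 150     # assumed conversational speaking rate
BYTES_PER_WORD = 6         # assumed average word plus a space, plain text

raw_audio_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * SECONDS
transcript_bytes = WORDS_PER_MINUTE * BYTES_PER_WORD
trigger_flag_bits = 1      # "was the trigger word said?" needs one bit


def shrink_factor() -> float:
    """How many times smaller the transcript is than the raw audio."""
    return raw_audio_bytes / transcript_bytes


if __name__ == "__main__":
    print(f"raw audio:    {raw_audio_bytes} bytes")
    print(f"transcript:   {transcript_bytes} bytes")
    print(f"trigger flag: {trigger_flag_bits} bit")
    print(f"transcript is ~{shrink_factor():.0f}x smaller than raw audio")
```

Under those assumptions the signal worth hiding shrinks from megabytes to a single bit, which is the part that makes it so hard to spot in ordinary traffic.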
Good luck auditing even just a single day of moderately active web browsing.
It's easier than reading all of the code in Ubuntu.
But still entirely impossible. So does it matter?
Network traffic monitoring is routinely done at enterprises. It's usually part-automated using the typical approaches (rules and AI), and part-manual (via a dedicated SOC team).
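As a minimal sketch of what the rule-based half might look like: the `Flow` record, the allowlists, the hostnames, and the byte threshold below are all hypothetical and not taken from any real product. Real deployments work on much richer data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One summarized outbound connection (hypothetical record layout)."""
    src_host: str
    dst_host: str
    dst_port: int
    bytes_out: int


# Hypothetical policy values, chosen only for illustration.
ALLOWED_DESTINATIONS = {"updates.example.com", "mail.example.com"}
ALLOWED_PORTS = {53, 80, 443}
MAX_BYTES_OUT = 50_000_000


def flag_flow(flow: Flow) -> list[str]:
    """Return the reasons a flow looks suspicious (empty list if none)."""
    reasons = []
    if flow.dst_host not in ALLOWED_DESTINATIONS:
        reasons.append("unknown destination")
    if flow.dst_port not in ALLOWED_PORTS:
        reasons.append("unusual port")
    if flow.bytes_out > MAX_BYTES_OUT:
        reasons.append("large upload")
    return reasons


def triage(flows: list[Flow]) -> list[tuple[Flow, list[str]]]:
    """Keep only flagged flows, paired with their reasons, for manual review."""
    return [(f, r) for f in flows if (r := flag_flow(f))]


if __name__ == "__main__":
    flows = [
        Flow("laptop-1", "updates.example.com", 443, 12_000),
        Flow("laptop-2", "unknown.example.net", 8443, 90_000_000),
    ]
    for flow, reasons in triage(flows):
        print(flow.src_host, "->", flow.dst_host, reasons)
```

Anything `triage` returns would go to the humans. Note that a tiny payload sent to an allowed destination passes every rule here, which is why rules alone are not the whole story.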
There are actual compromises caught this way too; it's not (entirely) just for show. A high-profile example would be Kaspersky catching a sophisticated data exfiltration campaign at their own headquarters: https://www.youtube.com/watch?v=1f6YyH62jFE
So it is definitely possible, just maybe not how you imagine it being done.
I do believe that it sometimes works, but it's effectively like missile defense: immensely more expensive for the defender than for the attacker.
If the attacker has little to lose (e.g. because they're anonymous, or doing this at scale against many unsuspecting users), they are almost certain to succeed eventually.
All cyberdefenses I'm aware of are asymmetric in nature like that, unfortunately.
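The "eventually succeeding" part can be made concrete with a small sketch. It assumes every attempt is independent and gets past the defender with the same fixed probability, which is a simplification; the 1% figure is made up for illustration.

```python
def chance_of_at_least_one_success(p_per_attempt: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds.

    Assumes each try succeeds with the same probability `p_per_attempt`.
    """
    if not 0.0 <= p_per_attempt <= 1.0:
        raise ValueError("p_per_attempt must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # Complement of "every single attempt fails".
    return 1.0 - (1.0 - p_per_attempt) ** attempts


if __name__ == "__main__":
    # Hypothetical: the defender stops 99% of attempts.
    for n in (1, 10, 100, 1000):
        print(n, round(chance_of_at_least_one_success(0.01, n), 4))
```

Even with a defender who stops 99% of attempts, the attacker's odds climb toward certainty as the number of cheap attempts grows, while the defender pays for every one of them.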