Can I use the same code to do the opposite, i.e. remove the background music and then add my own audio as a background? Also, does it matter who is speaking? Does it support new speakers joining in suddenly?
Hi, thanks for the question!
Unfortunately, we don't yet support user-provided audio as replacement content. Of course, once the original audio has been filtered, you could add your own audio track over the processed video.
We may try to add this functionality in the future, i.e. overlaying any user-provided audio track on the source video. It's an interesting idea!
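As a workaround in the meantime, the "add a track over the processed video" step can be done with ffmpeg's standard stream-mapping options. Here's a minimal sketch that builds such a command; the function name and all file paths are placeholders of my own, not part of Fast Music Remover:

```python
def overlay_audio_cmd(video_path: str, audio_path: str, out_path: str) -> list[str]:
    """Build an ffmpeg command that keeps the video stream from `video_path`
    and replaces its audio with the track at `audio_path`."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,   # input 0: the processed (music-removed) video
        "-i", audio_path,   # input 1: the user-provided background track
        "-map", "0:v",      # keep the video stream from input 0
        "-map", "1:a",      # take the audio stream from input 1
        "-c:v", "copy",     # copy video without re-encoding
        "-shortest",        # stop at the shorter of the two inputs
        out_path,
    ]

# Pass the result to subprocess.run(...) once ffmpeg is installed, e.g.:
# subprocess.run(overlay_audio_cmd("processed.mp4", "music.mp3", "out.mp4"), check=True)
```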
For your second question: dialogues and multi-speaker videos work pretty well. Shouldn't be an issue.
Maybe I can try implementing it. Do you know of any open-source models/frameworks for replacing background music? Can audio be logically represented as layers like that (foreground/background)?
Please go ahead! I'd love to see where it goes and would be willing to help out. I've already opened an issue to track "adding support for user-provided audio" as we discussed, see: https://github.com/omeryusufyagci/fast-music-remover/issues/...
As for your question: it depends on the approach. Fast Music Remover currently uses DeepFilterNet, which takes a deep-learning filtering approach and doesn't model audio as logical layers, which is why it's rather fast. For that sort of requirement you'd typically want a source-separation model like `demucs` (https://github.com/facebookresearch/demucs), which can isolate individual audio components. That comes at a great performance cost, though.
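To illustrate the "layers" idea: once a separation model like demucs has split a recording into stems, replacing the background music is conceptually just re-mixing the layers. Here's a toy pure-Python sketch of additive mixing (the function and sample values are mine, not from either library; real pipelines would operate on sample arrays from an audio I/O library):

```python
def mix_layers(foreground, background, bg_gain=0.5):
    """Additively mix two equal-length mono sample streams (floats in [-1, 1]),
    scaling the background layer and clipping the sum to the valid range."""
    if len(foreground) != len(background):
        raise ValueError("layers must be the same length")
    return [max(-1.0, min(1.0, f + bg_gain * b))
            for f, b in zip(foreground, background)]

# Toy example: a speech layer plus a user-provided music layer at half volume.
speech = [0.2, -0.1, 0.9, 0.0]
music  = [0.4,  0.4, 0.4, 0.4]
mixed = mix_layers(speech, music)  # roughly [0.4, 0.1, 1.0, 0.2], clipped at 1.0
```

The hard part is of course the separation itself, not the mixing; that's exactly what a stem-based model like demucs buys you over a pure filtering model.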
However, my vision for the core of FMR is to support multiple ML models and provide the optimal solution for each user's needs, without them having to worry about details like which model to pick. So this is definitely something I'd be interested in following!