I was wondering the same thing, but I guess the key is that you don't actually have to get it right every time. Just tokenizing based on a few common characteristics (brace pairing, quotes, indentation, newlines, etc.) should let you trim a ton without knowing anything about the language, I imagine?
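Something like this rough sketch is what I have in mind (chunk_by_balance and the heuristics are just made up for illustration, and it only handles brace pairing and quotes, not indentation):

    # Language-agnostic chunking: group lines into units that are balanced
    # with respect to brace/bracket/paren pairing and simple quotes, so a
    # reducer can try dropping whole units without parsing the language.
    OPEN = {"(": ")", "[": "]", "{": "}"}
    CLOSE = {v: k for k, v in OPEN.items()}

    def chunk_by_balance(source):
        chunk, stack, quote = [], [], None
        for line in source.splitlines(keepends=True):
            i = 0
            while i < len(line):
                ch = line[i]
                if quote:
                    if ch == "\\":
                        i += 1          # skip escaped character
                    elif ch == quote:
                        quote = None    # string closed
                elif ch in ("'", '"'):
                    quote = ch          # string opened
                elif ch in OPEN:
                    stack.append(ch)
                elif ch in CLOSE and stack and stack[-1] == CLOSE[ch]:
                    stack.pop()
                i += 1
            chunk.append(line)
            # Emit a chunk once all pairs are closed and we're outside quotes.
            if not stack and quote is None:
                yield "".join(chunk)
                chunk = []
        if chunk:
            yield "".join(chunk)        # trailing unbalanced remainder

A reducer would then try deleting each chunk in turn and re-check whether the interesting behaviour still reproduces.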
My real worry is that this could end up running dangerous code. Like, what if you have a disabled line that writes instead of reads, and it randomly gets reactivated?
It's intended for producing compiler test cases, so the code shouldn't actually be run at all.
CPython includes a way to only run parsing/compilation to bytecode (the py_compile module, for example). You can also use it the way they did here and actually run the code, but then it really depends on how much you trust every possible subset of your code.
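For reference, a minimal sketch of that compile-only check using the compile() builtin (still_compiles is just a name I made up):

    # Check that a reduction candidate still parses/compiles to bytecode
    # without ever executing it: compile() builds a code object but never runs it.
    def still_compiles(source, filename="<candidate>"):
        try:
            compile(source, filename, "exec")
            return True
        except (SyntaxError, ValueError):
            # ValueError covers e.g. null bytes in the source on some versions.
            return False

    print(still_compiles("x = 1\n"))           # True
    print(still_compiles("def broken(:\n"))    # False

python -m py_compile file.py does the same check from the command line if you'd rather keep the reducer out of process.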