I read this paper and I still feel lost as to how this can even be possible. It seems to understand how to tokenize, merge lines, remove tokens, etc. for arbitrary programming languages. Is there another paper that explains this algorithm alone?
I was wondering the same thing, but I guess the key is that you don't actually have to do it correctly every time. Just tokenizing based on a few common characteristics (brace pairing, quotes, indentation, newlines, etc.) should let you trim a ton without knowing anything about the language, I imagine?
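Something like this, maybe. A minimal sketch (Python, all names are mine, not from the paper) of the kind of crude, language-agnostic tokenizer I have in mind:

```python
import re

# Crude, language-agnostic tokenizer: it only knows about a few
# near-universal constructs (quoted strings, bracket characters,
# identifier-ish words, whitespace). That's enough to delete or merge
# chunks while keeping strings intact and brackets pairable, without
# understanding the language at all.
TOKEN_RE = re.compile(r"""
      "(?:\\.|[^"\\])*"          # double-quoted string
    | '(?:\\.|[^'\\])*'          # single-quoted string
    | [{}()\[\]]                 # brackets worth keeping paired
    | [A-Za-z_][A-Za-z_0-9]*     # identifier-ish word
    | \s+                        # whitespace (preserves newlines/indentation)
    | .                          # anything else, one character at a time
""", re.VERBOSE | re.DOTALL)

def tokenize(src: str) -> list[str]:
    return TOKEN_RE.findall(src)

# "".join(tokenize(src)) == src, so you can drop tokens and re-join.
```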
My real worry is what happens if this ends up running dangerous code. Like, what if you have a disabled line that writes instead of reading, and it randomly gets reactivated?
It's intended for producing compiler test cases, so there shouldn't be any code that's actually run.
CPython includes a flag to only run parsing/compilation to bytecode. While you can use it like they did here and actually run the code, it really depends on how much you trust every possible subset of your code.
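For Python specifically, the interestingness test can stick to compilation only. A rough sketch of that (my own code, not theirs) using `py_compile`, assuming the bug you're hunting shows up at parse/compile time:

```python
import sys
import py_compile

# Reduction "interestingness test" that never executes the candidate file:
# py_compile only parses and compiles it to bytecode.
# Assumption: the failure we care about shows up at compile time
# (e.g. a parser/compiler crash); runtime-only behaviour is invisible here.
def still_interesting(path: str) -> bool:
    try:
        py_compile.compile(path, doraise=True)
    except py_compile.PyCompileError:
        return True      # the compile-time error still reproduces
    return False         # compiled cleanly, so this reduction went too far

if __name__ == "__main__":
    # Exit 0 = "still interesting", the convention reducers usually expect.
    sys.exit(0 if still_interesting(sys.argv[1]) else 1)
```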
It basically runs transformation passes over the code until they stop changing it, throwing away any change that breaks the behaviour you're trying to preserve.
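In Python, my rough mental model of that outer loop looks like this (a sketch, not the actual implementation):

```python
# Apply every pass, keep a candidate only if the interestingness test
# still succeeds, and stop once a full round of passes changes nothing.
def reduce(source, passes, still_interesting):
    changed = True
    while changed:
        changed = False
        for run_pass in passes:
            # Re-run a pass from scratch after every accepted change.
            progressing = True
            while progressing:
                progressing = False
                for candidate in run_pass(source):
                    if candidate != source and still_interesting(candidate):
                        source = candidate
                        changed = progressing = True
                        break
    return source
```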
It seems like a small number of the passes are not specific to C:
https://github.com/csmith-project/creduce/blob/master/creduc...
`"C" => 1,` means it is only for C.
I would guess `pass_lines` is the most important for non-C code; as far as I can tell (it's written in unreadable Perl), it just removes lines.
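If that guess is right, the pass would be something like this (a Python sketch of my reading, not the actual Perl), and it would plug straight into a reduction loop like the one sketched upthread:

```python
# Guess at what a pass_lines-style pass does: try deleting chunks of lines,
# starting with big chunks and halving the chunk size, yielding each
# candidate for the interestingness test to accept or reject.
def pass_lines(source: str):
    lines = source.splitlines(keepends=True)
    chunk = len(lines)
    while chunk >= 1:
        for start in range(0, len(lines), chunk):
            yield "".join(lines[:start] + lines[start + chunk:])
        chunk //= 2
```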
So while it can work for languages other than C, most of its features are C-specific, so it's not going to work nearly as well. Still, I'd never heard of C-Reduce before; pretty cool tool!
Someone make one based on Tree Sitter, quick!
Unreadable Perl?! I clicked on the link expecting something super-terse and unstructured, but it's anything but. What the hell is wrong with people casually dropping remarks like that?