Maybe I am not understanding the paper correctly, but it seems the "state of the art models" they tested are almost entirely open-source models under 27B parameters, mostly 8B and 3B. This is kind of like giving algebra problems to 7-year-olds to "test human algebra ability."
If the authors are holding up a 3B-parameter model as evidence that "LLMs can't reason," I'm not sure whether they are confused or out of touch.
I mean, they do test GPT-4o and o1-preview, but those models' performance is notably absent from the paper's conclusion.
It’s difficult to test OpenAI models reproducibly, since they can change out from under you and you don’t have control over every hyperparameter.
It would’ve been nice to see one of the larger Llama models, though.
The results are there; they're just hidden away in the appendix. Those models don't actually suffer drops on 4 of the 5 modified benchmarks. The one benchmark that does show real drops, ones not explained by margin of error, is the benchmark that adds "seemingly relevant but ultimately irrelevant information to problems."
Those results are absent from the conclusion because the conclusion falls apart otherwise.
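For anyone who wants to check the "explained by margin of error" part themselves, here is a rough sketch of one way to do it. It is not the paper's method. It assumes each accuracy is a simple proportion over n problems and treats the two runs as independent, which overstates the noise if the same problems are reused in both. The function name and the numbers in the usage lines are made up for illustration and are not taken from the paper.

```python
import math

def drop_exceeds_margin(acc_base, acc_variant, n, z=1.96):
    """Compare an accuracy drop against an approximate 95% margin of error.

    acc_base, acc_variant: accuracies in [0, 1], each measured on n problems.
    Assumes independent binomial samples (a simplification).
    Returns (drop, margin, exceeds).
    """
    # Standard error of the difference between two independent proportions
    se = math.sqrt(
        acc_base * (1 - acc_base) / n
        + acc_variant * (1 - acc_variant) / n
    )
    drop = acc_base - acc_variant
    margin = z * se
    return drop, margin, drop > margin

# Hypothetical numbers, for illustration only
small = drop_exceeds_margin(0.95, 0.93, n=100)
large = drop_exceeds_margin(0.95, 0.75, n=100)
print(small)
print(large)
```

With 100 problems per benchmark, a 2-point drop sits well inside the margin while a 20-point drop does not, which is the distinction being argued about here.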