Also probably in the training data: https://www.quora.com/What-is-the-least-odd-prime-factor-of-...
It's a public AIME problem from 2019.
People have to realize that many problems that are hard for humans are in a dataset somewhere.
This matters in two ways: 1) Don't bother testing reasoning with an example you pulled from a public dataset. 2) Search for the problem you think is novel and see if you get an already-answered match in seconds, instead of waiting up to minutes for an LLM to attempt to reproduce it.
There is an in-between measure of usefulness: take a problem you know is in the dataset, change its values to ones not in the dataset, and measure how often the model correctly adapts its response to the new values. This is less a test of reasoning strength and more a test of whether a given model is more useful than searching its dataset.
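
A rough sketch of that in-between test, in Python. Everything here is an assumption for illustration: the problem template ("least odd prime factor of b^4 + 1") is a stand-in rather than the exact AIME wording, `ask_model` is a hypothetical callable you would wire to whatever LLM you are testing, and taking the last integer in the reply as the model's answer is a crude heuristic.

```python
import random
import re


def least_odd_prime_factor(n):
    """Smallest odd prime dividing n, or None if n has no odd prime factor."""
    while n % 2 == 0:          # strip factors of 2 first
        n //= 2
    d = 3
    while d * d <= n:          # plain trial division by odd candidates
        if n % d == 0:
            return d
        d += 2
    return n if n > 1 else None


def make_variants(count, seed=0, exclude=()):
    """Build (prompt, expected_answer) pairs with values swapped in.

    `exclude` lists bases you believe appear in public datasets.
    The b^4 + 1 template is a hypothetical example, kept small so
    trial division stays fast.
    """
    rng = random.Random(seed)
    candidates = [b for b in range(2, 200) if b not in exclude]
    bases = rng.sample(candidates, count)
    return [
        (f"What is the least odd prime factor of {b}^4 + 1?",
         least_odd_prime_factor(b**4 + 1))
        for b in bases
    ]


def adaptation_rate(ask_model, variants):
    """Fraction of variants where the model's final number matches.

    `ask_model` is a hypothetical function: prompt string in, reply string out.
    """
    correct = 0
    for prompt, expected in variants:
        numbers = re.findall(r"\d+", ask_model(prompt))
        # Heuristic: treat the last integer in the reply as the answer.
        if numbers and int(numbers[-1]) == expected:
            correct += 1
    return correct / len(variants)
```

You would call something like `adaptation_rate(my_llm, make_variants(20))` and compare the rate against how the same model does on the unmodified public problem. A model that keeps returning the original problem's answer for the modified values is probably reciting, not adapting.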