As noted elsewhere, all the other frontier models failed miserably at this.
It is unsurprising that some lossily-compressed-database search programs might be worse for some tasks than other lossily-compressed-database search programs.
That doesn't mean the one that manages to spit it out of its latent space is close to AGI. I also wonder how consistently that specific model could do it. If you tried 10 LLMs, maybe all 10 of them could have spit out the answer 1 time out of 10. One LLM retrieving the problem correctly while the others fail isn't a great argument for near-AGI. But LLMs will be useful in limited domains for a long time.