Research by METR suggests that the length of software tasks frontier LLMs can complete, measured by how long they take human engineers, is growing exponentially, doubling roughly every 7 months. o3 sits above the trend line.
https://x.com/METR_Evals/status/1912594122176958939
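As a minimal sketch of what that growth curve implies (the ~7-month doubling is from the METR thread; the 30-minute starting horizon is a made-up baseline for illustration):

    # Exponential time-horizon growth with a ~7-month doubling period.
    # The 30-minute baseline is hypothetical, just to show the shape.
    def horizon_minutes(months_elapsed, baseline_minutes=30, doubling_months=7):
        return baseline_minutes * 2 ** (months_elapsed / doubling_months)

    for m in (0, 7, 14, 21, 28):
        print(f"after {m:2d} months: ~{horizon_minutes(m):.0f} min")
    # after 0 months: ~30 min; after 28 months: ~480 min

So whatever the starting point, the claim is that the manageable task length quadruples every 14 months or so.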
---
The AlexNet paper, which kickstarted the deep learning era in 2012, was ahead of the 2nd-best ImageNet entry by about 11 percentage points of top-5 error. Many published AI papers at the time advanced SOTA by just a couple of percentage points.
o3 high is about 9 points ahead of o1 high on livebench.ai, and there are also quite a few testimonials about the difference between them.
Yes, AlexNet made major strides in other respects as well, but it's been just 7 months since o1-preview, the first publicly available reasoning model and itself a seminal advance beyond previous LLMs.
It seems some people have become desensitized to how rapidly things are moving in AI, despite the largely unprecedented pace of progress.
Ref:
- https://proceedings.neurips.cc/paper_files/paper/2012/file/c...
AlexNet cut the ImageNet top-5 error rate from ~26% to ~15%, a relative reduction of 100*11/26 ≈ 42%.
From o1 high to o3 high, the livebench error rate went from 28 to 19, a relative reduction of 100*9/28 ≈ 32%.
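For anyone who wants to check the arithmetic, here it is spelled out (the AlexNet figures are the ILSVRC-2012 top-5 error rates; the o1/o3 figures are the livebench error rates above):

    # Percent of the old error rate eliminated by the new result.
    def relative_reduction(old_error, new_error):
        return 100 * (old_error - new_error) / old_error

    print(relative_reduction(26.2, 15.3))  # AlexNet vs. 2012 runner-up: ~42%
    print(relative_reduction(28, 19))      # o1 high vs. o3 high: ~32%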
But these are largely meaningless comparisons, because it's typically harder to improve on already-good results.