Interesting that GPT-4.5 seems significantly better than 4o. I dimly remember the feedback being that it wasn't such a big leap in performance, though of course the usual problem-solving benchmarks might not correlate with what was asked here. It seems to have gotten better at human-like speech, at the very least, which I think was also part of the feedback when 4.5 was released.
I still believe that larger models are better at covering the long tail. Our benchmarks are saturated, but actual model capability is not.