Interesting link. Worth noting that the pull requests were judged by o3-mini. Further, I'm not sure that 55% vs 45% is a huge difference.
Good point. They said they validated the results by testing with other models (including Claude), as well as with manual sanity checks.
55% to 45% definitely isn't a blowout, but it is meaningful: in Elo terms it equates to about a 35-point difference. So not in a different league, but definitely a clear edge.
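For anyone who wants to check the conversion, here is a quick sketch using the standard logistic Elo model, treating the 55% preference rate as an expected score. The function name `elo_gap` is just made up for illustration, and ties are ignored.

```python
import math

def elo_gap(win_rate: float) -> float:
    """Rating difference implied by an expected score under the standard Elo model.

    Inverts E = 1 / (1 + 10 ** (-d / 400)) to get d = 400 * log10(E / (1 - E)).
    """
    return 400 * math.log10(win_rate / (1 - win_rate))

# 55% vs 45% head-to-head
print(round(elo_gap(0.55), 1))
```

That comes out near 35 points, which is a small but real gap on the usual Elo scale.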
Maybe not as much to us, but for people building these tools, 4.1 being significantly cheaper than Claude 3.7 is a huge difference.
I first read it as 55% better, which sounds significantly higher than the ~22% they report here. Sounds misleading.
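The two numbers can be reconciled with a bit of arithmetic, assuming the ~22% is the relative gap between the two preference rates (that reading is my assumption, not something the report spells out). The helper name `relative_gap` is hypothetical.

```python
def relative_gap(preferred: float, other: float) -> float:
    """How much larger `preferred` is than `other`, as a fraction of `other`."""
    return preferred / other - 1

# 55% preferred vs 45% preferred
gap = relative_gap(0.55, 0.45)
print(f"{gap:.1%}")
```

So "preferred 55% of the time" and "about 22% more often" describe the same result; neither means "55% better".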