arvindh-manian 5 days ago

Interesting link. Worth noting that the pull requests were judged by o3-mini. Further, I'm not sure that 55% vs 45% is a huge difference.

marsh_mellow 5 days ago

Good point. They said they validated the results by testing with other models (including Claude), as well as with manual sanity checks.

55% to 45% definitely isn't a blowout, but it is meaningful: in terms of Elo it equates to about a 35-point difference. So not in a different league, but definitely a clear edge.
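
For anyone who wants to check the Elo conversion, here is a minimal sketch. It assumes the standard logistic Elo model (400-point scale, base 10) and treats the 55% share as an expected win rate; the function names are mine, not from the linked post.

```python
import math


def elo_gap(win_rate: float) -> float:
    """Elo difference implied by an expected win rate (standard logistic model)."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    # Invert E = 1 / (1 + 10 ** (-gap / 400)) to solve for gap.
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


def expected_score(gap: float) -> float:
    """Expected win rate for the side that is `gap` Elo points ahead."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))


gap = elo_gap(0.55)
print(f"{gap:.1f}")  # 34.9
```

Under those assumptions a 55/45 split works out to a gap in the mid-30s, and `expected_score(gap)` maps it back to 0.55.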

servercobra 4 days ago

Maybe not as much to us, but for people building these tools, 4.1 being significantly cheaper than Claude 3.7 is a huge difference.

elAhmo 5 days ago

I first read it as 55% better, which sounds significantly higher than the ~22% they report here. Sounds misleading.
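
One way to reconcile the two numbers, assuming the ~22% is the relative gap between a 55% and a 45% share (the linked post may define it differently), is the quick arithmetic below. The function name is just for illustration.

```python
def relative_gain(share_a: float, share_b: float) -> float:
    """How much larger share_a is than share_b, as a fraction of share_b."""
    if share_b <= 0.0:
        raise ValueError("share_b must be positive")
    return share_a / share_b - 1.0


# Assumption: 0.55 and 0.45 are the two models' shares of pairwise wins.
print(f"{relative_gain(0.55, 0.45):.1%}")  # 22.2%
```

So "55%" would be a share of wins, while "~22%" would be how much larger that share is than the other model's 45%.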