InkCanon 5 days ago

>4.1 Was better in 55% of cases

Um, isn't that just a fancy way of saying it is slightly better?

>Score of 6.81 against 6.66

So very slightly better

wiz21c 5 days ago

"they found that GPT‑4.1 excels at both precision..."

They didn't say it is better than Claude at precision etc. Just that it excels.

Unfortunately, AI has still not concluded that manipulation by the marketing dept is a plague...

kevmo314 5 days ago

A great way to upsell 2% better! I should start doing that.

neuroelectron 5 days ago

Good marketing if you're selling a discount all-purpose cleaner, not so much for an API.

marsh_mellow 5 days ago

I don't think the absolute score means much; judge models have a tendency to score around 7/10 lol

55% vs. 45% equates to about a 35-point difference in Elo. In chess that would be two players in the same league but one with a clear edge
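For anyone who wants to check the conversion: a minimal sketch, assuming the standard logistic Elo model (expected score = 1 / (1 + 10^(-gap/400))) and treating the 55% preference rate as an expected score. The function names `elo_gap` and `expected_score` are made up for illustration, and nothing in the announcement says the comparison was scored this way.

```python
import math


def elo_gap(win_rate: float) -> float:
    """Rating gap implied by an expected score, under the logistic Elo model."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    # Invert E = 1 / (1 + 10 ** (-gap / 400)) to solve for gap.
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


def expected_score(gap: float) -> float:
    """Expected score of the higher-rated side for a given rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))


if __name__ == "__main__":
    gap = elo_gap(0.55)
    print(f"55% vs. 45% -> {gap:.1f} Elo points")          # 34.9
    print(f"round trip  -> {expected_score(gap):.2f}")     # 0.55
```

So a 55/45 split works out to roughly 35 points, which is small next to the 200-point bands usually used to separate chess classes.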

kevmo314 5 days ago

Rarely are two models put head-to-head though. If Claude Sonnet 3.7 isn't able to generate a good PR review (for whatever reason), a 2% better review isn't all that strong of a value proposition.

swyx 5 days ago

the point is OpenAI is saying they have a viable Claude Sonnet competitor now