Item 43684252

InkCanon • 5 days ago

>4.1 Was better in 55% of cases

Um, isn't that just a fancy way of saying it is slightly better?

>Score of 6.81 against 6.66

So very slightly better

wiz21c • 5 days ago

"they found that GPT‑4.1 excels at both precision..."

They didn't say it is better than Claude at precision etc. Just that it excels.

Unfortunately, AI has still not concluded that manipulations by the marketing dept are a plague...

kevmo314 • 5 days ago

A great way to upsell 2% better! I should start doing that.

neuroelectron • 5 days ago

Good marketing if you're selling a discount all-purpose cleaner, not so much for an API.

marsh_mellow • 5 days ago

I don't think the absolute score means much — judge models have a tendency to score around 7/10 lol

55% vs. 45% equates to about a 35 point difference in Elo. In chess that would be two players in the same league but one with a clear edge
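
For anyone who wants to check the arithmetic, here is a rough sketch assuming the standard logistic Elo model, where a 400 point gap means 10:1 odds. The function names `elo_gap` and `expected_score` are mine, not from the article, and treating "better in 55% of cases" as an Elo expected score is itself an assumption.

```python
import math


def elo_gap(win_rate: float) -> float:
    """Rating gap implied by an expected score, under the standard Elo formula.

    Assumes E = 1 / (1 + 10 ** (-gap / 400)), solved for gap.
    """
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


def expected_score(gap: float) -> float:
    """Expected score of the stronger side for a given rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))


if __name__ == "__main__":
    gap = elo_gap(0.55)
    print(f"55% vs 45% implies a gap of about {gap:.1f} Elo points")
    print(f"round trip: a {gap:.1f} point gap gives {expected_score(gap):.2%}")
```

Under those assumptions the gap comes out a little under 35 points, so a small but real edge.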

kevmo314 • 5 days ago

Rarely are two models put head-to-head though. If Claude Sonnet 3.7 isn't able to generate a good PR review (for whatever reason), a 2% better review isn't all that strong of a value proposition.

swyx • 5 days ago

the point is oai is saying they have a viable Claude Sonnet competitor now