marsh_mellow 5 days ago

I don't think the absolute score means much — judge models have a tendency to score around 7/10 lol

55% vs. 45% equates to about a 35-point difference in Elo. in chess that would be two players in the same league but one with a clear edge
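
for anyone who wants to check the conversion, here's a minimal sketch. It assumes the standard logistic Elo model with a 400-point scale and treats the 55% as an expected score with no ties. The function names are just illustrative, not from any library.

```python
import math


def elo_gap(win_rate: float) -> float:
    """Rating difference implied by an expected score under the logistic Elo model."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    # Invert E = 1 / (1 + 10 ** (-gap / 400)) to solve for gap.
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


def expected_score(gap: float) -> float:
    """Expected score of the higher-rated side given a rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))


if __name__ == "__main__":
    gap = elo_gap(0.55)
    print(round(gap, 1))                  # 34.9
    print(round(expected_score(gap), 2))  # 0.55
```

so a 55/45 split works out to roughly 35 points under those assumptions, and a 50/50 split is a gap of zero.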

kevmo314 5 days ago

Rarely are two models put head-to-head though. If Claude Sonnet 3.7 isn't able to generate a good PR review (for whatever reason), a 2% better review isn't all that strong of a value proposition.

swyx 5 days ago

the point is OpenAI is saying they have a viable Claude Sonnet competitor now