ZeroCool2u 5 days ago

The lack of benchmark comparisons to other models, especially Gemini 2.5 Pro, is telling.

dmd 5 days ago

Gemini 2.5 Pro gets 64% on SWE-bench Verified. Sonnet 3.7 gets 70%.

They are reporting that GPT-4.1 gets 55%.

egeozcan 5 days ago

Very interesting. For my use cases, Gemini's responses beat Sonnet 3.7's maybe 80% of the time (gut feeling; I didn't collect actual data). It beats Sonnet 100% of the time once the context grows beyond 120k tokens.
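If you wanted to act on that observation, a rough sketch of routing on context size might look like the following. The 4-chars-per-token estimate and the 120k cutoff are illustrative assumptions; real code should use the provider's own token counter:

```python
# Toy router: prefer Gemini once the prompt outgrows ~120k tokens.

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # crude ~4 chars/token heuristic; use the provider's tokenizer in practice

def pick_model(context: str) -> str:
    if estimate_tokens(context) > 120_000:
        return "gemini-2.5-pro"   # illustrative model id
    return "claude-3-7-sonnet"    # illustrative model id

print(pick_model("def foo(): ..." * 50_000))  # -> gemini-2.5-pro
```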

int_19h 5 days ago

As usual with LLMs. In my experience, all those metrics are useful mainly for telling which models are definitely bad; they don't tell you much about which ones are good, and especially not how the good ones stack up against each other in real-world use cases.

Andrej Karpathy famously quipped that he only trusts two LLM evals: Chatbot Arena (which has humans blindly compare and score responses), and the r/LocalLLaMA comment section.
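For the curious, an Arena-style score is just a rating fit over blind pairwise votes. Here's a toy Elo version of the idea (the modern leaderboard fits a Bradley-Terry model instead, and the votes and K-factor below are made up for illustration):

```python
K = 32  # illustrative K-factor, not Arena's actual setting

def expected_score(r_a: float, r_b: float) -> float:
    """Elo-model probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings: dict[str, float], winner: str, loser: str) -> None:
    e = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e)
    ratings[loser] -= K * (1.0 - e)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
for w, l in [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]:
    record_vote(ratings, w, l)

print(ratings)  # model_a drifts above model_b after winning 2 of 3
```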

ezyang 5 days ago

LMArena isn't that useful anymore lol

int_19h 5 days ago

I actually agree with that, but it's generally better than other scores. Also, the quote is like a year old at this point.

In practice you have to evaluate the models yourself for any non-trivial task.
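A bare-bones version of "evaluate them yourself" is just running your own prompts through each candidate and comparing outputs side by side. In the sketch below, call_model is a hypothetical stand-in for whichever provider SDKs you actually use:

```python
# Minimal DIY eval: same prompts, every candidate, outputs side by side.

PROMPTS = [
    "Refactor this recursive function to be iterative: ...",
    "Summarize this design doc in five bullet points: ...",
]

MODELS = ["gemini-2.5-pro", "claude-3-7-sonnet", "gpt-4.1"]  # illustrative ids

def call_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in: dispatch to the relevant provider SDK here.
    return f"[{model} response to: {prompt[:40]}...]"

for prompt in PROMPTS:
    print(f"=== {prompt[:60]} ===")
    for model in MODELS:
        print(f"--- {model} ---")
        print(call_model(model, prompt))
```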

hmottestad 5 days ago

Are those with «thinking» or without?

sanxiyn 5 days ago

Sonnet 3.7's 70% is without thinking, see https://www.anthropic.com/news/claude-3-7-sonnet

aledalgrande 5 days ago

The thinking tokens (even just 1024) make a massive difference in real-world tasks with 3.7, in my experience.
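For context, extended thinking is enabled per-request via the thinking parameter of the Anthropic Messages API. A minimal sketch, with an illustrative model id and prompt:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # illustrative model id
    max_tokens=4096,  # must be larger than the thinking budget
    thinking={"type": "enabled", "budget_tokens": 1024},  # 1024 is the minimum budget
    messages=[{"role": "user", "content": "Find the race condition in this snippet: ..."}],
)

# The response interleaves "thinking" and "text" content blocks.
print([block.type for block in response.content])
```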

chaos_emergent 5 days ago

Based on their release cadence, I suspect that o4-mini will compete on price, performance, and context length with the rest of these models.

hecticjeff 5 days ago

o4-mini, not to be confused with 4o-mini

energy123 5 days ago

With

poormathskills 5 days ago

Go look at their past blog posts. OpenAI only ever benchmarks against their own models.

This is pretty common across industries. The leader doesn’t compare themselves to the competition.

christianqchung 5 days ago

Okay, it's common across other industries, but not this one. Here are Google, Facebook, and Anthropic comparing their frontier models to others [1][2][3].

[1] https://blog.google/technology/google-deepmind/gemini-model-...

[2] https://ai.meta.com/blog/llama-4-multimodal-intelligence/

[3] https://www.anthropic.com/claude/sonnet

poormathskills 5 days ago

Right. Those labs aren’t leading the industry.

comp_throw7 5 days ago

Confusing take - Gemini 2.5 is probably the best general-purpose coding model right now, and before that it was Sonnet 3.5 (maybe 3.7, if you can get it to be less reward-hacky). OpenAI hasn't had the best coding model for... coming up on a year now? (o1-pro probably "outperformed" Sonnet 3.5, but you'd be waiting 10 minutes for a response, so.)

oofbaroomf 5 days ago

Leader is debatable, especially given the actual comparisons...

dimitrios1 5 days ago

There is no uniform tactic for this type of marketing. They will compare against whomever suits their marketing goals.

kweingar 5 days ago

That would make sense if OAI were the leader.

awestroke 5 days ago

Except they are far from the lead in model performance

poormathskills 5 days ago

Which lab has the SOTA (publicly released) model is constantly changing. It's more interesting to see who is driving the innovation in the field, and right now that is pretty clearly OpenAI (GPT-3, the first multimodal model, the first reasoning model, etc.).

swyx 5 days ago

Also, sometimes if you get it wrong you catch unnecessary flak.