Item 43534880

MrScruff • 3 days ago

The evidence given really doesn't justify the conclusion. Maybe it suggests 2.5 Pro might be better if you're asking it to build JavaScript apps from scratch, but that hardly equates to "It's better at coding". Feels like a lot of LLM articles follow this pattern: someone runs their own toy benchmarks and confidently extrapolates broad conclusions from a handful of data points. The SWE-Bench result carries a bit more weight, but even that should be taken with a pinch of salt.

throwaway0123_5 • 3 days ago

> The SWE-Bench result carries a bit more weight

Although I have issues with it (few benchmarks are perfect), I tend to agree. Gemini's 63.8 versus Sonnet's 62.3 isn't a huge jump, though. To Gemini's credit, it solved a bug in my PyTorch code yesterday that o1 (through the web app) couldn't (or at least didn't with my prompts).

namaria • 3 days ago

There are three things this hype cycle excels at: getting money from investors for foundational model creators and startup.ai; spinning layoffs as a good sign for big corps; and letting people looking for clout online try to look like clever tech bloggers.