My whole team feels like 3.7 is a letdown. It really struggles to follow instructions, as others have mentioned.
Makes me think they really just hacked the benchmarks on this one.
Claude Sonnet 3.7 Thinking is also an unmitigated disaster for coding. I was mistaken in assuming that a "thinking" model would be better at logic. It turns out "thinking" is a marketing term, a euphemism for "hallucinating" ... though that's not surprising once you actually look at the model cards for these "reasoning" / "thinking" LLMs. That said, I've found them to work nicely for IR (information retrieval).
Overthinking without extra input is always bad.
It's super bad for humans too. You start to spiral down a dark path when your thoughts run away: you make up theories, then base more theories on those, and so on.
They definitely over-optimized it for agentic use, where the quality of the code doesn't matter as much as its ability to run, even if just barely. Viewed from that perspective, all the nested error handling, excessive comments, 10 lines that could be done in 2, etc. start to make sense.
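For a concrete picture of that style, here's a made-up sketch (not actual 3.7 output; `load_config` is a hypothetical example) of the pattern people are describing, next to the version a reviewer would actually want:

```python
import json

# The defensively nested style: every step wrapped in its own
# try/except, with comments that just restate the code.
def load_config_verbose(path):
    # Attempt to open the file
    try:
        with open(path) as f:
            # Attempt to parse the JSON
            try:
                data = json.load(f)
            except json.JSONDecodeError:
                # Fall back to an empty config on a parse error
                data = {}
    except OSError:
        # Fall back to an empty config if the file can't be opened
        data = {}
    return data

# Identical behavior, without the nesting or the comment noise.
def load_config(path):
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError):
        return {}
```

Both versions run and fail the same way, which is exactly the point: if the optimization target is "does it execute", the verbose one scores just as well.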