“GPT‑4.1 scores 54.6% on SWE-bench Verified, improving by 21.4%abs over GPT‑4o and 26.6%abs over GPT‑4.5—making it a leading model for coding.”
4.1 is 26.6% better at coding than 4.5. Got it. Also… see the em dash.
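Worth spelling out the arithmetic in that quote. Assuming "abs" means absolute percentage points (which is how I read it), the quoted gains imply baseline scores for the other two models, and the relative improvement comes out much larger than the point gap. A quick sketch using only the three numbers from the quote; the baselines are derived here, not stated in the quote, and the helper names are just made up for illustration:

```python
# Figures taken from the quote above (SWE-bench Verified, % of tasks solved).
gpt41_score = 54.6
gain_over_4o = 21.4   # assumed to be absolute percentage points ("abs")
gain_over_45 = 26.6   # assumed to be absolute percentage points ("abs")


def implied_baseline(score, abs_gain):
    """Baseline score implied by a score and an absolute point gain."""
    return round(score - abs_gain, 1)


def relative_gain_pct(score, baseline):
    """Relative improvement over the baseline, in percent."""
    return round(100 * (score - baseline) / baseline, 1)


baseline_4o = implied_baseline(gpt41_score, gain_over_4o)
baseline_45 = implied_baseline(gpt41_score, gain_over_45)

rel_vs_4o = relative_gain_pct(gpt41_score, baseline_4o)
rel_vs_45 = relative_gain_pct(gpt41_score, baseline_45)

print(f"Implied GPT-4o score:  {baseline_4o}%  (relative gain: {rel_vs_4o}%)")
print(f"Implied GPT-4.5 score: {baseline_45}%  (relative gain: {rel_vs_45}%)")
```

So if that reading is right, "26.6% better" undersells the gap on this one benchmark: it is 26.6 points, not 26.6 percent.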
What's wrong with the em-dash? That's just...the typographically correct dash AFAIK.
Should have named it 4.10
But it’s so much weaker than 4.5 on broader tasks… Maybe it’s more optimized against benchmarks, but it’s just no replacement for a huge model.