thorum 1 day ago

GPT-4.1 and GPT-4o score very low on the Aider coding benchmark. In my experience, you only start to get acceptable results with models that score 70%+. Even then, don't expect them to do anything complex without a lot of hand-holding. Over time you get a sense for what works and what doesn't.

https://aider.chat/docs/leaderboards/

bjt12345 22 hours ago

That being said, Claude Sonnet 3.7 seems to do very well with a recursive approach to writing a program, whereas other models don't fare as well.

k__ 13 hours ago

Sonnet 3.7 was SOTA for quite some time. I built some nice charts with it. It's a rather simple task, but quite LoC-intensive.