GPT 4.1 and 4o score very low on the Aider coding benchmark. You only start to get acceptable results with models that score 70%+ in my experience. Even then, don't expect it to do anything complex without a lot of hand-holding. You start to get a sense for what works and what doesn't.
That been said, Claude Sonnet 3.7 seems to do very well at a recursive approach to writing a program whereas other models don't fare as well.
Sonnet 3.7 was SOTA for quite some time. I built some nice charts with it. It's a rather simple task, but quite LoC-intensive.