Aider's benchmarks show 4.1 (and 4o) working better as the architect in its architect mode, planning the changes, with o3 making the actual edits.
You have that backwards. The leaderboard results have the thinking model as the architect.
In this case, o3 is the architect and 4.1 is the editor.
I see o3 (high) + gpt-4.1 at 82.7% -- the highest on the benchmark currently.