Google reported different scores for the "diff" and "whole" modes, and the other models on the list were benchmarked with "diff", so I chose the "diff" score. It's hard to make a real apples-to-apples comparison.
The 73% on the current leaderboard is using "diff", not "whole". (Well, diff-fenced, but the difference is just the location of the filename.)
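Roughly, going by the aider docs (the path and snippet here are just placeholders), "diff" puts the file path on its own line above the fenced block the model emits:

    path/to/file.py
    ```python
    <<<<<<< SEARCH
    def greet():
        print("hi")
    =======
    def greet():
        print("hello")
    >>>>>>> REPLACE
    ```

while "diff-fenced" moves the path inside the fence:

    ```
    path/to/file.py
    <<<<<<< SEARCH
    def greet():
        print("hi")
    =======
    def greet():
        print("hello")
    >>>>>>> REPLACE
    ```

Everything else about the SEARCH/REPLACE block is the same.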
Huh, it seems Aider added a special mode specifically for Gemini[1] some time after Google's announcement blog post with the official performance numbers. I'm still not sure it makes sense to quote that new score next to the others. In any case, Gemini's 69% is the top score even without a special mode.
[1] https://aider.chat/docs/more/edit-formats.html#diff-fenced:~...
The mode wasn't added after the announcement; Aider has had it for almost a year: https://aider.chat/HISTORY.html#aider-v0320
This benchmark has an authoritative source of results (the leaderboard), so it seems obvious that it's the number that should be used.
OK, but it was still added specifically to improve Gemini, and nobody else on the leaderboard uses it. Google themselves do not use it when they benchmark their own models against others; they use the regular "diff" mode that everyone else uses. https://blog.google/technology/google-deepmind/gemini-model-...
They just pick the best performer out of the built-in modes they offer.
Interesting data point about the model's behavior, but even more so it's a recommendation for how to configure the model for optimal performance.
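For anyone who wants to try that configuration locally, aider exposes the choice directly, either as the --edit-format switch on the command line or through its Python scripting API. A minimal sketch, with parameter names from memory and an example model string, so double-check against the scripting docs:

    # Minimal sketch of pinning the edit format via aider's Python scripting API.
    # Assumes the API key litellm expects for the chosen model is set in the environment.
    from aider.coders import Coder
    from aider.models import Model

    model = Model("gemini/gemini-1.5-pro-latest")  # any litellm-style model name

    coder = Coder.create(
        main_model=model,
        edit_format="diff-fenced",  # the Gemini-friendly variant; "diff" and "whole" also work
        fnames=["greeting.py"],     # files to add to the chat
    )

    # Sends one instruction and applies whatever SEARCH/REPLACE edits come back.
    coder.run("add a docstring to the greet function")

Swapping "diff-fenced" for "diff" or "whole" there is the entire difference between the configurations being argued about here.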
I do consider this to be an apples-to-apples benchmark, since they're evaluating real-world performance.