modeless 5 days ago

Google reports different scores for the "diff" and "whole" modes, and the other models' numbers were for "diff", so I chose the "diff" score. Hard to make a real apples-to-apples comparison.

jsnell 5 days ago

The 73% on the current leaderboard is using "diff", not "whole". (Well, diff-fenced, but the difference is just the location of the filename.)
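
A rough sketch of the two formats (the file name and snippet here are made up; see the Aider docs for the exact spec). In "diff" the file path sits above the fence:

  greeting.py
  ```python
  <<<<<<< SEARCH
  print("hello")
  =======
  print("hello world")
  >>>>>>> REPLACE
  ```

In "diff-fenced" it moves inside the fence:

  ```python
  greeting.py
  <<<<<<< SEARCH
  print("hello")
  =======
  print("hello world")
  >>>>>>> REPLACE
  ```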

modeless 5 days ago

Huh, seems like Aider made a special mode specifically for Gemini[1] some time after Google's announcement blog post with its official performance numbers. I'm still not sure it makes sense to quote that new score next to the others. In any case, Gemini's 69% is the top score even without a special mode.

[1] https://aider.chat/docs/more/edit-formats.html#diff-fenced:~...

jsnell 5 days ago

The mode wasn't added after the announcement, Aider has had it for almost a year: https://aider.chat/HISTORY.html#aider-v0320

This benchmark has an authoritative source of results (the leaderboard), so it seems obvious that the leaderboard number is the one that should be used.

modeless 5 days ago

OK, but it was still added specifically to improve Gemini, and nobody else on the leaderboard uses it. Google themselves don't use it when they benchmark their own models against others; they use the regular diff mode that everyone else uses. https://blog.google/technology/google-deepmind/gemini-model-...

tcdent 5 days ago

They just pick the best performer out of the built-in modes they offer.

Interesting data point about the model's behavior, but even more so it's a recommendation for how to configure the model for optimal performance.

I do consider this to be an apples-to-apples benchmark, since they're evaluating real-world performance.