For my use case, Gemini 2.5 is terrible. I have complex Cython code in a single file (1,500 lines) for sequence labeling. Claude and o3 are very good at improving this code and following instructions. Gemini always tries to make unrelated changes. For example, I asked, separately, for small changes such as removing an unused function or caching array indexes. Every time it completely refactored the code and was obsessed with removing the GIL. The output code is always broken, because removing the GIL is not easy.
That matches my experience as well. Gemini 2.5 Pro seems better at writing code from scratch, but Claude 3.7 seems much better at refactoring my existing code.
Gemini also seems more likely to come up with 'advanced' ideas (for better or worse). For example, I asked both for a fast C++ function to solve a computational geometry problem that is fairly simple on the surface. Claude solved it in a straightforward, obvious way: nothing obviously inefficient, it will perform reasonably well for all inputs, but it also left some performance on the table. I could also tell at a glance that it was almost certainly correct.
Gemini, on the other hand, did a bunch of (possibly) clever 'optimisations' and tricks, and made extensive use of OpenMP. I know from experience that those optimisations will only be faster if the input has certain properties, and will add massive overhead in other, quite common, cases.
With a bit more prompting and some questions on my part, I did manage to get both Gemini and Claude to converge on pretty much the same final answer.
You can fix this by using a system prompt that forces it to reply with just a diff. That makes generation much faster and much less prone to changing unrelated lines. Also try reducing the temperature to, say, 0.4; I find the default temperature of 1 too high. For sample system prompts, see Aider Chat: https://github.com/Aider-AI/aider/blob/main/aider/coders/edi...
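Something like this, roughly, using the google-generativeai Python SDK (the model name, API key, and system prompt text here are placeholders, not Aider's actual prompts):

```python
# Rough sketch: diff-only system prompt plus a lower temperature.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

model = genai.GenerativeModel(
    "gemini-2.5-pro",  # adjust to whatever model id your account exposes
    system_instruction=(
        "You are a code editor. Reply ONLY with a unified diff of the "
        "requested change. Do not refactor or touch unrelated lines."
    ),
)

response = model.generate_content(
    "Remove the unused function `old_helper` from the attached file.",
    generation_config=genai.types.GenerationConfig(temperature=0.4),
)
print(response.text)
```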
This reflects my experience 1:1... even telling 2.5 Pro to focus on the task given and ignore everything else leads to it changing unrelated code. It's a frustrating experience, because I believe that at its core it is more capable than Sonnet 3.5/3.7.
> Gemini always tries to make unrelated changes. For example, I asked, separately, for small changes such as removing an unused function
For anything like this, I don’t understand trying to invoke AI. Just open the file and delete the lines yourself. What is AI going to do here for you?
It's like you're relying 100% on AI when it's just one tool in your toolset.
Playing devil's advocate here: removing a function is not always as simple as deleting the lines. Sometimes there are references to that function that you forgot about, which the LLM will notice and automatically update for you. Depending on your prompt, it will also go find other references outside of that single file and remove those as well. Another possibility is that people are just becoming used to interacting with their codebase through the "chat" interface and directing the LLM to do things, so that behavior carries over into all interactions, even perceived "simple" ones.
I like to code with an LLM's help, making iterative changes: first do this, then once that code is in a good place, do the next thing, and so on. If I ask it to make one change, I want it to make one change only.
For me, I had to upload the library's current documentation to it, because it was using outdated references, breaking code that was already working, and not focusing on the parts I was trying to build upon.
If you don't mind me asking, how do you go about this?
I hear people mention doing this all the time, but I can't imagine they are manually adding every page of the docs for the libraries or frameworks they're using, since unfortunately most aren't in one tidy page that's easy to copy-paste.
Have the AI write a quick script using bs4 or whatever to take the HTML dump and output JSON; then all the aider-likes can use that JSON as documentation. Or just use the HTML, but that wastes context window.
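Roughly along these lines, assuming the docs are dumped as a folder of HTML files (the folder name and output path are placeholders, and real sites usually need site-specific cleanup):

```python
# Rough sketch: strip boilerplate from HTML doc pages and dump them as JSON.
import json
from pathlib import Path

from bs4 import BeautifulSoup

pages = []
for html_file in Path("docs_dump").glob("**/*.html"):  # placeholder folder
    soup = BeautifulSoup(html_file.read_text(encoding="utf-8"), "html.parser")
    # Drop navigation, scripts, and styling that would only waste context window.
    for tag in soup(["nav", "script", "style", "header", "footer"]):
        tag.decompose()
    pages.append({
        "file": str(html_file),
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "text": soup.get_text(" ", strip=True),
    })

Path("docs.json").write_text(json.dumps(pages, indent=2), encoding="utf-8")
```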
If you have access to the documentation source, you can concatenate all files into one. Some software also has docs downloadable as PDF.
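If it's, say, a folder of Markdown sources, something this simple does the job (paths are placeholders):

```python
# Rough sketch: concatenate Markdown doc sources into a single file.
from pathlib import Path

parts = []
for md_file in sorted(Path("docs").glob("**/*.md")):  # placeholder folder
    # Keep a marker so the model (and you) can tell which file a section came from.
    parts.append(f"\n\n<!-- {md_file} -->\n\n" + md_file.read_text(encoding="utf-8"))

Path("all_docs.md").write_text("".join(parts), encoding="utf-8")
```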
Using outdated references and docs is something I've experienced from time to time with more or less every model I've tried.
I am hoping MCP will fix this. I am building an MCP integration with kapa.ai for my company to help devs here. I guess that doesn't help if you don't add the tool, though.
That's expected, because they almost all have training cut-off dates from a year ago or longer.
The more interesting question is whether feeding in carefully selected examples or documentation covering the new library versions helps them get it right. I find that it usually does.
Set the temperature to 0.4 or lower.
Adjusting the temperature is something I often forget. I think Gemini's range is 0.0 to 2.0 (default 1.0). Lowering the temperature should give more consistent/deterministic results.
How are you asking Gemini 2.5 to change existing code? With Claude 3.7, it's possible to use Claude Code, which gets "extremely fast but untrustworthy intern"-level results. Do you have a preferred setup for using Gemini 2.5 in a similar agentic mode, perhaps with a tool like Cursor or aider?
For all LLMs, I'm using a simple prompt with the complete code in triple quotes and the command at the end, asking it to output the complete code of any changed functions. Then I use WinMerge to compare the changes and apply them. I feel more confident doing this than using Cursor.
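For what it's worth, the prompt I build looks roughly like this (the file name and instruction are placeholders):

```python
# Rough sketch: complete code in triple quotes, instruction at the end.
from pathlib import Path

code = Path("sequence_labeling.pyx").read_text(encoding="utf-8")  # placeholder file
instruction = "Remove the unused function `old_helper`. Change nothing else."

prompt = (
    f'"""\n{code}\n"""\n\n'
    f"{instruction}\n"
    "Output the complete code of every function you changed, and only those functions."
)
```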
You should really check out aider. It automates this, and it also does things like build a repo map of all your functions/signatures for files not included in the chat, so it can get more context.
I mean, it really comes down to how you use it.
The focus on benchmarks encourages a tendency to generalize performance as if it were independent of context and user.
Each model really is a different piece of software with different capabilities. It's fascinating to see how dramatically people's assessments differ.
Yup, Gemini 2.5 is bad.
Were you also trying to edit the same codebase as the GP, or did you evaluate it on some other criteria where it also failed?
I take the same prompt and give it to 3.7, o1 pro, and Gemini. I do this for almost everything, and these are large prompts with 50k+ tokens of context. Gemini is almost always behind.