TL;DR
If you want to jump straight to the conclusion, I’d say go for Gemini 2.5 Pro: it’s better at coding, has a one-million-token context window compared to Claude’s 200k, and you can get it for free (a big plus). Claude 3.7 Sonnet isn’t that far behind, but at this point there’s little reason to use it over Gemini 2.5 Pro.
> has one million in context window
Is this effective context window or just the absolute limit? A lot of the models that claim to support very large context windows cannot actually successfully do the typical "needle in a haystack" test, but I'm guessing there are published results somewhere demonstrating Gemini 2.5 Pro can actually find the needle?
Google has had almost perfect recall in the needle in the haystack test since 1.5[1], achieving close to 100% over the entire context window. I can't provide a link benchmarking 2.5 Pro in particular, but this has been a solved problem with Google models so I assume the same is true with their new model.
[1] https://cloud.google.com/blog/products/ai-machine-learning/t...
Have those results been reproduced elsewhere, with benchmarks other than the ones Google seems to use?
It's hard to trust their own benchmarks at this point, and I'm not home at the moment so I can't try it myself either.
They are testing for a very straightforward needle retrieval, as LLMs were traditionally terrible at this in longer contexts.
There are some more advanced tests where it's far less impressive. Just a couple of days ago Adobe released one such test: https://github.com/adobe-research/NoLiMa
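For anyone unfamiliar with the methodology being discussed: a needle-in-a-haystack test buries one relevant sentence inside a long run of irrelevant filler and checks whether the model can retrieve it. A minimal sketch of how such a prompt is constructed (the function name, filler text, and magic-number needle here are all made up for illustration):

```python
def build_haystack(needle: str, filler: str, n_chunks: int, depth: float) -> str:
    """Bury `needle` at a relative `depth` (0.0 = start, 1.0 = end)
    inside `n_chunks` copies of irrelevant filler text."""
    chunks = [filler] * n_chunks
    pos = int(depth * n_chunks)
    chunks.insert(pos, needle)
    return "\n".join(chunks)

# The prompt sent to the model is the haystack plus a question whose
# answer is only recoverable from the needle. Benchmarks sweep both the
# total length and the depth at which the needle is placed.
prompt = build_haystack(
    needle="The magic number is 417.",
    filler="The quick brown fox jumps over the lazy dog.",
    n_chunks=1000,
    depth=0.5,
) + "\n\nWhat is the magic number?"
```

Tests like NoLiMa are harder precisely because the needle doesn't lexically match the question, so simple retrieval over the context isn't enough.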
This is a good question. There's a big difference in being able to write coherent code and "needle in the haystack" questions. I've found that Claude is able to do the needle in the haystack questions just fine with a large context, but not so with coding. You have to work to keep the context low (around 15% to 20% in projects) to get coherent code that doesn't confabulate.
Not sure what happened with Claude 3.7, but 3.5 is way better at all things day to day. 3.7 felt like a major step back, especially when it comes to coding, even though this was highlighted as one aspect they improved upon. A 500k window will soon be released for Claude. Not sure how much it will improve anything though.
With Claude 3.7 I keep having to remind it about things, and go back and correct it several times in a row, before cleaning the code up significantly.
For example, yesterday I wanted to make a 'simple' time format, tracking Earth's orbits of the Sun, the Moon's orbits of Earth and rotations of Earth from a specific given point in time (the most recent 2020 great conjunction) - without directly using any hard-coded constants other than the orbital mechanics and my atomic clock source. This would be in the format of `S4.7.... L52... R1293...` for sols, luns & rotations.
I keep having to remind it to go back to first principles: we want actual rotations, real day lengths etc., rather than hard-coded constants that approximate the mean over the year.
How are you getting gemini 2.5 pro for free?
In the Gemini iOS app the only models currently available are 2.0 Flash and 2.0 Flash Thinking.
> How are you getting gemini 2.5 pro for free?
I think the "AI Premium" plan of Google One includes access to all the models, including the latest ones (at least that's what it says for me in Spain): https://one.google.com/plans
They just added it to the free tier today.
What does this context window mean? Is it the size of the prompt the model can be made aware of?
In practice, can you use any of these models with existing code bases of, say, 50k LoC?
If only there were an alternative to Claude Code...