Google has had almost perfect recall on the needle-in-the-haystack test since Gemini 1.5[1], achieving close to 100% over the entire context window. I can't provide a link benchmarking 2.5 Pro in particular, but this has been a solved problem with Google's models, so I assume the same holds for the new one.
[1] https://cloud.google.com/blog/products/ai-machine-learning/t...
Have those results been reproduced elsewhere, with benchmarks other than the ones Google seems to use?
It's hard to trust their own benchmarks at this point, and I'm not home at the moment so I can't try it myself either.
They are testing for very straightforward needle retrieval, as LLMs have traditionally been terrible at this in longer contexts.
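For anyone unfamiliar with how these tests are constructed, here's a minimal sketch: bury one fact (the "needle") at a chosen depth inside filler text, then ask the model to retrieve it. The filler sentence, needle, and function names below are all made up for illustration; real harnesses sweep many context lengths and depths and call an actual model.

```python
# Sketch of a needle-in-a-haystack test harness. The FILLER and NEEDLE
# strings are hypothetical examples, not from any published benchmark.

FILLER = "The sky was a clear blue and the grass was green. "
NEEDLE = "The secret passphrase is 'marmalade-42'. "

def build_haystack(total_sentences: int, depth: float) -> str:
    """Place the needle at `depth` (0.0 = start, 1.0 = end) of the context."""
    insert_at = int(total_sentences * depth)
    parts = [FILLER] * total_sentences
    parts.insert(insert_at, NEEDLE)
    return "".join(parts)

def score(model_answer: str) -> bool:
    """Pass if the model's answer contains the needle's payload."""
    return "marmalade-42" in model_answer

prompt = (
    build_haystack(total_sentences=2000, depth=0.5)
    + "\nWhat is the secret passphrase?"
)
# A real run would send `prompt` to the model under test and record
# score(answer) for each (context length, depth) cell in a grid.
```

The resulting pass/fail grid over lengths and depths is what produces the familiar recall heatmaps Google publishes.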
There are some more advanced tests where it's far less impressive. Just a couple of days ago Adobe released one such test: https://github.com/adobe-research/NoLiMa