Hello Simon, sorry for the tangent here, but have you seen this paper? Is it as important as it appears to be? Should this metric be on all system cards?
> We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%.
I had not seen that one! That's really interesting. I'd love to see them run that against Gemini 2.5 Pro and Gemini 2.5 Flash; to my understanding they're way ahead of other models on needle-in-a-haystack tests these days.
Yes, I wish their methodology were run against new models on an ongoing basis.
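For reference, the needle-in-a-haystack test mentioned above is roughly: bury a unique fact (the "needle") at various depths in long filler text (the "haystack") and check whether the model can retrieve it as the context grows. Below is a minimal sketch of that style of harness, not the paper's actual methodology; the filler, needle, and `query_model` callable are all illustrative stand-ins for whichever LLM API is under test.

```python
# Minimal needle-in-a-haystack harness: hide a fact (the "needle") at varying
# depths inside filler text (the "haystack") at several context lengths, then
# check whether the model under test can retrieve it.

FILLER = "The grass is green. The sky is blue. The sun is warm. "
NEEDLE = "The secret passphrase is 'violet meridian'."
QUESTION = "What is the secret passphrase? Answer with the passphrase only."


def build_haystack(total_chars: int, needle_depth: float) -> str:
    """Repeat filler out to ~total_chars, splicing the needle in at
    needle_depth (0.0 = start of context, 1.0 = end)."""
    filler = (FILLER * (total_chars // len(FILLER) + 1))[:total_chars]
    cut = int(len(filler) * needle_depth)
    return filler[:cut] + " " + NEEDLE + " " + filler[cut:]


def run_eval(query_model, lengths=(1_000, 32_000, 128_000),
             depths=(0.1, 0.5, 0.9)) -> None:
    """query_model is any callable taking a prompt string and returning the
    model's text response -- wire it to whichever LLM API you're testing."""
    for total in lengths:
        hits = 0
        for depth in depths:
            prompt = build_haystack(total, depth) + "\n\n" + QUESTION
            if "violet meridian" in query_model(prompt).lower():
                hits += 1
        print(f"{total:>7} chars: {hits}/{len(depths)} needles retrieved")


if __name__ == "__main__":
    # Stand-in "model" that just string-searches the prompt; it trivially
    # passes, but shows where a real API client would slot in.
    run_eval(lambda prompt: NEEDLE if NEEDLE in prompt else "I don't know")
```

Part of the paper's point is that models can pass literal-match retrieval like this while still degrading on tasks that require actually reasoning over the long context, which is why re-running their methodology against newer models would be so interesting.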