elashri 5 days ago

Are there any benchmarks, or has anyone tested how these long-max-token models perform in scenarios where you actually use most of the token limit?

I found from my experience with Gemini models that quality drops after ~200k tokens and the model basically stops keeping track of things. But I don't have any numbers or a systematic study of this behavior.

I think all providers who announce increased max token limits should address this, because I don't think it's useful to say the max allowed context is 1M tokens when you basically cannot use anything near that in practice.

kmeisthax 5 days ago

The problem is that while you can train a model with the "context size" hyperparameter set to 1M, there's very little 1M-token data to train on. Most of a model's ability to follow long context comes from the fact that it's trained on lots of (stolen) books; in fact, I believe OpenAI outright said in court that they can't do long context without training on books.

Novels are usually measured in words, and the rule of thumb is that four tokens make up about three words. So that 200k-token wall you're hitting is right about where most authors stop writing: 150k words is already considered long for a novel, and to train 1M tokens properly you'd need not just one ~750k-word book but many of them. Humans just don't write or read that much text at once.

To get around this, whoever is training these models would need to change their training strategy to either:

- Group books in a series together as a single, very long text to be trained on

- Train on multiple unrelated books at once in the same context window

- Amplify the gradients by the length of the text being trained on, so that the few long texts that do exist have greater influence on the model weights as a whole.

I suspect they're doing #2, just to get some gradients onto the longer end of the context window, but that also diminishes long-context reasoning, because there's no reason for the model to develop a connection between, say, token 32 and token 985,234.
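A rough sketch of what #2 could look like in practice; the tokenizer, separator token, and window length here are placeholders I'm making up, not anything a lab has confirmed:

  import random

  def pack_documents(tokenized_docs, target_len=1_000_000, sep_token=0):
      """Greedily concatenate unrelated documents (lists of token ids)
      into windows of roughly target_len tokens."""
      random.shuffle(tokenized_docs)           # unrelated docs end up adjacent
      windows, current = [], []
      for doc in tokenized_docs:
          current.extend(doc + [sep_token])    # separator marks a document boundary
          while len(current) >= target_len:
              windows.append(current[:target_len])
              current = current[target_len:]
      if current:
          windows.append(current)              # trailing partial window
      return windows

The packed window produces gradients at positions near the far end of the context, but any attention across the document boundaries carries no real signal, which is exactly the weakness above.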

omneity 5 days ago

I'm not sure to what extent this opinion is accurately informed. It is well known that nobody trains on 1M-token-long content. It wouldn't work anyway: the dependencies are too long-range and you end up with vanishing gradients.

RoPE (Rotary Positional Embeddings; think modular or periodic arithmetic) scaling is key: the model is trained on 16k-token-long content and then scaled up to 100k+ [0]. Qwen 1M (which has near-perfect recall over the complete window [1]) and Llama 4 10M pushed the limits of this technique, with Qwen training reliably at a much higher RoPE base and Llama 4 introducing iRoPE, which claims scaling to extremely long contexts, up to infinity.

[0]: https://arxiv.org/html/2310.05209v2

[1]: https://qwenlm.github.io/blog/qwen2.5-turbo/#passkey-retriev...
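For illustration, here's a minimal sketch of RoPE with a configurable base; a larger base stretches the rotation periods so positions far beyond the training length stay distinguishable. The dimensions and base values are illustrative, not Qwen's or Llama's actual settings:

  import numpy as np

  def rope_angles(positions, dim, base=10_000.0):
      """Rotation angles: each pair of dimensions rotates at its own frequency."""
      inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))   # (dim/2,)
      return np.outer(positions, inv_freq)                      # (seq, dim/2)

  def apply_rope(x, base=10_000.0):
      """Apply rotary position embedding to x of shape (seq_len, dim)."""
      seq_len, dim = x.shape
      angles = rope_angles(np.arange(seq_len), dim, base)
      cos, sin = np.cos(angles), np.sin(angles)
      x1, x2 = x[:, 0::2], x[:, 1::2]
      out = np.empty_like(x)
      out[:, 0::2] = x1 * cos - x2 * sin
      out[:, 1::2] = x1 * sin + x2 * cos
      return out

  # "Scaling" the context: a larger base slows every rotation, so a model trained
  # at 16k positions can be run (and lightly fine-tuned) at much longer lengths.
  q = np.random.randn(8, 64)
  short_ctx = apply_rope(q, base=10_000.0)      # original base
  long_ctx  = apply_rope(q, base=1_000_000.0)   # higher base, as in long-context variants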

christianqchung 5 days ago

But Llama 4 Scout does badly on long-context benchmarks despite claiming 10M. It scores one slot above Llama 3.1 8B on this one[1].

[1] https://github.com/adobe-research/NoLiMa

omneity 5 days ago

Indeed, but that doesn't take away from the point that long context is not trained on long content; it's obtained by scaling up short content instead.

kmeisthax 5 days ago

Is there any evidence that GPT-4.1 is using RoPE to scale context?

Also, I don't know about Qwen, but I know Llama 4 has severe performance issues, so I wouldn't use that as an example.

omneity 5 days ago

I'm not sure about public evidence, but the memory requirements alone to train on 1M-token windows would make it a very unrealistic proposition compared to RoPE scaling. And as I mentioned, RoPE is essential for long context anyway; you can't train it the "normal" way. Please see the paper I linked previously for more context (pun not intended) on RoPE.

Re: Llama 4, please see the sibling comment.

killerstorm 4 days ago

No, there's a fundamental limitation of the Transformer architecture:

- Information from the entire context has to be squeezed into an information channel of a fixed size; the more information you try to squeeze through, the more noise you get.

- Selection of what information passes through is done using just a dot product.

Training data isn't the problem.

In principle, as you scale a transformer you get more heads and more dimensions in each vector, so the bandwidth of the attention data bus goes up and thus precision of recall goes up too.
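To make the fixed-size channel concrete, here's single-head dot-product attention in a few lines (dimensions made up). The output for one query is always a d_v-sized vector, no matter how many tokens it attends over:

  import numpy as np

  def attention(Q, K, V):
      """Scaled dot-product attention. Q: (n_q, d), K: (n_ctx, d), V: (n_ctx, d_v)."""
      scores = Q @ K.T / np.sqrt(Q.shape[-1])            # selection is just dot products
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the whole context
      return weights @ V                                 # (n_q, d_v), fixed size regardless of n_ctx

  d, d_v = 64, 64
  for n_ctx in (1_000, 64_000):
      out = attention(np.random.randn(1, d), np.random.randn(n_ctx, d), np.random.randn(n_ctx, d_v))
      print(n_ctx, out.shape)   # output stays (1, 64): more context, same channel width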

wskish 5 days ago

Codebases of high-quality open source projects and their major dependencies are probably another good source. Also: "transformative fair use", not "stolen".

crimsoneer 5 days ago

Isn't the problem more that the "needle in a haystack" eval (I said word X once; where?) is really not relevant to most long-context LLM use cases like code, where you need context from all the material simultaneously rather than identifying a single, quite separate relevant section?

omneity 5 days ago

What you're describing as "needle in a haystack" is a necessary requirement for the downstream ability you want. The distinction is really how many "things" the LLM can process in a single shot.

LLMs process tokens sequentially: first in a prefill stage, where the model reads your input, then in a generation stage, where it outputs response tokens. The attention mechanism is what allows the LLM, as it ingests or produces tokens, to "notice" that a token it has seen previously (your instruction) is related to a token it is now seeing (the code).

Of course this mechanism has limits (correlated with model size), and if the LLM needs to take the whole input into consideration to answer the question, the results won't be great.
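As a toy illustration of the two stages (everything below is random stand-in data, not a real model): the prompt's keys and values are computed once during prefill, and each generated token's query then looks back over that growing cache, which is where the "noticing" happens:

  import numpy as np

  d = 64
  rng = np.random.default_rng(0)

  def attend(q, K, V):
      """One query vector attending over the cached keys/values."""
      w = np.exp(q @ K.T / np.sqrt(d))
      w /= w.sum()
      return w @ V

  # Prefill: process the whole prompt once, caching its keys and values.
  prompt_len = 1000
  K_cache = rng.standard_normal((prompt_len, d))
  V_cache = rng.standard_normal((prompt_len, d))

  # Generation: each new token's query looks back over the entire cache,
  # which is how an instruction at position 12 can influence token 1012.
  for step in range(5):
      q_new = rng.standard_normal(d)
      context_vector = attend(q_new, K_cache, V_cache)
      k_new, v_new = rng.standard_normal(d), rng.standard_normal(d)  # stand-ins for the new token
      K_cache = np.vstack([K_cache, k_new])
      V_cache = np.vstack([V_cache, v_new])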

roflmaostc 5 days ago

What about old books? Wikipedia? Law texts? Programming language documentation?

How many tokens is a 100-page PDF? 10k to 100k?

arvindh-manian 5 days ago

For reference, I think a common approximation is one token being 0.75 words.

For a 100-page book, that translates to around 50,000 tokens. For 1M+ tokens, we need to be looking at 2,000+ page books. That's pretty rare, even for documentation.
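Back-of-the-envelope, assuming roughly 375 words per page (the page density is my assumption):

  words_per_page = 375          # assumed typical page density
  tokens_per_word = 1 / 0.75    # ~1.33 tokens per word

  pages = 100
  print(pages * words_per_page * tokens_per_word)           # ~50,000 tokens

  target_tokens = 1_000_000
  print(target_tokens / tokens_per_word / words_per_page)   # ~2,000 pages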

It doesn't have to be text-based, though. I could see films and TV shows becoming increasingly important for long-context model training.

handfuloflight 5 days ago

What about the role of synthetic data?

throwup238 5 days ago

Synthetic data requires a discriminator that can select the highest-quality results to feed back into training. Training a discriminator is easier than training a full-blown LLM, but it still suffers from a lack of high-quality training data in the case of 1M context windows. How do you train a discriminator to select good 2,000-page synthetic books if the only ones you have to train it with are Proust and concatenated Harry Potter/Game of Thrones/etc.?

jjmarr 5 days ago

Wikipedia does not have many pages that are 750k words. According to Special:LongPages[1], the longest page right now is a little under 750k bytes.

https://en.wikipedia.org/wiki/List_of_chiropterans

Despite listing all presently known bats, the majority of the "List of chiropterans" byte count is code that generates references to the IUCN Red List, not actual text. Most of Wikipedia's longest articles are code.

[1] https://en.wikipedia.org/wiki/Special:LongPages

nneonneo 5 days ago

I mean, can't they just train on some huge codebases? There are lots of 100KLOC codebases out there that would probably get close to 1M tokens.

enginoid 5 days ago

There are some benchmarks, such as Fiction.LiveBench[0], that give an indication, and the new Graphwalks approach looks super interesting.

But I'd love to see one specifically for "meaningful coding." Coding has specific properties that matter, such as variable tracking (following coreference chains), as described in RULER[1]. That paper also cautions against single-needle-in-the-haystack tests, which I think the OpenAI one might be. You really need at least multi-NIAH for it to tell you anything meaningful, which is what they've done for the Gemini models.

I think something a bit more interpretable, like `pass@1 rate for coding turns at 128k`, would be so much more useful than "we have 1M context" (with the acknowledgement that good-enough performance is often domain-dependent).

[0] https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o...

[1] https://arxiv.org/pdf/2404.06654
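For what it's worth, a pass@1 number is cheap to compute once you have an eval harness at a fixed context length; this is the standard unbiased pass@k estimator, with made-up per-task numbers:

  from math import comb

  def pass_at_k(n: int, c: int, k: int) -> float:
      """Unbiased pass@k estimator: n samples per task, c of them correct."""
      if n - c < k:
          return 1.0
      return 1.0 - comb(n - c, k) / comb(n, k)

  # Aggregate over tasks evaluated at, say, a 128k-token context:
  results = [(10, 7), (10, 2), (10, 0)]   # (samples, passes) per task, illustrative only
  print(sum(pass_at_k(n, c, 1) for n, c in results) / len(results))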

daemonologist 5 days ago

I ran NoLiMa on Quasar Alpha (GPT-4.1's stealth mode): https://news.ycombinator.com/item?id=43640166#43640790

Updated results from the authors: https://github.com/adobe-research/NoLiMa

It's the best-known performer on this benchmark, but it still falls off quickly at even relatively modest context lengths (85% performance at 16K). (Cutting-edge reasoning models like Gemini 2.5 Pro haven't been evaluated due to their cost and might outperform it.)

jbentley1 5 days ago

https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o...

IMO this is the best long-context benchmark. Hopefully they will run it on the new models soon. Needle-in-a-haystack is useless at this point: Llama 4 had perfect needle-in-a-haystack results but horrible real-world performance.

dr_kiszonka 5 days ago

As much as I enjoy Gemini models, I have to agree with you. At some point, interactions with them start resembling talking to people with short-term memory issues, and answers become increasingly unreliable. Now, there are also reports of AI Studio glitching out and not loading these longer conversations.

Is there a reliable method for pruning, summarizing, or otherwise compressing context to overcome such issues?

consumer451 5 days ago

This is a paper that echoes your experience, in general. I really wish that when papers like this one get published, someone would take the methodology and keep running it against every new model:

> For instance, the NoLiMa benchmark revealed that models like GPT-4o experienced a significant drop from a 99.3% performance rate at 1,000 tokens to 69.7% at 32,000 tokens. Similarly, Llama 3.3 70B's effectiveness decreased from 97.3% at 1,000 tokens to 42.7% at 32,000 tokens, highlighting the challenges LLMs face with longer contexts.

https://arxiv.org/abs/2502.05167

gymbeaux 5 days ago

I’m not optimistic. It’s the Wild West and comparing models for one’s specific use case is difficult, essentially impossible at scale.