omneity 5 days ago

I'm not sure how accurately informed this opinion is. It is well known that nobody trains directly on 1M-token-long content. It wouldn't work anyway: the dependencies span too far apart and you end up with vanishing gradients.

RoPE (Rotary Positional Embeddings; think modulo or periodic arithmetic) scaling is key: the model is trained on 16k-token-long content and then scaled up to 100k+ [0]. Qwen 1M (which has near-perfect recall over the complete window [1]) and Llama 4 10M pushed the limits of this technique, with Qwen training reliably with a much higher RoPE base, and Llama 4 introducing iRoPE, which claims scaling to extremely long, effectively unbounded contexts. A minimal sketch of what base scaling looks like is below the links.

[0]: https://arxiv.org/html/2310.05209v2

[1]: https://qwenlm.github.io/blog/qwen2.5-turbo/#passkey-retriev...
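
To illustrate the base-scaling idea, here is a minimal numpy sketch of RoPE plus one common scaling recipe (the "NTK-aware" base adjustment). The head dim, window sizes and exact formula here are my own illustrative assumptions, not necessarily what Qwen or anyone else actually ships:

    import numpy as np

    def apply_rope(x, positions, base):
        """Rotate consecutive feature pairs of x (seq, head_dim) by position-dependent angles."""
        head_dim = x.shape[1]
        inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)  # (head_dim/2,) rotation frequencies
        angles = np.outer(positions, inv_freq)                      # (seq, head_dim/2)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, 0::2], x[:, 1::2]
        out = np.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin
        out[:, 1::2] = x1 * sin + x2 * cos
        return out

    head_dim, base = 64, 10_000.0          # illustrative dimensions only
    train_len, target_len = 16_384, 100_000
    scale = target_len / train_len

    # NTK-aware base scaling: raise the base so that the slowest-rotating pair at the
    # new maximum position reaches the same angle it reached at the old maximum.
    scaled_base = base * scale ** (head_dim / (head_dim - 2))

    def slowest_angle(b, pos):
        # angle of the lowest-frequency feature pair at position `pos`
        return pos * b ** (-(head_dim - 2) / head_dim)

    print(f"slow pair at 16k,  base 10k:    {slowest_angle(base, train_len):5.2f} rad")         # seen in training
    print(f"slow pair at 100k, base 10k:    {slowest_angle(base, target_len):5.2f} rad")        # far outside it
    print(f"slow pair at 100k, scaled base: {slowest_angle(scaled_base, target_len):5.2f} rad") # back in range

    # Rotating a few query vectors placed near position 100k with the scaled base:
    q = np.random.default_rng(0).standard_normal((4, head_dim))
    q_rot = apply_rope(q, np.arange(target_len - 4, target_len), scaled_base)

The point is that position 100k with the scaled base lands on rotation angles the model already encountered around position 16k during training, which is why the extension works without 100k-token training sequences (Qwen's and Llama 4's exact recipes differ in the details).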

christianqchung 5 days ago

But Llama 4 Scout does badly on long-context benchmarks despite the claimed 10M window. It scores one slot above Llama 3.1 8B in this one [1].

[1] https://github.com/adobe-research/NoLiMa

omneity 5 days ago

Indeed, but that doesn't change the fact that long context is achieved not by training on long content but by scaling up from short content.

kmeisthax 5 days ago

Is there any evidence that GPT-4.1 is using RoPE to scale context?

Also, I don't know about Qwen, but I know Llama 4 has severe performance issues, so I wouldn't use that as an example.

omneity 5 days ago

I am not sure about public evidence. But the memory requirements alone for training on 1M-token windows would make it a very unrealistic proposition compared to RoPE scaling. And as I mentioned, RoPE is essential for long context anyway; you can't train it in the "normal way". Please see the paper I linked previously for more context (pun not intended) on RoPE.
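
As a rough back-of-envelope on the memory point (illustrative dimensions only: an assumed 80-layer model with 8 KV heads of size 128 in bf16, not any particular vendor's config):

    seq_short, seq_long = 16_384, 1_000_000
    layers, kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2  # bf16

    # Keys and values that must be kept around for a single 1M-token sequence:
    kv_bytes = seq_long * layers * 2 * kv_heads * head_dim * bytes_per_value
    print(f"KV per 1M-token sequence: ~{kv_bytes / 1e9:.0f} GB")                     # ~328 GB

    # Self-attention cost grows quadratically with sequence length:
    print(f"Attention cost ratio, 1M vs 16k: ~{(seq_long / seq_short) ** 2:,.0f}x")  # ~3,725x

And that is before the activations, gradients and optimizer state that training needs on top of inference-style memory.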

Re: Llama 4, please see the sibling comment.