Some of the thinking models might recover... with an extra 4k tokens used up in <thinking>. And even if they were stable at long contexts, the speed drops massively. You just can't win with this architecture lol.
That matches what we have found very closely. <thinking> models do a lot better, but with huge speed drops. For now, we have chosen accuracy over speed. But the speed drop is like 3-4x, so we might move to an architecture where we 'think' only sporadically.
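To make the "think only sporadically" idea concrete, here's a rough sketch of what such a router could look like (purely hypothetical names, heuristics, and thresholds, not our actual setup): only pay the extra thinking-token cost when a cheap difficulty check says it's worth it, plus a small random sample to keep measuring the accuracy gap.

```python
import random

THINKING_BUDGET_TOKENS = 4096   # hypothetical extra budget burned in <thinking>
HARD_QUERY_KEYWORDS = {"prove", "derive", "multi-step", "debug"}  # toy heuristic

def looks_hard(prompt: str) -> bool:
    """Cheap heuristic: long prompts or 'hard' keywords get the thinking path."""
    words = prompt.lower().split()
    return len(words) > 300 or any(k in prompt.lower() for k in HARD_QUERY_KEYWORDS)

def route(prompt: str, sample_rate: float = 0.1) -> dict:
    """Enable extended thinking only sporadically: when the query looks hard,
    or for a small random sample of requests to track the accuracy trade-off."""
    use_thinking = looks_hard(prompt) or random.random() < sample_rate
    return {
        "prompt": prompt,
        "max_thinking_tokens": THINKING_BUDGET_TOKENS if use_thinking else 0,
    }

# Most easy queries skip the 3-4x slowdown; hard ones pay for accuracy.
print(route("What's the capital of France?"))
print(route("Prove that the algorithm terminates and derive its complexity."))
```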
Everything happening in the LLM space is so close to how humans think naturally.