Workaccount2 6 days ago

Lately I have been experimenting with ways to work around the "context token sweet spot" of <20k tokens (or <50k with 2.5), essentially doing manual "context compression". The LLM works with a database to store things permanently according to a strict schema, summarizes its current context when it starts to drift out of the sweet spot (I'm torn on whether it's best to do this continuously like a journal, or in retrospect like a closing summary), and then passes that summary to a new instance with a fresh context.
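As a rough sketch of what I mean (names like `step`, `compress`, and the ~4-chars-per-token estimate are my own illustrative assumptions, not a real API):

```python
TOKEN_BUDGET = 20_000  # the sub-20k "sweet spot"

def estimate_tokens(messages):
    # Crude heuristic: roughly 4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def compress(messages, llm_complete):
    # Ask the model for a closing summary, then seed a fresh
    # instance with only that summary.
    messages = messages + [{"role": "user",
                            "content": "Summarize all progress and open questions."}]
    summary = llm_complete(messages)
    return [{"role": "user",
             "content": "Continue the task. Summary so far:\n" + summary}]

def step(messages, llm_complete):
    # One agent turn; hand off to a fresh context when over budget.
    reply = llm_complete(messages)
    messages = messages + [{"role": "assistant", "content": reply}]
    if estimate_tokens(messages) > TOKEN_BUDGET:
        messages = compress(messages, llm_complete)
    return messages
```

The point is that `llm_complete` can be any backend; the orchestration loop, not the model, owns the context window.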

This works really well with thinking models: the thinking eats up tons of context, but it also produces very good "summary documents", so you can reap the rewards of thinking without sacrificing that juicy sub-50k context. The database also provides a fallback, or RAG I suppose, for situations where the summary leaves out important details, though the model must recognize this and go pull the missing context from the DB.
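The fallback is just a keyed lookup the model can invoke instead of guessing. A minimal sketch (the `facts` table and key naming are assumptions for illustration):

```python
import sqlite3

# In-memory stand-in for the permanent store the summaries are backed by.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE facts (key TEXT PRIMARY KEY, value TEXT)")
db.execute("INSERT INTO facts VALUES ('part_7741_supplier', 'Acme Fasteners')")

def recall(key):
    # Tool the model calls when a detail is missing from its summary.
    row = db.execute("SELECT value FROM facts WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None
```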

Right now I'm using it to build an inventory management/BOM optimization agent over a database of ~10k distinct parts/materials.
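For flavor, a toy version of that kind of database with a single-level cost rollup (schema, part names, and prices are all made up for the example):

```python
import sqlite3

# Miniature parts/BOM store; the real one holds ~10k distinct parts.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE parts (part_id TEXT PRIMARY KEY, unit_cost REAL, on_hand INTEGER);
CREATE TABLE bom   (assembly TEXT, component TEXT, qty INTEGER);
""")
db.executemany("INSERT INTO parts VALUES (?, ?, ?)",
               [("WIDGET", 0.0, 0), ("BOLT", 0.10, 500), ("PLATE", 2.50, 40)])
db.executemany("INSERT INTO bom VALUES (?, ?, ?)",
               [("WIDGET", "BOLT", 4), ("WIDGET", "PLATE", 1)])

def rolled_up_cost(assembly):
    # Single-level rollup: sum of component unit cost times quantity.
    row = db.execute("""
        SELECT SUM(p.unit_cost * b.qty)
        FROM bom b JOIN parts p ON p.part_id = b.component
        WHERE b.assembly = ?""", (assembly,)).fetchone()
    return row[0]
```

Queries like this are what the agent runs against the DB instead of carrying the whole catalog in context.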

jasonjmcghee 6 days ago

I am excitedly waiting for the first company (guessing/hoping it'll be Anthropic) to invest heavily in improvements to caching.

The big ones that come to mind are cheap long-term caching, and innovations in compaction and differential approaches - like, is there a way to use only the parts of the cached input context we actually need?

manmal 5 days ago

Isn’t a problem there that the cache is model-specific? The cached items are only valid for exactly the same weights and inference engine, and I think both are heavily iterated on.

simonw 5 days ago

Prompt caches right now only last a few minutes - I believe they involve keeping a bunch of calculations in memory, which is why for Gemini and Anthropic you're charged an initial fee for using the feature (to populate the cache), but then get a discount on prompts that reuse that cache.
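Back-of-envelope, that fee/discount structure pays off quickly with repeated calls. The multipliers and base rate below are illustrative assumptions, not any provider's actual price sheet:

```python
BASE = 3.00 / 1_000_000   # assumed $ per input token
WRITE_PREMIUM = 1.25      # one-time surcharge to populate the cache
READ_DISCOUNT = 0.10      # per-call rate for tokens served from cache

def cost(prefix_tokens, calls):
    # Cost of sending the same prefix `calls` times, with and without caching.
    uncached = BASE * prefix_tokens * calls
    cached = BASE * prefix_tokens * (WRITE_PREMIUM + READ_DISCOUNT * (calls - 1))
    return uncached, cached

uncached, cached = cost(50_000, 10)
```

Caching wins as soon as the discount on repeat calls outweighs the one-time write premium - here after the second call.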