My understanding is that Ollama is more of an "LLM backend": it provides a server process on your machine that answers requests more or less statelessly.
I believe it keeps the model loaded across requests, and it might keep the KV cache warm for ongoing sessions (though I doubt it, given the API shape; I don't see any "session" parameter), but that's about it. Nothing seems to be written to disk.
Features like ChatGPT's "memories" or cross-chat context require a persistence layer, which probably belongs in a "frontend". Ollama's API does support passing the full history in with each request, for example: https://github.com/ollama/ollama/blob/main/docs/api.md#chat-...
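Roughly, such a request looks like this (a minimal sketch in Python against the /api/chat endpoint; the model name is just an example, use whatever you've pulled locally):

    import requests

    # Each request to /api/chat carries the whole conversation so far;
    # the server answers from what it's given and retains nothing.
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama3",  # example model; substitute your own
            "stream": False,    # ask for a single JSON response
            "messages": [
                {"role": "user", "content": "My name is Ada."},
                {"role": "assistant", "content": "Nice to meet you, Ada!"},
                # Only answerable because the history above is passed back in:
                {"role": "user", "content": "What's my name?"},
            ],
        },
    )
    print(resp.json()["message"]["content"])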
Is there more to memory than just an entry into the context/messages array passed to the LLM?
There must be some heavy compression/filtering going on, as there's no chance GPT can hold everybody's entire ChatGPT conversation history in its context.
But practically, I believe Ollama simply has no concept of server-side persistent state at the moment, so it couldn't do such a thing anyway.
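Which means any "memory" feature would have to live client-side. A hypothetical sketch of what a frontend could do on top of Ollama (the file name, storage format, and prompt wording are all invented here):

    import json
    import pathlib

    import requests

    # Hypothetical client-side persistence: Ollama keeps no state, so the
    # frontend saves "memories" to disk and prepends them as a system
    # message on every request. MEMORY_FILE and its format are made up.
    MEMORY_FILE = pathlib.Path("memories.json")

    def load_memories() -> list[str]:
        if MEMORY_FILE.exists():
            return json.loads(MEMORY_FILE.read_text())
        return []

    def remember(fact: str) -> None:
        MEMORY_FILE.write_text(json.dumps(load_memories() + [fact]))

    def chat(user_msg: str, history: list[dict]) -> str:
        system = "Known facts about the user:\n" + "\n".join(
            f"- {m}" for m in load_memories()
        )
        messages = (
            [{"role": "system", "content": system}]
            + history
            + [{"role": "user", "content": user_msg}]
        )
        resp = requests.post(
            "http://localhost:11434/api/chat",
            json={"model": "llama3", "stream": False, "messages": messages},
        )
        return resp.json()["message"]["content"]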
I _think_ the compression used is literally “Chat, compress this array of messages”. This is the technique used in Claude Plays Pokemon.
I’m sure there’s more to the prompt, and to what’s done with the newly generated messages array, but that’s the gist.
If this is the case, an Ollama implementation shouldn’t be too difficult.
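Something like the following, perhaps (the prompt wording and the "summarize, then replace the history" policy are my guesses, not ChatGPT's or Anthropic's actual mechanism):

    import requests

    def compress_history(messages: list[dict], model: str = "llama3") -> list[dict]:
        # Naive "Chat, compress this" step: render the history as text,
        # ask the model for a summary, then replace the whole array with
        # a single system message carrying that summary.
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
        resp = requests.post(
            "http://localhost:11434/api/chat",
            json={
                "model": model,
                "stream": False,
                "messages": [{
                    "role": "user",
                    "content": (
                        "Compress the following conversation into a short summary "
                        "that preserves every fact needed to continue it:\n\n"
                        + transcript
                    ),
                }],
            },
        )
        summary = resp.json()["message"]["content"]
        return [{"role": "system", "content": "Summary of earlier conversation: " + summary}]

A frontend might call this whenever the history grows past some token threshold, then keep appending new messages to the compressed array.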