noodletheworld 6 days ago

> that information applies to all interactions made via that session

Hmm... maybe you should run a llama.cpp server in debug mode and review the content that goes to the actual LLM; you can do that with the verbose flag, or with `OLLAMA_DEBUG=1` if you use Ollama.

What you are describing is not how it works.

There is no such thing as an LLM 'session'.

That is a higher-level abstraction that sits on top of the API: it just means some server is caching part of your prompt and combining it with whatever fragment you typed in the UI, server side, before feeding the result to the LLM.
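As a rough sketch of that abstraction (the shapes here mimic an OpenAI-style chat payload, but the names are illustrative, not any particular library's API): a "session" is just a plain list of messages that gets replayed, in full, on every request.

```python
# The client (or a server-side proxy) keeps the conversation as a list
# and resends ALL of it on every request; the model itself keeps nothing.
history = []

def build_payload(user_input, tools):
    history.append({"role": "user", "content": user_input})
    return {
        "model": "some-model",
        "messages": list(history),  # entire conversation, every time
        "tools": tools,             # every tool definition, every time
    }

payload1 = build_payload("first question", tools=[])
payload2 = build_payload("follow-up", tools=[])
# payload2 now contains both messages, even though the user only typed one.
```

That's the whole trick: the illusion of a session lives entirely in whoever rebuilds that list.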

It makes no difference how it is implemented technically.

Fundamentally, any request you make that can invoke tools will be transformed, at some point, into a prompt that includes the tool definitions before it is passed to the LLM.

That has a specific, measurable cost on LLM performance as the number of tool definitions grows.

The only solution is to limit the number of tools you have enabled, which is entirely possible and reasonable to do, by the way.

My point is that adding more and more tools doesn't scale and doesn't work.

It only works when you have a few tools.

If you have 50 MCP servers enabled, your requests are probably degraded.

kagevf 6 days ago

> There is no such thing as an LLM 'session'.

This matches my understanding too, at least for how it works with OpenAI. To me, that would explain why there's a 20- or 30-question limit per conversation: the context that has to be sent with each request necessarily grows larger and larger.