Some of the other issues matter less than others, but even if you accept “you have to take responsibility for yourself”, let me quote the article:
> As mentioned in my multi-agent systems post, LLM-reliability often negatively correlates with the amount of instructional context it’s provided. This is in stark contrast to most users, who (maybe deceived by AI hype marketing) believe that the answer to most of their problems will be solved by providing more data and integrations. I expect that as the servers get bigger (i.e. more tools) and users integrate more of them, an assistants performance will degrade all while increasing the cost of every single request. Applications may force the user to pick some subset of the total set of integrated tools to get around this.
I will rephrase it in stronger terms.
MCP does not scale.
It cannot scale beyond a certain threshold.
It is impossible to add an unlimited number of tools to your agent's context without negatively impacting the capability of your agent.
This is a fundamental limitation with the entire concept of MCP and needs addressing far more than auth problems, imo.
You will see posts like “MCP used to be good but now…” as people experience the effects of having many MCP servers enabled.
They interfere with each other.
This is utterly different from installing a package in a normal package system, where non-interference is a fundamental property of package management.
That's the problem with MCP.
As an idea, it is different from what people naturally expect of it.
I think this can largely be solved with good UI. For example, if an MCP server or tool gets executed that you didn't want executed, the UI should provide an easy way to turn it off, or to edit that tool's description to make it clearer when the agent should and should not use it.
Also, in my experience, there is a huge bump in performance and real-world usability as the context grows. So I definitely don't agree about a negative correlation there; however, in some use cases, and with the wrong context, it certainly can be true.
I don't think that would be sufficient to solve the problem.
I'm using Gemini with AI Studio and the size of a 1 million token context window is becoming apparent to me. I have a large conversation, multiple paragraphs of text on each side of the conversation, with only 100k tokens or so. Just scrolling through that conversation is a chore where it becomes easier just to ask the LLM what we were talking about earlier rather than try to find it myself.
So if I have several tools, each of them adding 10k+ tokens of context to a query, and all of them reasonable tool requests, I still can't verify that it isn't something "you [I] didn't want executed", since that is a vague description of the failure states of tools. I'm not going to read the equivalent of a novel for each and every request.
I say this mostly because I think some level of inspectability would be useful for these larger requests. It just becomes impractical at larger and larger context sizes.
> For example, if an MCP server or tool gets executed that you didn't want executed, the UI should provide an easy way to turn it off, or to edit that tool's description to make it clearer when the agent should and should not use it.
Might this become more simply implemented as multiple individual calls, possibly even to different AI services, chained together with regular application software?
I don't understand your question at all.
If you are asking why have autonomous agents at all rather than just workflows, then obviously the answer is that it depends on the use case. Most of the time, workflows that are not autonomous are much better, but not always, and sometimes those workflows will also include autonomous parts.
Simple: if the choice is getting overwhelming for the LLM, then... divide and conquer - add a tool for choosing tools! It can be as simple as another LLM call, with a prompt (ugh, an "agent") tasked strictly with selecting the subset of available tools that seems most useful for the task at hand and returning it to the "parent"/"main" "agent".
You kept adding more tools and now the tool-master "agent" is overwhelmed by the amount of choice? Simple! Add more "agents" to organize the tools into categories; you can do that up front and stuff the categorization into a database and now it's a rag. Er, RAG module to select tools.
There are so many ways to do it. Using cheaper models for selection to reduce costs, dynamic classification, prioritizing tools already successfully applied in previous chat rounds (and more "agents" to evaluate if a tool application was successful)...
Point being: just keep adding extra layers of indirection, and you'll be fine.
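To sketch what that "tool for choosing tools" might look like (a minimal illustration, not anyone's actual implementation; `call_llm` and the tool catalogue below are hypothetical placeholders):

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to some cheap model, return its text reply."""
    raise NotImplementedError("wire this up to whichever provider/SDK you use")

# The full catalogue of tools the client knows about (names + descriptions only).
ALL_TOOLS = {
    "search_issues": "Search the bug tracker for matching issues.",
    "send_email": "Send an email to a contact.",
    "query_db": "Run a read-only SQL query against the analytics database.",
    # ...dozens more across all enabled MCP servers...
}

def select_tools(task: str, max_tools: int = 5) -> list[str]:
    """Ask a cheap 'tool-master' model which tools look relevant, so the main
    agent's prompt only carries a small subset of the definitions."""
    catalogue = "\n".join(f"- {name}: {desc}" for name, desc in ALL_TOOLS.items())
    reply = call_llm(
        "You select tools for another agent.\n"
        f"Task: {task}\n"
        f"Available tools:\n{catalogue}\n"
        f"Return a JSON array of at most {max_tools} tool names and nothing else."
    )
    chosen = json.loads(reply)
    return [name for name in chosen if name in ALL_TOOLS]
```

The "parent" agent then only gets the selected definitions in its prompt, which is exactly the extra layer of indirection described above.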
The problem is that even just having the tools in the context can greatly change the output of the model. So there can be utility in the agent seeing contextually relevant tools (RAG as you mentioned, etc. is better than nothing) and a negative utility in hiding all of them behind a "get_tools" request.
"Sequential thinking" is one that I tried recently because so many people recommend it, and I have never, ever, seen the chatbot actually do anything but write to it. It never follows up any of it's chains of thoughts or refers to it's notes.
In which client and with which LLM are you using it?
I use it in Claude Desktop for the right use case; it's much better than thinking mode.
But, I admit, I haven't tried it in Cursor or with other LLMs yet.
> It is impossible to add an unlimited number of tools to your agent's context without negatively impacting the capability of your agent.
Huh?
MCP servers aren't just for agents; they're for any/all _clients_ that can speak MCP. And capabilities provided by a given MCP server are on-demand: they only incur a cost to the client, and only impact the user context, if/when they're invoked.
> they only incur a cost to the client, and only impact the user context, if/when they're invoked.
Look it up. Look up the cross server injection examples.
I guarantee you this is not true.
An MCP server is, at its heart, some 'thing' that provides a set of 'tools' that an LLM can invoke.
This is done by adding a 'tool definition'.
A 'tool definition' is content that goes into the LLM prompt.
That's how it works. How do you imagine an LLM can decide to use a tool? It's only possible if the tool definition is in the prompt.
The API may hide this, but I guarantee you this is how it works.
Putting an arbitrary amount of third-party content into your prompts has a direct, tangible impact on LLM performance (and cost). The more MCP servers you enable, the more you pollute your prompt with tool definitions, and, I assure you, the worse the results get.
Just like pouring any large amount of unrelated crap into your system prompt does.
At a small scale, it's ok; but as you scale up, the LLM performance goes down.
Here's some background reading for you:
https://github.com/invariantlabs-ai/mcp-injection-experiment...
https://docs.anthropic.com/en/docs/build-with-claude/tool-us...
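To make the mechanics concrete, here is roughly what the Anthropic tool-use docs linked above describe: the tool definitions ride along in the request body next to the messages, so they consume prompt tokens on every call. A hedged sketch; the weather tool is made up, but the `tools` parameter is the real one:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Each enabled tool (including anything an MCP client exposes) gets serialized
# into a definition like this and sent along with the request.
tools = [
    {
        "name": "get_weather",  # hypothetical example tool
        "description": "Get the current weather for a given city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
    # ...one entry per enabled tool, across every enabled MCP server...
]

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    tools=tools,  # the definitions are prompt-side payload, i.e. context tokens
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
)
```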
I think the makers of LLM “chat bots” like Claude Desktop or Cursor have a ways to go when it comes to exposing precisely what the LLM is being prompted with.
Because yes, for the LLM to find the MCP servers, it needs that info in its prompt. And the software is currently hiding how that information is being exposed. Is it prepended to your own message? Does it put it at the start of the entire context? If yes, wouldn’t real-time changes in tool availability invalidate the entire context? So then does it add it to the end of the context window instead?
Like nobody really has this dialed in completely. Somebody needs to make an LLM “front end” that shows the raw de-tokenized input and output. Don’t even attempt to structure it. Give me the input blob and the output blob.
… I dunno. I wish these tools had ways to do more precise context editing. And more visibility. It would help make more informed choices on what to prompt the model with.
/Ramble mode off.
But slightly more seriously: what is the token cost for an MCP tool? The LLM needs its name, a description, parameters… so maybe like 100 tokens max per tool? It’s not a lot, but it isn’t nothing either.
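You can get a ballpark answer by tokenizing a representative definition yourself. A quick sketch using `tiktoken` (an OpenAI tokenizer, so only an approximation for other vendors' models; the tool definition is a made-up example):

```python
import json
import tiktoken

# A hypothetical, fairly small tool definition.
tool_def = {
    "name": "get_weather",
    "description": "Get the current weather for a given city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}

enc = tiktoken.get_encoding("cl100k_base")
tokens = len(enc.encode(json.dumps(tool_def)))
print(f"~{tokens} tokens for this one tool definition")
```

Multiply that by every tool on every enabled server (plus whatever wrapper text the client adds around the definitions) and the overhead adds up.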
I've recently written a custom MCP server.
> An MCP server is, at its heart, some 'thing' that provides a set of 'tools' that an LLM can invoke.
A "tool" is one of several capabilities that a MCP server can provide to its callers. Other capabilities include "prompt" and "resource".
> This is done by adding a 'tool definition'. A 'tool definition' is content that goes into the LLM prompt. That's how it works. How do you imagine an LLM can decide to use a tool? It's only possible if the tool definition is in the prompt.
I think you're using an expansive definition of "prompt" that includes not just the input text as provided by the user -- which is generally what most people understand "prompt" to mean -- but also all available user- and client-specific metadata. That's fine, just want to make it explicit.
With this framing, I agree with you, that every MCP server added to a client -- whether that's Claude.app, or some MyAgent, or whatever -- adds some amount of overhead to that client. But that overhead is gonna be fixed-cost, and paid one-time at e.g. session initialization, not every time per e.g. request/response. So I'm struggling to imagine a situation where those costs are anything other than statistical line noise, compared to the costs of actually processing user requests.
> https://docs.anthropic.com/en/docs/build-with-claude/tool-us...
To be clear, this concept of "tool" is completely unrelated to MCP.
> https://github.com/invariantlabs-ai/mcp-injection-experiment...
I don't really understand this repo or its criticisms. The authors wrote a related blog post https://invariantlabs.ai/blog/whatsapp-mcp-exploited which says (among other things) that
> In this blog post, we will demonstrate how an untrusted MCP server ...
But there is no such thing as "an untrusted MCP server". Every MCP server is assumed to be trusted, at least as the protocol is defined today.
> But that overhead is gonna be fixed-cost, and paid one-time at e.g. session initialization, not every time per e.g. request/response.
I don't work for a foundational model provider, but how do you think the tool definitions get into the LLM? I mean, they aren't fine-tuning a model with your specific tool definitions, right? You're just using OpenAI's base model (or Claude, Gemini, etc.). So at some point the tool definitions have to be added to the prompt. It is just getting added to the prompt auto-magically by the foundation provider. That means it is eating up some of the context window, just a portion of the context window that is normally reserved for the provider, a section of the final prompt that you don't get to see (or alter).
Again, while I don't work for these companies or implement these features, I cannot fathom how the feature could work unless it was added to every request. And so the original point of the thread author stands.
You're totally right, in that: whatever MCP servers your client is configured to know about, have a set of capabilities, each of which have some kind of definition, all of which need to be provided to the LLM, somehow, in order to be usable.
And you're totally right that the LLM is usually general-purpose, so the MCP details aren't trained or baked-in, and need to be provided by the client. And those details probably gonna eat up some tokens for sure. But they don't necessarily need to be included with every request!
Interactions with LLMs aren't stateless request/response; they're session-based. And you generally send over metadata like what we're discussing here, or user-defined preferences/memory, etc., as part of session initialization. This stuff isn't really part of a "prompt", at least as that concept is commonly understood.
I think we are confusing the word "prompt" here leading to miscommunication.
There is the prompt that I, as a user, send to OpenAI, which then gets used. Then there is the "prompt" which is actually sent to the LLM. I don't know how these things are talked about internally at the company, but they take the "prompt" you send them and add a bunch of extra stuff to it. For example, they add in their own system message and they will add your system message. So you end up with something like <OpenAI system message> + <User system message> + <user prompt>. That creates a "final prompt" that gets sent to the LLM. I'm sure we both agree on that.
With MCP, we are also adding in <tool description> to that final prompt. Again, it seems we are agreed on that.
So the final piece of the argument is that the "final prompt" (or whatever the correct term is) keeps growing. It is the size of the provider system prompt, plus the size of the user system prompt, plus the size of the tool descriptions, plus the size of the actual user prompt. You have to pay that "final prompt" cost for each and every request you make.
If the size of the "final prompt" affects the performance of the LLM, such that very large "final prompt" sizes adversely affect performance, then it stands to reason that adding many tool definitions to a request will eventually degrade the LLM's performance.
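Back-of-the-envelope, that per-request cost looks something like this (every number below is a made-up placeholder; the only point is that the tool term scales with the number of enabled servers and is paid on each request):

```python
# Illustrative token budgets for a single request.
PROVIDER_SYSTEM_PROMPT = 2_000   # the provider's own hidden instructions
USER_SYSTEM_PROMPT     = 500     # your system message / custom instructions
CONVERSATION_HISTORY   = 4_000   # prior turns resent with the request
USER_MESSAGE           = 200     # what you actually typed
TOKENS_PER_TOOL_DEF    = 100     # name + description + JSON schema

def final_prompt_tokens(num_tools: int) -> int:
    """Prompt-side tokens paid on every request, not once per session."""
    return (PROVIDER_SYSTEM_PROMPT
            + USER_SYSTEM_PROMPT
            + CONVERSATION_HISTORY
            + USER_MESSAGE
            + num_tools * TOKENS_PER_TOOL_DEF)

for n in (5, 50, 500):
    print(n, "tools ->", final_prompt_tokens(n), "prompt tokens per request")
```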
> With MCP, we are also adding in <tool description> to that final prompt. Again, it seems we are agreed on that.
Interactions with an LLM are session-based: when you create a session, some information is sent over _once_ as part of that session's construction, and that information applies to all interactions made via that session. That initial data includes contextual information, like user preferences, model configuration as specified by your client, and MCP server definitions. When you type some stuff and hit enter, that is a user prompt that may get hydrated with some additional stuff before it gets sent out, but it doesn't include any of that initial data provided at the start of the session.
> that information applies to all interactions made via that session
Hmm… maybe you should run a llama.cpp server in debug mode and review the content that actually goes to the LLM; you can do that with the verbose flag, or with `OLLAMA_DEBUG=1` if you use Ollama.
What you are describing is not how it works.
There is no such thing as an LLM 'session'.
That is a higher-level abstraction that sits on top of the API; it just means some server is caching part of your prompt, taking the fragment you typed in the UI, and combining them on the server side before feeding them to the LLM.
It makes no difference how it is implemented technically.
Fundamentally, any request you make which can invoke tools will be transformed, at some point, into a prompt that includes the tool definitions before it is passed to the LLM.
That has a specific, measurable cost on LLM performance as the number of tool definitions goes up.
The only solution to that is to limit the number of tools you have enabled, which is entirely possible and reasonable to do, by the way.
My point is that adding more and more and more tools doesn't scale and doesn't work.
It only works when you have a few tools.
If you have 50 MCP servers enabled, your requests are probably degraded.
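If you want to see it for yourself, hit a local llama.cpp server's OpenAI-compatible endpoint directly: every turn resends the entire message history (and any tool definitions you want usable that turn) in the request body. A sketch assuming a default `llama-server` listening on port 8080:

```python
import requests

URL = "http://localhost:8080/v1/chat/completions"  # assumes a local llama-server

history = [{"role": "user", "content": "What were we talking about earlier?"}]

# Turn 1: the request body carries everything the model will "know" this turn.
r1 = requests.post(URL, json={"messages": history}).json()
history.append(r1["choices"][0]["message"])

# Turn 2: there is no server-side session to lean on -- the whole history goes
# over the wire again, and so would the tool definitions, on every single turn.
history.append({"role": "user", "content": "Summarize that in one sentence."})
r2 = requests.post(URL, json={"messages": history}).json()
print(r2["choices"][0]["message"]["content"])
```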
> There is no such thing as an LLM 'session'.
This matches my understanding too, at least for how it works with OpenAI. To me, that would explain why there's a 20- or 30-question limit for a conversation, because the necessary context that needs to be sent with each request would necessarily grow larger and larger.