simonw 5 days ago

I've been following prompt injection for 2.5 years and until last week I hadn't seen any convincing mitigations for it - the proposed solutions were almost all optimistic versions of "if we train a good enough model it won't get tricked any more", which doesn't work.

What changed is the new CaMeL paper from DeepMind, which notably does not rely on AI models to detect attacks: https://arxiv.org/abs/2503.18813

I wrote my own notes on that paper here: https://simonwillison.net/2025/Apr/11/camel/
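The core trick is splitting control flow from data flow: a privileged LLM only ever sees the trusted user request and writes a plan as code, while a quarantined LLM parses untrusted content (emails, web pages) and its output is treated strictly as data, never as instructions. Very roughly, and with entirely made-up function names (this is not the paper's actual API):

    # Hypothetical sketch of CaMeL's privileged / quarantined LLM split.

    def privileged_llm(user_request: str) -> str:
        # Sees ONLY the trusted user request; emits a small program (the plan).
        # Stubbed out here - in CaMeL this is the planner model.
        return 'addr = q("latest email", "sender address"); send_email(addr, "thanks")'

    def quarantined_llm(untrusted_text: str, query: str) -> str:
        # Parses untrusted content into plain data. It has no tool access, so an
        # injection in untrusted_text can at worst corrupt this value - it can
        # never change which tools the plan calls.
        return "alice@example.com"  # stubbed extraction result

    plan = privileged_llm("Reply 'thanks' to the sender of the latest email")
    # A custom interpreter executes `plan`; untrusted data only ever enters it
    # through quarantined_llm return values, never as code.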

nrvn 4 days ago

I can't "shake off" the feeling that this whole MCP/LLM thing is moving in the wrong if not the opposite direction. Up until recently we have been dealing with (or striving to build) deterministic systems in the sense that the output of such systems is expected to be the same given the same input. LLMs with all respect to them behave on a completely opposite premise. There is zero guarantee a given LLM will respond with the same output to the same exact "prompt". Which is OK because that's how natural human languages work and LLMs are perfectly trained to mimic human language.

But now we have to contain all the relevant emerging threats by teaching an LLM to translate user queries from natural language into some intermediate, structured yet non-deterministically generated representation (a subset of Python in CaMeL's case), and then validate the generated code against pre-defined policies using conventional, deterministic machinery (the CaMeL interpreter). Which is fine on paper, but every new component (Q-LLM, interpreter, policies, policy engine) will bring its own bouquet of threat vectors to be assessed and addressed.
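To make concrete what that policy layer amounts to: roughly this kind of check sitting between the generated code and every tool call (the names and the policy below are mine, not CaMeL's actual interface):

    # Hypothetical illustration of interpreter-side policy enforcement.
    from dataclasses import dataclass, field

    @dataclass
    class Value:
        data: object
        sources: set = field(default_factory=set)  # provenance, e.g. {"user"} or {"untrusted_email"}

    def allow_send_email(recipient: Value, body: Value) -> bool:
        # Example policy: only email addresses the user typed themselves,
        # never ones that came out of untrusted tool output.
        return recipient.sources == {"user"}

    POLICIES = {"send_email": allow_send_email}

    def call_tool(name, args):
        check = POLICIES.get(name)
        if check and not check(*args):
            raise PermissionError(f"policy blocked tool call: {name}")
        # ...otherwise dispatch to the real tool...

    try:
        call_tool("send_email", [Value("bob@example.com", {"untrusted_email"}),
                                 Value("thanks!", {"user"})])
    except PermissionError as e:
        print(e)  # policy blocked tool call: send_email

The check itself is simple and deterministic, which is the appeal; but the policies, the provenance tracking and the interpreter are exactly the new components I mean.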

The idea of some "magic" system translating a natural language query into a series of commands is nice. But this is one of those moments where, I'm afraid, I would prefer a "faster horse", especially for the likes of sending emails and organizing my music collection...