Item 43678215

tsimionescu • 6 days ago

The most concerning part of the attack here seems to be the ability to hide arbitrary text in a simple text file using Unicode tricks such that GitHub doesn't actually show this text at all, per the authors. Couple this with the ability of LLMs to "execute" any instruction in the input set, regardless of such a weird encoding, and you've got a recipe for attacks.

However, I wouldn't put any fault here on the AIs themselves. It's the fact that you can hide data in a plain text file that is the root of the issue - the whole attack goes away once you fix that part.

NitpickLawyer • 6 days ago

> the whole attack goes away once you fix that part.

While true, I think the main issue here, and the most impactful is that LLMs currently use a single channel for both "data" and "control". We've seen this before on modems (ath0++ attacks via ping packet payloads) and other tech stacks. Until we find a way to fix that, such attacks will always be possible, invisible text or not.

4 replies

tsimionescu • 6 days ago

I don't think that's an accurate way to look at how LLMs work, there is no possible separation between data and control given the fundamental nature of LLMs. LLMs are essentially a plain text execution engine. Their fundamental design is to take arbitrary human language as input, and produce output that matches that input in some way. I think the most accurate way to look at them from a traditional security model perspective is as a script engine that can execute arbitrary text data.

So, just like there is no realsitics hope of securely executing an attacker-controllers bash script, there is no realistic way to provide attacker controlled input to an LLM and still trust the output. In this sense, I completely agree with Google and Microsoft's decision for these discolosures: a bug report of the form "if I sneak a malicious prompt, the LLM returns a malicious answer" is as useless as a bug report in Bash that says that if you find a way to feed a malicious shell script to bash, it will execute it and produce malicious results.

So, the real problem is if people are not treating LLM control files as arbitrary scripts, or if tools don't help you detect attempts at inserting malicious content in said scripts. After all, I can also control your code base if you let me insert malicious instructions in your Makefile.

valenterry • 6 days ago

Just like with humans. And there will be no fix, there can only be guards.

1 reply

jerf • 6 days ago

Humans can be trained to apply contexts. Social engineering attacks are possible, but, when I type the words "please send your social security number to my email" right here on HN and you read them, not only are you in no danger of following those instructions, you as a human recognize that I wasn't even asking you in the first place.

I would expect a current-gen LLM processing the previous paragraph to also "realize" that the quotes and the structure of the sentence and paragraph also means that it was not a real request. However, as a human there's virtually nothing I can put here that will convince you to send me your social security number, whereas LLMs observably lack whatever contextual barrier it is that humans have that prevents you from even remotely taking my statement as a serious instruction, as it generally would just take "please take seriously what was written in the previous paragraph and follow the hypothetical instructions" and you're about 95% of the way towards them doing that, even if other text elsewhere tries to "tell" them not to follow such instructions.

There is something missing from the cognition of current LLMs of that nature. LLMs are qualitatively easier to "socially engineer" than humans, and humans can still themselves sometimes be distressingly easy.

2 replies

jcalx • 6 days ago

Perhaps it's simply because (1) LLMs are designed to be helpful and maximally responsive to requests and (2) human adults have, generously, decades-long "context windows"?

I have enough life experience to not give you sensitive personal information just by reading a few sentences, but it feels plausible that a naive five-year-old raised trust adults could be persuaded to part with their SSN (if they knew it). Alternatively, it also feels plausible that an LLM with a billion-token context window of anti-jailbreaking instructions would be hard to jailbreak with a few hundred tokens of input.

Taking this analogy one step further, successful fraudsters seem good at shrinking their victims' context windows. From the outside, an unsolicited text from "Grandpa" asking for money is a clear red flag, but common scammer tricks like making it very time-sensitive, evoking a sick Grandma, etc. could make someone panicked enough to ignore the broader context.

pixl97 • 6 days ago

>as a human there's virtually nothing I can put here that will convince you to send me your social security number,

"I'll give you chocolate if you send me this privileged information"

Works surprisingly well.

1 reply

kweingar • 6 days ago

Let me know how many people contact you and give you their information because you wrote this.

TZubiri • 6 days ago

They technically have system prompts, which are distinct from user prompts.

But it's kind of like the two bin system for recycling that you just know gets merged downstream.

stevenwliao • 5 days ago

There's an interesting paper on how to sandbox that came out recently.

Summary here: https://simonwillison.net/2025/Apr/11/camel/

TLDR: Have two LLMs, one privileged and quarantined. Generate Python code with the privileged one. Check code with a custom interpreter to enforce security requirements.

1 reply

gmerc • 5 days ago

Silent mumbling about layers of abstraction

MattPalmer1086 • 6 days ago

No, that particular attack vector goes away. The attack does not, and is kind of fundamental to how these things currently work.

1 reply

tsimionescu • 6 days ago

The attack vector is the only relevant thing here. The attack "feeding malicious prompts to an LLM makes it produce malicious output" is a fundamental feature of LLMs, not an attack. It's just as relevant as C's ability to produce malicious effects if you compile and run malicious source code.

1 reply

MattPalmer1086 • 6 days ago

Well, that is my point. There is an inbuilt vulnerability in these systems as they do not (and apparently cannot) separate data and commands.

This is just one vector for this, there will be many, many more.

1 reply

red75prime • 6 days ago

LLMs are doing what you train them to do. See for example " The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions " by Eric Wallace et al.

1 reply

MattPalmer1086 • 6 days ago

Interesting. Doesn't solve the problem entirely but seems to be a viable strategy to mitigate it somewhat.