MattPalmer1086 6 days ago

No, that particular attack vector goes away. The attack does not, and is kind of fundamental to how these things currently work.

1
tsimionescu 6 days ago

The attack vector is the only relevant thing here. The attack "feeding malicious prompts to an LLM makes it produce malicious output" is a fundamental feature of LLMs, not an attack. It's just as relevant as C's ability to produce malicious effects if you compile and run malicious source code.

MattPalmer1086 6 days ago

Well, that is my point. There is an inbuilt vulnerability in these systems as they do not (and apparently cannot) separate data and commands.

This is just one vector for this, there will be many, many more.

red75prime 6 days ago

LLMs are doing what you train them to do. See for example " The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions " by Eric Wallace et al.

MattPalmer1086 6 days ago

Interesting. Doesn't solve the problem entirely but seems to be a viable strategy to mitigate it somewhat.