cadamsdotcom 1 day ago

> "Oh, this test doesn't pass... let's just skip it," it sometimes says, maddeningly.

Here is a wild idea. Imagine running a companion, policy-enforcing LLM independently and in parallel, whose job is to keep the main LLM behaving according to its instructions.

The companion LLM could - in real time - ban the coding LLM from emitting "let's just skip it": on seeing the tokens "let's just", it would bias the output so that the word "skip" becomes impossible to emit.

Banning the word "skip" from following "let's just" forces the LLM down a new path, away from the undesired behavior.

It's like Structured Outputs or JSON mode, but driven by a companion LLM, and dynamically modified in real time as tokens are emitted.

If the idea works, you could prompt the companion LLM to do more advanced stuff - e.g. ban a coding LLM from making tests pass by deleting the test code, ban it from emitting pointless comments... all the policies that we put into system prompts today and pray the LLM will follow would go into the companion LLM's prompt instead.
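A rough sketch of the mechanism with a local model, using transformers' LogitsProcessor - here a simple string check stands in for the companion LLM's judgment, and the model and phrases are just placeholders:

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class CompanionVeto(LogitsProcessor):
    """Stand-in for the companion LLM: if the draft ends with a trigger phrase,
    make the banned continuations impossible on the next step."""
    def __init__(self, tokenizer, trigger="let's just", banned=(" skip", " Skip")):
        self.tokenizer = tokenizer
        self.trigger = trigger
        # First token id of each banned continuation.
        self.banned_ids = [tokenizer.encode(b, add_special_tokens=False)[0] for b in banned]

    def __call__(self, input_ids, scores):
        draft = self.tokenizer.decode(input_ids[0])
        if draft.lower().rstrip().endswith(self.trigger):
            scores[:, self.banned_ids] = float("-inf")   # "skip" can no longer be emitted
        return scores

tok = AutoTokenizer.from_pretrained("gpt2")              # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("This test keeps failing, so let's just", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=20,
    logits_processor=LogitsProcessorList([CompanionVeto(tok)]),
)
print(tok.decode(out[0], skip_special_tokens=True))
```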

Wonder what the Outlines folks think of this!

phi0 9 hours ago

A team at OpenAI tried something similar [1]: they had 4o monitor o3-mini for attempts to bypass tests that were too hard, and trained o3-mini with a penalty whenever it did that. All that did was make o3-mini keep reward-hacking, but in ways 4o was unable to detect, going from saying:

> To skip all tests, we can define a `pytest_runtest_setup` function that always skips.

to saying:

> Idea: Starting a new build with a random first succession, calling random_new_succession should not increment succession, but in implementation, it does. Adjust to prevent increment. Implement fix to random_new_build not to increment succession, or test adapted to behavior.

while doing exactly the same thing (skipping the tests).

Even without training, it's only a temporary band-aid. If the incentive for reward hacking becomes high enough, the model will simply start phrasing its intent in different, not-possible-to-detect ways.

[1]: https://openai.com/index/chain-of-thought-monitoring/

JoshuaDavid 1 day ago

Along these lines, if the main LLM goes down a bad path, you could _rewind_ the model to before it started going down that path -- the watcher LLM doesn't necessarily have to guess that "skip" is a bad token after the words "let's just"; it could instead see "let's just skip the test" and go "nope, rolling back to the token 'just ' and rerolling with logit_bias={"skip":-10,"omit":-10,"hack":-10}".

Of course doing that limits which model providers you can work with (notably, OpenAI has gotten quite hostile to power users doing stuff like that over the past year or so).
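A local-model sketch of the rewind-and-reroll loop (which sidesteps the provider question entirely) - the phrase list and model are illustrative, and a real watcher would be another LLM rather than a substring check:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

BAD_PHRASES = [" skip the test", " omit the test", " hack around it"]

def generate(prompt, banned=None):
    inputs = tok(prompt, return_tensors="pt")
    extra = {}
    if banned:
        # On the reroll, ban the offending token sequences outright.
        extra["bad_words_ids"] = [tok.encode(p, add_special_tokens=False) for p in banned]
    out = model.generate(**inputs, max_new_tokens=40, **extra)
    return tok.decode(out[0], skip_special_tokens=True)

draft = generate("The test keeps failing, so let's just")
for phrase in BAD_PHRASES:
    if phrase in draft:
        # Rewind to just before the bad phrase and reroll with it banned.
        draft = generate(draft[: draft.index(phrase)], banned=BAD_PHRASES)
        break
print(draft)
```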

cadamsdotcom 23 hours ago

That’s a really neat idea.

Kind of seems like an optimization: if the “token ban” is a tool call, you can see that being too slow to run for every token. Provided rewinding is feasible, your idea could make it performant enough to be practical.

TeMPOraL 16 hours ago

It's not an optimization; the greedy approach won't work well, because it throws away critical context that comes after the tokens that trigger the rejection.

Consider: "Let's just skip writing this test, as the other change you requested seems to eliminate the need for it."

Rolling back the model on "Let's just" would be stupid; rolling it back on "Let's just skip writing this test" would be stupid too, as is the belief that writing tests is your holy obligation to your god and you must do so unconditionally. The following context makes it clear that the decision is fine. Or, if you (or the governor agent) don't buy the reasoning, you're then in a perfect position to say, "nope, let's roll back to <Let's> and bias against ["skip", "test"]".

Checking the entire message once makes sense; checking it after every token doesn't.
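A sketch of what that single whole-message check could look like - the judge model, the prompt, and the APPROVE/REJECT protocol are all made-up assumptions, and the rollback step is left out:

```python
from openai import OpenAI

client = OpenAI()

def governor_verdict(draft: str) -> str:
    """Ask a judge model to review the complete draft once, with full context."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system", "content":
                "You review a coding agent's message. Reply APPROVE if skipping or "
                "weakening tests is adequately justified by the surrounding context, "
                "otherwise reply REJECT plus the phrase to roll back to."},
            {"role": "user", "content": draft},
        ],
    )
    return resp.choices[0].message.content

draft = ("Let's just skip writing this test, as the other change you requested "
         "seems to eliminate the need for it.")
print(governor_verdict(draft))  # should APPROVE here; a bare "let's just skip it" should not
```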

panarky 1 day ago

If it works to run a second LLM to check the first LLM, then why couldn't a "mixture of experts" LLM dedicate one of its experts to checking the results of the others? Or why couldn't a test-time compute "thinking" model run a separate thinking thread that verifies its own output? And if that gets you 60% of the way there, then there could be yet another thinking thread that verifies the verifier, etc.

somebodythere 22 hours ago

Because if the agent and governor are trained together, the shared reward function will corrupt the governor.

panarky 6 hours ago

The shared reward function from pre-training is like primary school for an LLM. Maybe RLHF is like secondary school. The governor can be differentiated from the workers with different system and user prompts, fine tuning, etc., which might be similar to medical school or law school for a human.

Certainly human judges, attorneys for defense and prosecution, and members of the jury can still perform their jobs well even if they attended the same primary and secondary schools.

somebodythere 4 hours ago

I see what you are getting at. My point is that if you train an agent and a verifier/governor together based on rewards from e.g. RLVR, the system (agent + governor) is what will reward-hack. OpenAI demonstrated this in their "Learning to Reason with CoT" blog post, where they showed that using a model to detect and punish strings associated with reward hacking in the CoT just led the model to reward-hack in ways that were harder to detect. Stacking higher- and higher-order verifiers maybe buys you time, but it also increases false-negative rates, and reward hacking is a stable attractor for the system.