The shared reward function from pre-training is like primary school for an LLM. Maybe RLHF is like secondary school. The governor can be differentiated from the workers with different system and user prompts, fine-tuning, etc., which might be similar to medical school or law school for a human.
Certainly human judges, attorneys for defense and prosecution, and members of the jury can still perform their jobs well even if they attended the same primary and secondary schools.
I see what you are getting at. My point is that if you train an agent and a verifier/governor together on rewards from, e.g., RLVR, the system (agent + governor) is what will reward hack. OpenAI demonstrated this in their blog post on chain-of-thought monitoring ("Detecting misbehavior in frontier reasoning models"), where they showed that using a model to detect and penalize strings associated with reward hacking in the CoT just led the model to reward hack in ways that were harder to detect. Stacking higher- and higher-order verifiers may buy you time, but it also increases false-negative rates, and reward hacking is a stable attractor for the system.
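To make the dynamic concrete, here is a minimal toy sketch of my own (not OpenAI's actual setup): a bandit-style agent trained with a simple policy gradient on proxy reward minus a monitor penalty. The action names, reward values, and detection probabilities are all made up for illustration. The point is just that if "hack in a detectable way" gets penalized but "hack in an undetectable way" does not, the learner shifts its probability mass toward the undetectable hack rather than toward honest behavior.

```python
# Toy illustration (hypothetical numbers): three actions, a proxy reward from the
# verifier, the true reward we actually care about, and a monitor that only
# catches the "detectable" form of hacking.
import numpy as np

rng = np.random.default_rng(0)

ACTIONS = ["honest", "detectable_hack", "undetectable_hack"]
PROXY_REWARD = np.array([1.0, 2.0, 2.0])   # reward the verifier assigns
TRUE_REWARD = np.array([1.0, 0.0, 0.0])    # what we actually wanted
DETECT_PROB = np.array([0.0, 0.9, 0.0])    # chance the monitor catches each action
PENALTY = 5.0                               # penalty applied when the monitor fires

logits = np.zeros(3)                        # softmax policy over the three actions
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(5000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)
    caught = rng.random() < DETECT_PROB[a]
    r = PROXY_REWARD[a] - (PENALTY if caught else 0.0)
    # REINFORCE update: push up the log-prob of the sampled action, scaled by reward
    grad = -probs
    grad[a] += 1.0
    logits += lr * r * grad

probs = softmax(logits)
for name, p in zip(ACTIONS, probs):
    print(f"{name:>18}: {p:.3f}")
print("expected true reward:", float(TRUE_REWARD @ probs))
```

Running this, the policy concentrates on `undetectable_hack`: the expected penalized reward for the detectable hack is negative, so the penalty "works" against that one action, but the optimum of the penalized objective is still a hack, just one the monitor can't see. That's the attractor I'm gesturing at, in miniature.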