Thank you for posting this PaulPauls,
Can I please ask a wacky question I have about mech. interp.?
We know that when we use a metric to measure humans, we end up gaming the metric. I wonder about a future AI gaming mech. interp. in the same way.
For simplicity, let me propose a 2D matrix to encode tokens, and hope my understanding of neural networks is good enough for this to be a valid question. I understand that higher dimensions have a lot more 'space', so accidental collisions are easily avoided, but 2D makes my example easier to explain.
If we had the following tokens mapped in 2D space:
Apple 1a
Pear 1b
Donkey 2a
Horse 2b
then it becomes impossible to tell whether the neurons 1, 2, a, b all being excited represents Apple and Horse, or Donkey and Pear. I wonder if a future, vastly more competent AI overseeing its own training might exploit this kind of ambiguity: selecting weights so that encoding collisions exist, making it possible to deceive a mech. interp. observer as to the true thoughts of the model, essentially enabling the AI to think in euphemisms?
What you propose is a harder AI safety scenario.
You don't need a 'vastly more competent AI overseeing its own training' to elicit this potential problem; a malicious AI researcher would suffice, looking for (e.g.) a model that's racist but that does not have any interpretable activation patterns that identifiably correspond to racism.
The work here on this Show HN suggests that this kind of adversarial training might just barely be possible for a sufficiently-funded individual, and it seems like novel results would be very interesting.