> So I personally don't think it shows LLM models can fool humans trying to unmask them
Maybe they used special unrestricted LLMs or something, but isn't it pretty trivial to get an LLM to produce a refusal message by asking it to commit crimes or talk about certain forbidden topics?
I think priming people to believe they might be talking to a human skews the results here: people will be more hesitant to say really wild shit that an LLM can't react appropriately to if they think a human might be on the other end.
I feel like a cash reward would help not only with motivation in the obvious way, but also by giving people social permission to act weird, since the human on the other side would understand you're doing it to help both of you win the money.
Perhaps the final form of this experiment will always factor in the reward value (for results better than chance, since guessing randomly earns an expected $0.5*X for zero effort, which beats full effort for $X), and we could track how the reward needed to reliably distinguish them increases over time. There might be a casino game in there somewhere, though collusion between human witnesses and interrogators could become a problem as the stakes get high.