Do the goalposts have to keep moving until we can no longer find any gap in common knowledge or eccentric behavior in AI? If so, what does that say about eccentric human beings?
Of course; that's the point of an adversarial test, to free the interrogators to use all their human intelligence to place the goalposts wherever they judge best. There will always be individual humans who'd fail any sane version of the test (illiterate, comatose, etc.), so the test is meaningful only as a statistical aggregate.
To me it just sounds like you're holding interrogators to an unreasonably high standard in order to deny the findings of the study. If we're talking about statistical aggregates, knowing that the average person lacks the knowledge to exploit the known biases of current AI models is enough to dismiss the expectation that interrogators should target those biases specifically. Commenters also seem to be missing that this is a situation where the interrogator does not know whether they are conversing with an AI model or a human being. I wouldn't expect someone to go all out boxing a punching bag if I told them there's a 50% chance another person is trapped in there. I've never seen the Turing test described in such demanding terms, and a look at the Wikipedia page contradicts the definitions pushed forward here.
Perhaps another name should be coined to describe the level of perfection that critics expect from this. It sounds like what you want is something akin to a comprehensive test for AGI.
If your standard for how hard the interrogator should try isn't "as hard as they can", then what do you propose instead? It's always possible to fool a sufficiently lazy human, so you need some criterion.
> It sounds like what you want is something akin to a comprehensive test for AGI.
Since you mentioned Wikipedia, their first proposed test for AGI is Turing's:
https://en.wikipedia.org/wiki/Artificial_general_intelligenc...
I (generally, not from you) see a motte-and-bailey game, where the strongest versions of Turing's test are described as equivalent to AGI, and then favorable results on weaker versions are used to claim we've achieved it. I think those weaker results are significant, probably in economically important ways, though mostly socially destructive. I think this preprint is mostly good. I don't like that conflation, though.
>To me it just sounds like you're holding interrogators to an unreasonably high standard in order to deny the findings of the study.
There is no single, canonical Turing test. On a deep philosophical level, a Turing test is a kind of never-ending test for everyone we interact with, all the time. I don't want to get too deep into the weeds of philosophy here, but the idea is that we are verifying intelligence in general, just as we verify any scientific theory through replication.
In a very scientific way, it's just another case of perpetual falsifiability. The same way that Newtonian physics is a "fact" until it isn't, an AI passes a Turing test until it doesn't.
Here are some example questions that Turing proposed when initially describing the test:
>"I have K at my K1, and no other pieces. You have only K at K6 and R at R1. It is your move. What do you play?"
>"In the first line of your sonnet which reads 'Shall I compare thee to a summer's day', would not 'a spring day' do as well or better?"
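For readers unfamiliar with the old descriptive notation in the chess question: in Turing's paper the machine answers "R-R8 mate" after a pause. One common translation into modern coordinates (assuming the answerer plays Black, with the rook starting on a8 — the file is actually ambiguous in descriptive notation) is White K on e1 versus Black K on e3 and R on a8, where ...Ra1 is indeed checkmate. A minimal sketch checking that, for this bare three-piece position only:

```python
# Hypothetical translation of Turing's chess question (answerer as Black):
# White: K on e1.  Black: K on e3, R on a8.  Black plays R-R8, i.e. ...Ra1.
# Squares are (file, rank) with files a..h -> 0..7 and ranks 1..8 -> 0..7.

def attacked_by_black(sq, bk, rook):
    """True if sq is attacked by the black king at bk or the rook."""
    f, r = sq
    if max(abs(f - bk[0]), abs(r - bk[1])) == 1:
        return True
    # Rook attacks along its file/rank; no interposing pieces exist
    # in this bare position, so no blocking check is needed.
    return f == rook[0] or r == rook[1]

def is_checkmate(wk, bk, rook):
    """White king wk is mated if it is in check and has no legal move."""
    if not attacked_by_black(wk, bk, rook):
        return False
    for df in (-1, 0, 1):
        for dr in (-1, 0, 1):
            if df == dr == 0:
                continue
            f, r = wk[0] + df, wk[1] + dr
            if not (0 <= f < 8 and 0 <= r < 8):
                continue
            if (f, r) == rook:
                # Capturing the rook is legal unless the black king defends it.
                if max(abs(f - bk[0]), abs(r - bk[1])) > 1:
                    return False
            elif not attacked_by_black((f, r), bk, rook):
                return False
    return True

wk, bk = (4, 0), (4, 2)                      # Ke1 vs ...Ke3
print(is_checkmate(wk, bk, rook=(0, 0)))     # after ...Ra1 -> True
print(is_checkmate(wk, bk, rook=(0, 7)))     # before the move -> False
```

This is only a toy verifier for this exact material balance, not a general legality checker, but it shows the sample question has a concrete, machine-checkable answer.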
It seems to me that it isn't a movement of the goalposts to demand that the interrogators are adversarial and as challenging as possible - it's what Turing originally envisioned.
As such, "the AI did not pass the Turing test because the interrogators were not sufficiently challenging" becomes a standard that is impossible to beat. The reductio of this is that for an AI to pass the Turing test, it would have to fool everyone on the planet, which is not what I believe was intended.
Rather, we should set an upper bound on what a reasonable interpretation of "as challenging as possible" means.