I think the core idea is reasonably solid. For as long as there's some intellectual capability that humans have and machines don't, it should in theory be possible to use that to distinguish the two. Turing gave the example of feeding in chess moves, for instance.
The problem is just that in 5-minute sessions (which is the duration Turing suggested, so not the fault of this study) with non-experts, the conversations seemed to tend heavily toward brief, unchallenging small talk - which GPT-4.5 did well at, because many interrogators were poorly calibrated about LLMs' ability to speak informally.
I think it might instead make sense to consider the accuracy of the best interrogator/strategy. The most accurate strategy listed in the paper still achieves 75% accuracy, for instance, and I'd suspect there are many people well-informed about LLM weaknesses who could reliably exceed even that.
This is a good point. It's really remarkable how many people think ChatGPT's default "voice" is the only thing that can come out of an LLM.
> For as long as there's some intellectual capability that humans have and machines don't
Careful. You're smuggling in an assumption that isn't true. Machines don't have intellectual capabilities, and this follows from what the computer, as a formal construct, is. They can simulate the appearance of intellectual ability, as LLMs do, at least in certain respects, but appearance ought not to be conflated with cause.
I don't personally believe there's anything fundamentally preventing machines from being intelligent in the same way biological life is. That's not to say that LLMs currently are.
But, if you want, you can replace "some intellectual capability" with "some capability typically associated with intelligence". Ability to solve unseen logic puzzles, for instance.