Assuming this result holds, and knowing that LLMs (including 4o) nevertheless remain incapable of standing in for people in most cases that require intelligence, this seems like a damning indictment of the test as an indicator of genuine intelligence.
One (bad) pet theory I have is that LLMs/AIs are going to uncover something very uncomfortable to us: the difference in intelligence between people is a lot bigger than we thought. In that someone with an IQ of 95 and an IQ of 105 [0] have very different views of the world and very different abilities to navigate that world. Like, some people are much dumber than we thought they were and some people are much smarter. Not sure what the downstream effects of such a theory might be, but I don't like the things I can think up.
Again, a (bad) pet theory.
[0] Yes, IQ is not a good measure of blah blah blah. I'm just using this as a handle to explain things, I don't mean it literally.
I think we're gonna find that there are different ways to quantify "humanness" other than IQ. Someone with an IQ of 95 might seem "more real" than an LLM with a computed IQ of 145.
EQ is a much better test at what makes us "human" than IQ. The only reason we don't give it credit is that it makes us even more uncomfortable than IQ.
I mean, yeah. IQ is a bad measure (even if self-consistent). Training trumps all, like with every task. The more we do something, the better we're going to be at it.
The thing that is going to be interesting is now that we have essentially cheap, ethically clear, and realistic digital 'people', what are the experiments that we can do with them and what can we uncover? I'm a little flat-footed even as to what questions we can ask them now. At the very least, we can use them to 'dry-run' surveys and experiments and have better data collection and stress-testing. Like, you can now generate realistic data and use that to run the stats while the real surveys are coming in.
Even if your claim is true, how would LLMs/AI lead to uncovering this? I don’t see why they are related, except very tangentially.
I mean they said it was a bad theory.
More seriously, it seems to be essentially the idea that “surpassing human intelligence” is not the binary outcome many thought it would be, and that much of what passes for human intelligence interpersonally could be imitation of intelligence.
Yeah, the impetus comes from the Ashley Madison hacks.
Like, you had thousands of men paying real money to chat with (terrible) bots. To me, that was the passing of the Turing Test. But I know of almost no one who could possibly fall for that scam. Even family members deep in dementia knew it was a joke. Yet Ashley Madison made a ton of cash.
That, to me, was puzzling. How could it happen that people that are that foolish would be able to hold a job or pay taxes? It made no sense.
So, the (bad) pet theory that I eventually came up with is that human intelligence is a lot wider than we think it is.
Maybe you've discovered that learning pays compound interest.
David Epstein talks about this in Range.
Essentially, we have 'kind' and 'unkind' learning environments.
To be successful in a Kind environment, you drill-and-kill. The feedback is near instant and the ranking is clear. These are things like golf, classical music, and chess.
To be successful in an Unkind environment, you learn as much as you can. The feedback is infrequent and the ranking is murky. These are things like tennis, jazz, and business.
I'd think that the compound interest only comes into play in the Unkind environments, as you can make new connections from the new data coming in. In the Kind environment, new data doesn't make a difference since you're just trying to be perfect at the thing you're focusing on; if anything, it's an impediment.
I think the core idea is reasonably solid. For as long as there's some intellectual capability that humans have and machines don't, it should in theory be possible to use that to distinguish the two. Turing gave the example of feeding in chess moves, for instance.
It's just that in 5-minute sessions (which is what Turing suggested, not the fault of this study) with non-experts, the conversations tended heavily toward brief, unchallenging small talk, which GPT-4.5 handled well because many interrogators were poorly calibrated about LLMs' ability to speak informally.
I think it might instead make sense to consider the accuracy of the best interrogator/strategy. The most accurate strategy listed in the paper still gets 75% accuracy, for instance, and I'd suspect there are many people well-informed about LLM weaknesses who could reliably exceed even that.
This is a good point. It's really remarkable how many people think ChatGPT's default "voice" is the only thing that can come out of an LLM.
> For as long as there's some intellectual capability that humans have and machines don't
Careful. You're smuggling in an assumption that isn't true. Machines don't have intellectual capabilities, and this follows from what the computer, as a formal construct, is. They can simulate the appearance of intellectual ability, as LLMs can, at least in certain respects, but appearance ought not be conflated with cause.
I don't personally believe that there's anything fundamentally preventing machines from being intelligent in the same way biological life is. Not to say that LLMs currently are.
But, if you want, you can replace "some intellectual capability" with "some capability typically associated with intelligence". Ability to solve unseen logic puzzles, for instance.
The Turing Test does not aim at measuring intelligence. It's about differentiating between a human being and a machine.
And it depends on the person and their experience of chatbots. People were fooled in the 1960s by ELIZA, the chatbot that mostly just rephrased what the user said as a question (e.g. "I'm afraid of flying." "Why are you afraid of flying?"), and people believed it understood them.
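The rephrasing trick ELIZA used can be sketched in a few lines. This is a toy illustration, not Weizenbaum's actual DOCTOR script (which used ranked keyword rules and templates); the substitution list here is hypothetical and deliberately naive:

```python
def rephrase(statement: str) -> str:
    """Turn a first-person statement into a reflective question, ELIZA-style.

    A deliberately naive sketch: strip trailing punctuation, swap a few
    first-person phrases for second-person ones, and wrap in "Why ...?".
    """
    s = statement.rstrip(".!").lower()
    # Naive first-/second-person swaps; real ELIZA used ranked keyword rules.
    for old, new in [("i'm", "are you"), ("i am", "are you"), ("my", "your")]:
        s = s.replace(old, new)
    return f"Why {s}?"

print(rephrase("I'm afraid of flying."))  # prints "Why are you afraid of flying?"
```

Even this crude pattern-matching, with no model of meaning at all, produced responses that felt attentive enough to convince some users they were understood.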
I recently came across a critique of the Turing test that seems relevant here. Given the test's limited duration (five minutes in this study) and the constrained rate of human communication, it’s theoretically possible to anticipate every possible human response and prepare prewritten replies in advance. If such a giant lookup table successfully deceives the interrogator most of the time, would we then consider it intelligent?
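To see why "theoretically possible" is doing a lot of work in that critique, here's a back-of-envelope estimate of the lookup table's size. All the numbers are hypothetical assumptions (message length, alphabet size, message count), chosen only to show the order of magnitude:

```python
import math

# Assumptions (hypothetical): each interrogator message is ~200 characters
# drawn from a 64-symbol alphabet, and a 5-minute session fits ~5 messages.
distinct_messages = 64 ** 200          # possible single messages
messages_per_session = 5
# A complete table needs a reply for every possible conversation history.
table_entries = distinct_messages ** messages_per_session

print(f"~10^{int(math.log10(table_entries))} table entries")  # prints "~10^1806 entries"
```

The table is finite, which is the philosophical point, but at roughly 10^1806 entries it dwarfs the ~10^80 atoms in the observable universe, so it's a thought experiment about what "intelligence" means rather than a buildable machine.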
IDK, 70 years is a good long run, it seems to have held up remarkably well.
A lot of its value is that it's intuitively obvious to laypeople.
If you deal in modern machine learning/AI/whatever, you can formulate all sorts of criteria and parameters for an "actually intelligent machine", but it's never going to be as clearcut as "if it quacks like a duck".