Take a word problem for example. A child will be told the first step is to translate the problem from human language to mathematical notation (symbolic representation), then solve the math (logic).
A human doesn’t use next token prediction to solve word problems.
But the LLM isn't "using next-token prediction" to solve the problem, that's only how it's evaluated.
The "real processing" happens through the various transformer layers (and token-wise nonlinear networks), where it seems as if progressively richer meanings are added to each token. That rich feature set then decodes to the next predicted token, but that decoding step is throwing away a lot of information contained in the latent space.
If language models (per Anthropic's work) can have a direction in latent space correspond to the concept of the Golden Gate Bridge, then I think it's reasonable (albeit far from certain) to say that LLMs are performing some kind of symbolic-ish reasoning.
Anthropic had a vested interest in people thinking Claude is reasoning.
However, in coding tasks I’ve been able to find it directly regurgitating Stack overflow answers (like literally a google search turns up the code).
Giving coding is supposed to be Claude’s strength, and it’s clearly just parroting web data, I’m not seeing any sort of “reasoning”.
LLM may be useful but they don’t think. They’ve already plateaued, and given the absurd energy requirements I think they will prove to be far less impactful than people think.
The claim that Claude is just regurgitating answers from Stackoverflow is not tenable, if you've spent time interacting with it.
You can give Claude a complex, novel problem, and it will give you a reasonable solution, which it will be able to explain to you and discuss with you.
You're getting hung up on the fact that LLMs are trained on next-token prediction. I could equally dismiss human intelligence: "The human brain is just a biological neural network that is adapted to maximize the chance of creating successful offspring." Sure, but the way it solves that task is clearly intelligent.
I’ve literally spent 100s of hours with it. I’m mystified why so many people use the “you’re holding it wrong” explanation when somebody points out real limitations.
You might consider that other people have also spent hundreds of hours with it, and have seen it correctly solve tasks that cannot be explained by regurgitating something from the training set.
I'm not saying that your observations aren't correct, but this is not a binary. It is entirely possible that the tasks you observe the models on are exactly the kind where they tend to regurgitate. But that doesn't mean that it is all they can do.
Ultimately, the question is whether there is a "there" there at all. Even if 9 times out of 10, the model regurgitates, but that one other time it can actually reason, that means that it is capable of reasoning in principle.
When we've spent time with it and gotten novel code, then if you claim that doesn't happen, it is natural to say "you're holding it wrong". If you're just arguing it doesn't happen often enough to be useful to you, that likely depends on your expectations and how complex tasks you need it to carry out to be useful.
In many ways, Claude feels like a miracle to me. I no longer have to stress over semantics or searching for patterns I can recognize and work with, but I’ve never actually coded them myself in that language. Now, I don’t have to waste energy looking up things that I find boring
The LLM isn't solving the problem. The LLM is just predicting the next word. It's not "using next-token prediction to solve a problem". It has no concept of "problem". All it can do is predict 1 (one) token that follows another provided set. That running this in a loop provides you with bullshit (with bullshit defined here as things someone or something says neither with good nor bad intent, but just with complete disregard for any factual accuracy or lack thereof, and so the information is unreliable for everyone) does not mean it is thinking.
All the human brain does is determine how to fire some motor neurons. No, it does not reason.
No, the human brain does not "understand" language. It just knows how to control the firing of neurons that control the vocal chords, in order to maximize an endocrine reward function that has evolved to maximize biological fitness.
I can speak about human brains the same way you speak about LLMs. I'm sure you can spot the problem in my conclusions: just because the human brain is "only" firing neurons, it does actually develop an understanding of the world. The same goes for LLMs and next-word prediction.
I agree with you as far as the current state of LLMs, but I also feel like we humans have preconceived notions of “thought” and “reasoning”, and are a bit prideful of them.
We see the LLM sometimes do sort of well at a whole bunch of tasks. But it makes silly mistakes that seem obvious to us. We say, “Ah ha! So it can’t reason after all”.
Say LLMs get a bit better, to the point they can beat chess grandmasters 55% of the time. This is quite good. Low level chess players rarely ever beat grandmasters, after all. But, the LLM spits out illegal moves sometimes and sometimes blunders nonsensically. So we say, “Ah ha! So it can’t reason after all”.
But what would it matter if it can reason? Beating grandmasters 55% of the time would make it among the best chess players in the world.
For now, LLMs just aren’t that good. They are too error prone and inconsistent and nonsensical. But they are also sort weirdly capable at lots of things in strange inconsistent ways, and assuming they continue to improve, I think they will tend to defy our typical notions of human intelligence.
reating gradmasters 55% of the time is not good. We've been beating almost 100% of grandmasters in the 90's.
And when even llm's that are good at chess run, I had recently read an article about someone who examined them from this aspect. They said that if the llm fails to come up with a legal move within 10 tries, a random move is played instead. And this was common.
Not even a beginner would attempt 10 illegal moves in a row, after their first few games. The state of LLM chess is laughably bad, honestly. You would guess that even if it didn't play well, it'd at least consistently make legal moves. It doesn't even get to that level.
I don't see why this isn't a good model for how human reasoning happens either, certainly as a first-order assumption (at least).
> A human doesn’t use next token prediction to solve word problems.
Of course they do, unless they're particularly conscientious noobs that are able to repeatedly execute the "translate to mathematical notation, then solve the math" algorithm, without going insane. But those people are the exception.
Everyone else either gets bored half-way through reading the problem, or has already done dozens of similar problems before, or both - and jump straight to "next token prediction", aka. searching the problem space "by feels", and checking candidate solutions to sub-problems on the fly.
This kind of methodical approach you mention? We leave that to symbolic math software. The "next token prediction" approach is something we call "experience"/"expertise" and a source of the thing we call "insight".
Indeed. Work on any project that requires humans to carry out largely repetitive steps, and a large part of the problem involves how to put processes around people to work around humans "shutting off" reasoning and going full-on automatic.
E.g. I do contract work on an LLM-related project where one of the systemic changes introduced - in addition to multiple levels of quality checks - is to force to make people input a given sentence word for word followed by a word from a set of 5 or so, and a minority of the submissions get that sentence correct including the final word despite the system refusing to let you submit unless the initial sentence is correct. Seeing the data has been an absolutely shocking indictment of human reasoning.
These are submissions from a pool of people who have passed reasoning tests...
When I've tested the process myself as well, it takes only a handful of steps before the tendency is to "drift off" and start replacing a word here and there and fail to complete even the initial sentence without a correction. I shudder to think how bad the results would be if there wasn't that "jolt" to try to get people back to paying attention.
Keeping humans consistently carrying out a learned process is incredibly hard.
is that based on a vigorous understanding of how humans think, derived from watching people (children) learn to solve word problems? How do thoughts get formed? Because I remember being given word problems with extra information, and some children trying to shove that information into a math equation despite it not being relevant. The "think things though" portion of ChatGPT o1-preview is hidden from us, so even though a o1-preview can solve word problems, we don't know how it internally computes to arrive at that answer. But we do we really know how we do it? We can't even explain consciousness in the first place.