I have the exact same problem with this article that I had with the previous one - the author fails to provide any data on the frequency of illegal moves.
Thus it's impossible to draw any meaningful conclusions. It would be similar to if I claimed that an LLM is an expert doctor, but in my data I've filtered out all of the times it gave incorrect medical advice.
I don't think this is super relevant. I mean, it would be interesting (especially if there was a meaningful difference in the number of illegal move attempts between the different approaches, doubly so if that didn't correlate with the performance when illegal moves are removed), but I don't think it really affects the conclusions of the article: picking randomly from the set of legal moves makes for a truly terrible chess player, so clearly the LLMs are bringing something to the party such that sampling from their output performs significantly better. Splitting hairs about the capability of the LLM on its own (i.e. insisting on defining attempts at an illegal move as a game loss for the purposes of rating) seems pretty beside the point.
> It would be similar to if I claimed that an LLM is an expert doctor, but in my data I've filtered out all of the times it gave incorrect medical advice.
Computationally it's trivial to detect illegal moves, so it's nothing like filtering out incorrect medical advice.
> Computationally it's trivial to detect illegal moves
You're strictly correct, but the rules for chess are infamously hard to implement (as anyone who's tried to write a chess program will know), leading to minor bugs in a lot of chess programs.
For example, there's this old myth about vertical castling being allowed due to ambiguity in the ruleset: https://www.futilitycloset.com/2009/12/11/outside-the-box/ (Probably not historically accurate).
If you move beyond legal positions into who wins when one side flags, the rules state that the other side should be awarded a victory if checkmate was possible with any legal sequence of moves. This is so hard to check that no chess program tries to implement it, instead using simpler rules to achieve a very similar but slightly more conservative result.
That link was new to me, thanks! However: I wrote a chess program myself (nothing big, hobby level) and I would not call it hard to implement. Just harder than what someone might assume initially. But in the end, it is one of the simpler simulations/algorithms I did. It is just the state of the board, the state of the game (how many turns, castling rights, past positions for the repetition rule, ...) and picking one rule set if one really wants to be exact.
(thinking about which rule set is correct would not be meaningful in my opinion - chess is a social construct, with only parts of it being well defined. I would not bother about the rest, at least not when implementing it)
By the way: I read "Computationally it's trivial" as more along the lines of "it has been done before, it is efficient to compute, one just has to do it" versus "this is new territory, one needs to come up with how to wire up the LLM output with an SMT solver, and we do not even know if/how it will work."
> You're strictly correct, but the rules for chess are infamously hard to implement
Come on. Yeah, they're not trivial, but they've been done numerous times. There have been chess programs for almost as long as there have been computers. Checking legal moves is a _solved problem_.
Detecting valid medical advice is not. The two are not even remotely comparable.
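To be concrete about how solved it is, here's a minimal sketch of a legality check, assuming the python-chess library (just an illustration, not necessarily what the article used):

    import chess

    board = chess.Board()                      # standard starting position
    board.push_san("e4")                       # 1. e4

    candidate = chess.Move.from_uci("e7e5")    # ...e5 in reply
    print(candidate in board.legal_moves)      # True

    bogus = chess.Move.from_uci("e8e4")        # black king teleporting to e4
    print(bogus in board.legal_moves)          # False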
> Detecting valid medical advice is not. The two are not even remotely comparable.
Uh? Where exactly did I signal my support for LLMs giving medical advice?
We implemented a whole chess engine in Lisp during 3rd year; implementing the legal move/state checking was actually really trivial.
I got a kick out of that link. Had certainly never heard of "vertical castling" previously.
As I wrote in another comment - you can write scripts that correct bad math, too. But we don't use that to claim that LLMs have a good understanding of math.
I'd say that's because we don't understand what we mean by "understand".
Hardware that accurately performs maths faster than all of humanity combined is so cheap as to be disposable, but I've yet to see anyone claim that a Pi Zero has "understanding" of anything.
An LLM can display the viva voce approach that Turing suggested[0], and do it well. Ironically for all those now talking about "stochastic parrots", the passage reads:
"""… The game (with the player B omitted) is frequently used in practice under the name of viva voce to discover whether some one really understands something or has ‘learnt it parrot fashion’. …"
Showing that not much has changed on the philosophy of this topic since it was invented.
[0] https://academic.oup.com/mind/article/LIX/236/433/986238
> I'd say that's because we don't understand what we mean by "understand".
I'll have a stab at it. The idea of LLMs 'understanding' maths is that, once having been trained on a set of maths-related material, the LLM will be able to generalise to solve other maths problems that it hasn't encountered before. If an LLM sees all the multiplication tables up to 10x10, and then is correctly able to compute 23x31, we might surmise that it 'understands' multiplication - i.e. that it has built some generalised internal representation of what multiplication is, rather than just memorising all possible answers. Obviously we don't expect generalisation from a Pi Zero without specifically being coded for it, because it's a fixed function piece of hardware.
Personally I think this is highly unlikely given that maths and natural language are very different things, and being good at the latter does not bear any relationship to being good at the former (just ask anyone who struggles with maths - plenty of people do!). Not to mention that it's also much easier to test for understanding of maths because there is (usually!) a single correct answer regardless of how convoluted the problem - compared to natural language where imitation and understanding are much more difficult to tell apart.
I don't know. I have talked to a few math professors, and they think LLMs are as good as a lot of their peers when it comes to hallucinations and being able to discuss ideas on very niche topics, as long as the context is fed in. If Tao is calling some models "a mediocre, but not completely incompetent [...] graduate student", then they seem to understand math to some degree to me.
Tao said that about a model brainstorming ideas that might be useful, not explaining complex ideas or generating new ideas or selecting a correct idea from a list of brainstormed ideas. Not replacing a human.
> Not replacing a human.
Obviously not, but that is tangential to this discussion, I think. A hammer might be a useful tool in certain situations, and surely it does not replace a human (but it might make a human in those situations more productive, compared to a human without a hammer).
> generating new ideas
Is brainstorming not an instance of generating new ideas? I would strongly argue so. And whether the LLM does "understand" (or whatever ill-defined, ill-measurable concept one wants to use here) anything about the ideas it produces, and how they might be novel - that is not important either.
If we assume that Tao is adequately assessing the situation and truthfully reporting his findings, then LLMs can, at the current state, at least occasionally be useful in generating new ideas, at least in mathematics.
Being as good as a professor at confidently hallucinating nonsense when you don't know the answer is a very high level skill.
Actually, LLMs do call scripts that correct bad math, and have gotten progressively better because of it. It's another special case example.
I don't think that analogy works unless you could write a script that automatically removes incorrect medical advice, because then you would indeed have an LLM-with-a-script that was an expert doctor (which you can do for illegal chess moves, but obviously not for evaluating medical advice).
You can write scripts that correct bad math, too. In fact most of the time ChatGPT will just call out to a calculator function. This is a smart solution, and very useful for end users! But, still, we should not try to use that to make the claim that LLMs have a good understanding of math.
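To make "call out to a calculator function" concrete, here's a rough sketch of the pattern; the CALC(...) convention and the glue code are invented for illustration, and real function-calling APIs differ in the details:

    import ast
    import operator

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def calculator(expr):
        # Tiny, safe arithmetic evaluator standing in for the calculator tool.
        def ev(node):
            if isinstance(node, ast.BinOp):
                return OPS[type(node.op)](ev(node.left), ev(node.right))
            if isinstance(node, ast.Constant):
                return node.value
            raise ValueError("unsupported expression")
        return ev(ast.parse(expr, mode="eval").body)

    def answer_with_tool(model_output):
        # Hypothetical glue: if the model emits CALC(<expr>), the harness
        # computes the result instead of trusting next-token prediction.
        if model_output.startswith("CALC(") and model_output.endswith(")"):
            return str(calculator(model_output[5:-1]))
        return model_output

    print(answer_with_tool("CALC(23*31)"))     # 713

The arithmetic happens in ordinary code, not in the model, which is exactly why it shouldn't count as the LLM understanding math.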
If a script were applied that corrected "bad math" and now the LLM could solve complex math problems that you can't one-shot throw at a calculator, what would you call it?
It's a good point.
But this math analogy is not quite appropriate: there's abstract math and arithmetic. A good math practitioner (LLM or human) can be bad at arithmetic, yet good at abstract reasoning. The latter doesn't (necessarily) require the former.
In chess, I don't think that you can build a good strategy if it relies on illegal moves, because tactics and strategies are tied.
If I had wings, I'd be a bird.
Applying a corrective script to weed out bad answers is also not "one-shot" solving anything, so I would call your example an elaborate guessing machine. That doesn't mean it's not useful, but that's not how a human being does maths when they understand what they're doing - in fact you can readily program a computer to solve general maths problems correctly the first time. This is also exactly the problem with saying that LLMs can write software - a series of elaborate guesses is undeniably useful and impressive, but without a corrective guiding hand, ultimately useless, and not demonstrating generalised understanding of the problem space. The dream of AI is surely that the corrective hand is unnecessary?
Then you could replace the LLM with a much cheaper RNG and let it guess until the "bad math filter" let something through.
I was once asked by one of the Clueless Admin types if we couldn't just "fix" various sites such that people couldn't input anything wrong. Same principle.
Agreed. It's not the same thing and we should strive for precision (LLMs are already opaque enough as it is).
An LLM that recognizes an input as "math" and calls out to a NON-LLM to solve the problem vs an LLM that recognizes an input as "math" and also uses next-token prediction to produce an accurate response ARE DIFFERENT.
At what point does "knows how to use a calculator" equate to knowing how to do math? Feels pretty close to me...
Well, LLMs are bad at math but they're ok at detecting math and delegating it to a calculator program.
It's kind of like humans.
It would be possible to employ an expert doctor, instead of writing a script.
Which is cheaper:
1. having a human expert creating every answer
or
2. having an expert check 10 answers, each of which has a 90% chance of being right, and then manually redoing the one which was wrong
Now add the complications that:
• option 1 also isn't 100% correct
• nobody knows which things in option 2 are correlated or not, and whether those are or aren't correlated with human errors, so we might be systematically unable to even recognise the errors
• even if we could, humans not only get lazy without practice but also get bored if the work is too easy, so a short-term study in efficiency changes doesn't tell you things like "after 2 years you get mass resignations by the competent doctors, while the incompetent just say 'LGTM' to all the AI answers"
gpt-3.5-turbo-instruct makes about 5 or fewer illegal moves in 8205. It's not in this article, but turbo-instruct has been evaluated before.
> It would be similar to if I claimed that an LLM is an expert doctor, but in my data I've filtered out all of the times it gave incorrect medical advice
Sharp eyes. Similarly, Andrew Ng and his Stanford University team pulled the same trick with a badly skewed training-to-testing ratio in his famous cardiologist-level paper published in Nature Medicine [1].
The training split is more than 99% and the testing split less than 1%, which fails AI validation 101. The paper would not stand in most AI conferences, yet it was published in Nature Medicine, one of the highest-impact-factor journals there is, and has many citations for AI in healthcare and medicine.
[1] Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network:
Correct - dynamic grammar-based/constrained sampling can be used to force the model, at each time-step, to only make valid moves (and you don't have to do it in the prompt like this article does!!!)
I have NO idea why no one seems to do this. It's a similar issue with LLM-as-judge evaluations. Often they are begging to be combined with grammar-based/constrained/structured sampling. So much good stuff in LLM land goes unused for no good reason! There are several libraries for implementing this easily - outlines, guidance, lm-format-enforcer, and likely many more. You can even do it now with OpenAI!
Oobabooga text gen webUI literally has chess as one of its candidate examples of grammar-based sampling!!!
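For what it's worth, here's the idea stripped of any particular library - just python-chess plus a stand-in score_move callback (a real setup would score candidates with the LLM's log-probabilities, or constrain generation token by token with one of the libraries above):

    import chess

    def legal_san_moves(board):
        # The only strings the sampler is allowed to emit at this step.
        return [board.san(m) for m in board.legal_moves]

    def pick_move(board, score_move):
        # Constrained decoding in miniature: score only the legal candidates
        # and take the best one, so an illegal move is impossible by construction.
        return max(legal_san_moves(board), key=score_move)

    board = chess.Board()
    # Dummy scorer for demonstration; a real one would come from the model.
    print(pick_move(board, score_move=lambda san: len(san)))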
This is a crazy goal-post move. TFA is proving a positive capability, and rejecting the null hypothesis that “LLMs can’t think, they just regurgitate”.
Making some illegal moves doesn’t invalidate the demonstrated situational logic intelligence required to play at ELO 1800.
(Another angle: a human on Chess.com also has any illegal move they try to make ignored, too.)
> Making some illegal moves doesn’t invalidate the demonstrated situational logic intelligence
That’s exactly what it does. 1 illegal move in 1 million or 100 million or any other sample size you want to choose means it doesn’t understand chess.
People in this thread are really distracted by the medical analogy so I’ll offer another: you’ve got a bridge that allows millions of vehicles to cross, and randomly falls down if you tickle it wrong, maybe a car of rare color. One key aspect of bridges is that they work reliably for any vehicle, and once they fail they don’t work with any vehicle. A bridge that sometimes fails and sometimes doesn’t isn’t a bridge as much as a death trap.
>1 illegal move in 1 million or 100 million or any other sample size you want to choose means it doesn’t understand chess
Highly rated chess players make illegal moves. It's rare but it happens. They don't understand chess ?
> Then no human understands chess
Humans with correct models may nevertheless make errors in rule applications. Machines are good at applying rules, so when they fail to apply rules correctly, it means they have incorrect, incomplete, or totally absent models.
Without using a word like “understands” it seems clear that the same apparent mistake has different causes.. and model errors are very different from model-application errors. In a math or physics class this is roughly the difference between carry-the-one arithmetic errors vs using an equation from a completely wrong domain. The word “understands” is loaded in discussion of LLMs, but everyone knows which mistake is going to get partial credit vs zero credit on an exam.
>Humans with correct models may nevertheless make errors in rule applications.
Ok
>Machines are good at applying rules, so when they fail to apply rules correctly, it means they have incorrect or incomplete models.
I don't know why people continue to force the wrong abstraction. LLMs do not work like 'machines'. They don't 'follow rules' the way we understand normal machines to 'follow rules'.
>so when they fail to apply rules correctly, it means they have incorrect or incomplete models.
Everyone has incomplete or incorrect models. It doesn't mean we always say they don't understand. Nobody says Newton didn't understand gravity.
>Without using a word like “understands” it seems clear that the same apparent mistake has different causes.. and model errors are very different from model-application errors.
It's not very apparent no. You've just decided it has different causes because of preconceived notions on how you think all machines must operate in all configurations.
LLMs are not the logic automatons in science fiction. They don't behave or act like normal machines in any way. The internals run some computations to make predictions but so does your nervous system. Computation is substrate-independent.
I don't even know how you can make this distinction without seeing what sort of illegal moves it makes. If it makes the sort high-rated players make, then what?
I can’t tell if you are saying the distinction between model errors and model-application errors doesn’t exist or doesn’t matter or doesn’t apply here.
I'm saying:
- Generally, we do not say someone does not understand just because of a model error. The model error has to be sufficiently large or the model sufficiently narrow. No-one says Newton didn't understand gravity just because his model has an error in it but we might say he didn't understand some aspects of it.
- You are saying the LLM is making a model error (rather than an application error) only because of preconceived notions of how 'machines' must behave, not on any rigorous examination.
Suppose you're right, and the internal model of the game rules is perfect but the application of the model for next-move is imperfect. Unless we can actually separate the two, does it matter? Functionally I mean, not philosophically. If the model was correct, maybe we could get a useful version of it out by asking it to write a chess engine instead of act as a chess engine. But when the Prolog code for that is as incorrect as the illegal chess move was, will you say again that the model is correct, but the usage of it merely resulted in minor errors?
> You are saying the LLM is making a model error (rather than an an application error) only because of preconceived notions of how 'machines' must behave, not on any rigorous examination.
Here's an anecdotal examination. After much talk about LLMs and chess, and math, and formal logic here's the state of the art, simplified from dialog with gpt today:
> blue is red and red is blue. what color is the sky? >> <blah blah, restates premise, correctly answers "red">
At this point fans rejoice, saying it understands hypotheticals and logic. Dialogue continues..
> name one red thing >> <blah blah, restates premise, incorrectly offers "strawberries are red">
At this point detractors rejoice, declaring that it doesn't understand. Now the conversation devolves into semantics or technicalities about prompt-hacks, training data, weights. Whatever. We don't need chess. Just look at it, it's broken as hell. Discussing whether the error is human-equivalent isn't the point either. It's broken! A partially broken process is no solid foundation to build others on. And while there are some exceptions, an unreliable tool/agent is often worse than none at all.
>It's broken! A partially broken process is no solid foundation to build others on. And while there are some exceptions, an unreliable tool/agent is often worse than none at all.
Are humans broken ? Because our reasoning is a very broken process. You say it's no solid foundation ? Take a look around you. This broken processor is the foundation of society and the conveniences you take for granted.
The vast vast majority of human history, there wasn't anything even remotely resembling a non-broken general reasoner. And you know the funny thing ? There still isn't. When people like you say LLMs don't reason, they hold them to a standard that doesn't exist. Where is this non-broken general reasoner in anywhere but fiction and your own imagination?
>And while there are some exceptions, an unreliable tool/agent is often worse than none at all.
Since you clearly take 'reliable' to mean 'makes no mistakes/is not broken', then no human is a reliable agent. Clearly, the real exception is when an unreliable agent is worse than nothing at all.
This feels more like a metaphysical argument about what it means to "know" something, which is really irrelevant to what is interesting about the article.
> Machines are good at applying rules, so when they fail to apply rules correctly, it means they have incorrect, incomplete, or totally absent models.
That's assuming that, somehow, a LLM is a machine. Why would you think that?
Replace the word with one of your own choice if that will help us get to the part where you have a point to make?
I think we are discussing whether LLMs can emulate chess playing machines, regardless of whether they are actually literally composed of a flock of stochastic parrots..
That's simple logic. Quoting you again:
> Machines are good at applying rules, so when they fail to apply rules correctly, it means they have incorrect, incomplete, or totally absent models.
If this line of reasoning applies to machines, but LLMs aren't machines, how can you derive any of these claims?
"A implies B" may be right, but you must first demonstrate A before reaching conclusion B..
> I think we are discussing whether LLMs can emulate chess playing machines
That is incorrect. We're discussing whether LLMs can play chess. Unless you think that human players also emulate chess playing machines?
Engineers really have a hard time coming to terms with probabilistic systems.
Try giving a random human 30 chess moves and asking them to make a non-terrible legal move. Average humans quite often try to make illegal moves even when clearly seeing the board in front of them. There are even plenty of cases where people reported a bug because the chess application didn't let them make an illegal move they thought was legal.
And the sudden comparison to something that's safety critical is extremely dumb. Nobody said we should tie the LLM to a nuclear bomb that explodes if it makes a single mistake in chess.
The point is that it plays at a level far far above making random legal moves or even average humans. To say that that doesn't mean anything because it's not perfect is simply insane.
> And the sudden comparison to something that's safety critical is extremely dumb. Nobody said we should tie the LLM to a nuclear bomb that explodes if it makes a single mistake in chess.
But it actually is safety critical very quickly whenever you say something like “works fine most of the time, so our plan going forward is to dismiss any discussion of when it breaks and why”.
A bridge failure feels like the right order of magnitude for the error rate and effective misery that AI has already quietly caused with biased models where one in a million resumes or loan applications is thrown out. And a nuclear bomb would actually kill fewer people than a full-on economic meltdown. But I’m sure no one is using LLMs in finance at all?
It’s so arrogant and naive to ignore failure modes that we don’t even understand yet. At least bridges and steel have specs. Software “engineering” was always a very suspect name for the discipline, but whatever claim we had to it is worse than ever.
It's not a goalpost move. As I've already said, I have the exact same problem with this article as I had with the previous one. My goalposts haven't moved, and my standards haven't changed. Just provide the data! How hard can it be? Why leave it out in the first place?
> Thus it's impossible to draw any meaningful conclusions. It would be similar to if I claimed that an LLM is an expert doctor, but in my data I've filtered out all of the times it gave incorrect medical advice.
Not really, you can try to make illegal moves in chess, and usually, you are given a time penalty and get to try again, so even in a real chess game, illegal moves are "filtered out".
And for the "medical expert" analogy, let's say that you compare to systems based on the well being of the patients after they follow the advise. I think it is meaningful even if you filter out advise that is obviously inapplicable, for example because it refers to non-existing body parts.
I want to see graphs of moves the author randomly made too. Maybe even plotting a random-move player on the performance graphs vs. the AIs.
It's beginner chess and beginners make moves at random all the time.
1750 elo is extremely far from beginner chess. The random mover bot on Lichess has like 700 rating.
And the article does show various graphs of the badly playing models which will hardly play worse than random but are clearly far below the good models.
There's a subtle distinction though; if you're able to filter out illegal behavior, the move quality conditioned on legality can be extremely different from arbitrary move quality (and, as you might see in LLM JSON parsing, conditioning per-token can be very different from conditioning per-response).
If you're arguing that the singularity already happened then your criticism makes perfect sense; these are dumb machines, not useful yet for most applications. If you just want to use the LLM as a tool though, the behavior when you filter out illegal responses (assuming you're able to do so) is the only reasonable metric.
Analogizing to a task I care a bit about: Current-gen LLMs are somewhere between piss-poor and moderate at generating recipes. With a bit of prompt engineering most recipes pass my "bar", but they're still often lacking in one or more important characteristics. If you do nothing other than ask it to generate many options and then as a person manually filter to the subset of ideas (around 1/20) which look stellar, it's both very effective at generating good recipes, and they're usually much better than my other sources of stellar recipes (obviously not generally applicable because you have to be able to tell bad recipes from good at a glance for that workflow to make sense). The fact that most of the responses are garbage doesn't really matter; it's still an improvement to how I cook.
When I play chess I filter out all kinds of illegal moves. I also filter out bad moves. A human is more like "recursively thinking of ideas and then evaluating them with another part of your model", so why not let the LLMs do the same?
Because that’s not what happens? We learn through symbolic meaning and rules which then form a consistent system. Then we can have a goal and continuously evaluate if we’re within the system and transitioning towards that goal. The nice thing is that we don’t have to compute the whole simulation in our brains and can start again from the real world. The more you train, the better your heuristics become and the more your efficiency increases.
The internal model of an LLM is statistical text. Which is linear and fixed. Not great for anything other than generating text similar to what was ingested.
> The internal model of an LLM is statistical text. Which is linear and fixed. Not great for anything other than generating text similar to what was ingested.
The internal model of a CPU is linear and fixed. Yet, a CPU can still generate an output which is very different from the input. It is not a simple lookup table, instead it executes complex algorithms.
An LLM has large amounts of input processing power. It has a large internal state. It executes "cycle by cycle", processing the inputs and internal state to generate output data and a new internal state.
So why shouldn't LLMs be capable of executing complex algorithms?
It probably can, but how will those algorithms be created? And what about the representation of both input and output? If it’s text, the most efficient way is to construct a formal system. Or a statistical model, if ambiguous and incorrect results are OK in the grand scheme of things.
The issue is always input consumption and output correctness. In a CPU, we take great care with data representation and protocol definition, then we do formal verification on the algorithms, and we can be pretty sure that the outputs are correct. So the issue is that the internal model (for a given task) of LLMs is not consistent enough and the referential window (keeping track of each item in the system) is always too small.
Neural networks can be evolved to do all sorts of algorithms. For example, controlling an inverted pendulum so that it stays balanced.
> In a CPU, we take great care with data representation and protocol definition, then we do formal verification on the algorithms, and we can be pretty sure that the outputs are correct.
Sure, intelligent design makes for a better design in many ways.
That doesn't mean that an evolved design doesn't work at all, right?
>The internal model of an LLM is statistical text. Which is linear and fixed.
Not at all. Like seriously, not in the slightest.
What does it encode? Images? Scent? Touch? Some higher dimensional qualia?
Well, a simple description is that they discover circuits that reproduce the training sequence. It turns out that in the process of this, they recover relevant computational structures that generalize the training sequence. The question of how far they generalize is certainly up for debate. But you can't reasonably deny that they generalize to a certain degree. After all, most sentences they are prompted on are brand new and they mostly respond sensibly.
Their representation of the input is also not linear. Transformers use self-attention which relies on the softmax function, which is non-linear.
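(A quick numerical check of the non-linearity claim, assuming numpy:)

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())    # shift for numerical stability
        return e / e.sum()

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([3.0, 1.0, 0.0])
    # A linear map f would satisfy f(a + b) == f(a) + f(b); softmax doesn't.
    print(softmax(a + b))
    print(softmax(a) + softmax(b))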
I would argue that it's more akin to filtering out the chit-chat with the patient, where the doctor explained things in an imprecise manner, keeping only the formal and valid medical notation.
There is no legitimate reason to make an illegal move in chess though? There are reasons why a good doctor might intentionally explain things imprecisely to a patient.
> There is no legitimate reason to make an illegal move in chess though?
If you make an illegal move and the opponent doesn't notice it, you gain a significant advantage. LLMs just have David Sirlin's "Playing to Win" as part of their training data.
You raise an interesting point. If the filtered out illegal moves were disadvantageous, it could be that if the model had been allowed to make any moves it wanted it would have played to a much worse level than it did.
It’s like the doctor saying, “you have cancer? Oh you don’t? Just kidding. Parkinson’s. Oh it’s not that either? How about common cold?”
But the difference is that valid bad moves (the equivalent of "cancer") were included in the analysis; it's only invalid ones (like "your body is kinda outgrowing itself") that were excluded from the analysis.
What makes a chess move invalid is the state of the board. I don’t think moves like “pick up the pawn and throw it across the room” were considered.
That's a valid move in Monopoly though. Although it's much preferred to pick up the table and throw it.