It’s kind of crazy to assert that the systems understand chess, and then disclose further down the article that sometimes he failed to get a legal move after 10 tries and had to sub in a random move.
A person who understands chess well (Elo 1800, let’s say) will essentially never fail to provide a legal move on the first try.
He is testing several models, some of which cannot reliably output legal moves. That's different from saying that all models, including the one he thinks understands chess, can't generate a legal move in 10 tries.
3.5-turbo-instruct's illegal move rate is about 5 or fewer out of 8,205 moves.
I also wonder what kind of invalid moves they are. There's "you can't move your knight to j9 that's off the board", "there's already a piece there" and "actually that would leave you in check".
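For what it's worth, those buckets are easy to tell apart programmatically. Here's a rough sketch with python-chess, assuming the candidate move arrives in UCI form like "g1f3" (SAN is messier, since parse_san folds all of these failures into a single error):

    import chess

    def classify(board: chess.Board, uci: str) -> str:
        try:
            move = chess.Move.from_uci(uci)   # "g1j9" dies here: "j9" isn't a square
        except ValueError:
            return "not even a square on the board"
        if not board.is_pseudo_legal(move):
            return "that piece can't move there (wrong pattern, blocked, own piece on the square, ...)"
        if not board.is_legal(move):
            return "moves like the piece should, but leaves your own king in check"
        return "legal"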
I think it's also significantly harder to play chess if you were to hear a sequence of moves over the phone and had to reply with a followup move, with no space or time to think or talk through moves.
What do you mean by "understand chess"?
I think you don't appreciate how good the level of chess displayed here is. It would take an average adult years of dedicated practice to get to 1800.
The article doesn't say how often the LLM fails to generate legal moves in ten tries, but it can't be often or the level of play would be much much much worse.
As seems often the case, the LLM seems to have a brilliant intuition, but no precise rigid "world model".
Of course words like intuition are anthropomorphic. At best a model for what LLMs are doing. But saying "they don't understand" when they can do _this well_ is absurd.
> But saying "they don't understand" when they can do _this well_ is absurd.
When we talk about understanding a simple axiomatic system, understanding means exactly that all of the axioms are modeled and applied correctly 100% of the time. This is chess, not something squishy like literary criticism. There's no need to debate semantics at all. One illegal move is a deal-breaker.
Undergraduate CS homework for playing any game with any technique would probably have the stipulation that any illegal move disqualifies the submission completely. Whining that it works most of the time would just earn extra pity/contempt as well as an F on the project.
We can argue whether an error rate of 1 in a million means that it plays like a grandmaster or a novice, but that’s less interesting. It failed to model a simple system correctly, and a much shorter/simpler program could do that. Doesn’t seem smart if our response to this as an industry is to debate semantics, ignore the issue, and work feverishly to put it to work modeling more complicated / critical systems.
You just made up a definition of "understand". According to that definition, you are of course right. I just don't think it's a reasonable definition. It's also contradicted by the person I was replying to in the sibling comment, where they argue that Stockfish doesn't understand chess, despite Stockfish of course having the "axioms" modeled and applied correctly 100% of the time.
Here are things people say:
Magnus Carlsen has a better understanding of chess than I do. (Yet we both know the precise rules of the game.)
Grandmasters have a very deep understanding of chess, despite occasionally making illegal moves (https://www.youtube.com/watch?v=m5WVJu154F0).
"If AlphaZero were a conventional engine its developers would be looking at the openings which it lost to Stockfish, because those indicate that there's something Stockfish understands better than AlphaZero." (https://chess.stackexchange.com/questions/23206/the-games-al...)
> Undergraduate CS homework for playing any game with any technique would probably have the stipulation that any illegal move disqualifies the submission completely. Whining that it works most of the time would just earn extra pity/contempt as well as an F on the project.
How exactly is this relevant to the question of whether LLMs can be said to have some understanding of chess? Can they consistently apply the rules when game states are given in PGN? No. _Very_ few humans without specialized training could either (without using a board as a tool to keep track of the implicit state). They certainly "know" the rules (even if they can't apply them) in the sense that they will state them correctly if you ask them to.
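Mechanically, "using a board as a tool to keep track of the implicit state" just means something like the following sketch with python-chess (the PGN line is a made-up example):

    import io
    import chess.pgn

    game = chess.pgn.read_game(io.StringIO("1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 *"))
    board = game.board()
    for move in game.mainline_moves():
        board.push(move)          # the board object, not the reader, tracks state and legality
    print(board.fen())            # the explicit position after the move sequence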
I am not particularly interested in "the industry". It's obvious that if you want a system to play chess, you use a chess engine, not an LLM. But I am interested in what their chess abilities teaches us about how LLMs build world models. E.g.:
Thanks for your thoughtful comment and refs to chase down.
> You just made up a definition of "understand". According to that definition, you are of course right. I just don't think it's a reasonable definition. ... Here are things people say:
Fine. As others have pointed out and I hinted at, debating terminology is kind of a dead end. I personally don't expect that "understanding chess" is the same as "understanding Picasso", or that those phrases would mean the same thing applied to people vs. to AI. Also, I'm not personally that interested in how performance stacks up compared to humans. Even if it were interesting, expectations for human-equivalent performance wouldn't be static either: human-equivalent error rates in AI are much easier for me to expect and forgive in robotics than in axiomatic game-play.
> I am interested in what their chess abilities teaches us about how LLMs build world models
Focusing on the single datapoint that TFA is establishing: some LLMs can play some chess with some amount of expertise, with some amount of errors. With no other information at all, this tells us that it failed to model the rules, or it failed in the application of those rules, or both.
Based on that, some questions worth asking: Which of these failure modes is acceptable, and in which circumstances? Does this failure mode apply to domains other than chess? Does it help if we give it the model directly, say by stating the rules in the prompt and explicitly instructing it not to make illegal moves? If it's failing to apply rules but excels as a model-maker, then perhaps it can spit out a model directly from examples, and I can feed that model into a separate engine that makes correct, deterministic steps that actually honor it?
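To make that last question concrete, here's a minimal sketch of the split: python-chess does the rule-keeping, and the LLM only proposes candidates. ask_llm_for_move is a hypothetical stand-in for whatever completion call you'd use; the fallback mirrors the article's retry-then-random procedure.

    import random
    import chess

    def next_move(board: chess.Board, ask_llm_for_move, max_tries: int = 10) -> chess.Move:
        for _ in range(max_tries):
            san = ask_llm_for_move(board)      # e.g. prompt with the game so far in PGN
            try:
                return board.parse_san(san)    # parse_san only accepts moves legal in this position
            except ValueError:
                continue                       # illegal or unparseable: ask again
        return random.choice(list(board.legal_moves))  # give up, as the article does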
Saying that LLMs do or don't understand chess is lazy I guess. My basic point is that the questions above and their implications are so huge and sobering that I'm very uncomfortable with premature congratulations and optimism that seems to be in vogue. Chess performance is ultimately irrelevant of course, as you say, but what sits under the concrete question is more abstract but very serious. Obviously it is dangerous to create tools/processes that work "most of the time", especially when we're inevitably going to be giving them tasks where we can't check or confirm "legal moves".
> When we talk about understanding a simple axiomatic system, understanding means exactly that the entirety of the axioms are modeled and applied correctly 100% of the time.
Yes, but then, when we talk about understanding in LLMs, we talk about existence, but not necessarily about determinism.
Remember that, while chess engines are (I guess?) deterministic systems, LLMs are randomized systems. You give the same context and the same prompt multiple times, and each and every time you get a different response.
To me, this, together with the fact that you have an at least 1-in-10 chance of getting a good move (even for strange scenarios), means that understanding _does exist_ inside the LLM. And the problem following from this is, how to force the LLM to reliably choose the right „paths of thought“ (sorry for that metaphor).
> I think you don't appreciate how good the level of chess displayed here is. It would take an average adult years of dedicated practice to get to 1800.
Since we already have programs that can do this, that definitely aren’t really thinking and don’t “understand” anything at all, I don’t see the relevance of this part.
It seems you're shifting the discourse here. In the context of LLMs, "to understand" is short for "to have a world model beyond the pure relations between words". In that sense, chess engines do "understand" chess, as they operate on a world model. You can even say that they don't understand anything but chess, which makes them extremely un-intelligent and definitely not capable of understanding as we mean it.
However, since an LLM is a generalist engine, if it understands chess there is no reason for it not to understand millions of other concepts and how they relate to each other. And this is the kind of understanding that humans do.
I hate the use of words like "understand" in these conversations.
The system understands nothing, it's anthropomorphising it to say it does.
I have the same conclusion, but for the opposite reason.
It seems like many people tend to use the word "understand" to mean that someone not only believes that a given move is good, but also believes that this knowledge comes from a rational evaluation.
Some attribute this to a non-material soul/mind, some to quantum mechanics or something else that seems magic, while others never realized the problem with such a belief in the first place.
I would claim that when someone can instantly recognize good moves in a given situation, it doesn't come from rationality at all, but from some mix of memory and an intuition that has been built by playing the game many times, with only tiny elements of actual rational thought sprinkled in.
This even holds true when these people start to calculate. It is primarily their intuition that prevents them from spending time on all sorts of unlikely moves.
And this intuition, I think, represents most of their real "understanding" of the game. This is quite different from understanding something like a mathematical proof, which is almost exclusively deductive logic.
And since "understand" so often is associated with rational inductive logic, I think the proper term would be to have "good intuition" when playing the game.
And this "good intuition" seems to me precisely the kind of thing that is trained within most neural nets, even LLM's. (Q*, AlphaZero, etc also add the ability to "calculate", meaning traverse the search space efficiently).
If we wanted to measure how good this intuition is compared to human chess intuition, we could limit an engine like AlphaZero to only evaluate the same number of moves per second that good humans would be able to, which might be around 10 or so.
Maybe with this limitation, the engine wouldn't currently be able to beat the best humans, but even if it reaches a rating of 2000-2500 this way, I would say it has a pretty good intuitive understanding.
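The mechanical knob for that experiment already exists; here's a sketch with python-chess driving a UCI engine, using Stockfish as a stand-in since AlphaZero isn't something you can just download (whether a raw node cap is a fair proxy for human-style intuition is of course debatable):

    import chess
    import chess.engine

    # Assumes a local "stockfish" binary on PATH; nodes=10 means the engine may
    # look at roughly ten positions before it has to commit to a move.
    with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
        board = chess.Board()
        result = engine.play(board, chess.engine.Limit(nodes=10))
        print(board.san(result.move))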
Trying to appropriate perfectly good, generalizable terms as "something that only humans do" brings zero value to a conversation. It's a "god of the gaps" argument, essentially, and we don't exactly have a great track record of correctly identifying things that are uniquely human.
There's very literally currently a whole wealth of papers proving that LLMs do not understand, cannot reason, and cannot perform basic kinds of reasoning that even a dog can perform. But, ok.
There's a whole wealth of papers proving that LLMs do not understand the concepts they write about. That doesn't mean they don't understand grammar – which (as I've claimed since the GPT-2 days) we should, theoretically, expect them to "understand". And what is chess, but a particularly sophisticated grammar?
There's very literally currently a whole wealth of papers proving the opposite, too, so ¯\_(ツ)_/¯.
The whole point of this exercise is to understand what "understand" even means. Because we really don't have a good definition for this, and until we do, statements like "the system understands nothing" are vacuous.
Pretty sure an Elo 1200 player will only give legal moves. It's really not hard to make legal moves in chess.
Casual players make illegal moves all the time. The problem isn't knowing how the pieces move. It's that it's illegal to leave your own king in check. It's not so common to accidentally move your king into check, though I'm sure it happens, but it's very common to accidentally move a piece that was blocking an attack on your king.
I would tend to agree that there's a big difference between attempting to make a move that's illegal because of the state of a different region of the board, and attempting to make one that's illegal because of the identity of the piece being moved, but if your only category of interest is "illegal moves", you can't see that difference.
Software that knows the rules of the game shouldn't be making either mistake.
Casual players don’t make illegal moves so often that you have to assign them a random move after 10 goes.