> But saying "they don't understand" when they can do _this well_ is absurd.
When we talk about understanding a simple axiomatic system, understanding means exactly that all of the axioms are modeled and applied correctly 100% of the time. This is chess, not something squishy like literary criticism. There’s no need to debate semantics at all. One illegal move is a deal breaker.
Undergraduate CS homework for playing any game with any technique would probably have the stipulation that any illegal move disqualifies the submission completely. Whining that it works most of the time would just earn extra pity/contempt as well as an F on the project.
We can argue whether an error rate of 1 in a million means that it plays like a grandmaster or a novice, but that’s less interesting. It failed to model a simple system correctly, and a much shorter/simpler program could do that. Doesn’t seem smart if our response to this as an industry is to debate semantics, ignore the issue, and work feverishly to put it to work modeling more complicated / critical systems.
You just made up a definition of "understand". According to that definition, you are of course right. I just don't think it's a reasonable definition. It's also contradicted by the person I was replying to in the sibling comment, where they argue that Stockfish doesn't understand chess, despite Stockfish of course having the "axioms" modeled and applied correctly 100% of the time.
Here are things people say:
Magnus Carlsen has a better understanding of chess than I do. (Yet we both know the precise rules of the game.)
Grandmasters have a very deep understanding of chess, despite occasionally making illegal moves (https://www.youtube.com/watch?v=m5WVJu154F0).
"If AlphaZero were a conventional engine its developers would be looking at the openings which it lost to Stockfish, because those indicate that there's something Stockfish understands better than AlphaZero." (https://chess.stackexchange.com/questions/23206/the-games-al...)
> Undergraduate CS homework for playing any game with any technique would probably have the stipulation that any illegal move disqualifies the submission completely. Whining that it works most of the time would just earn extra pity/contempt as well as an F on the project.
How exactly is this relevant to the question of whether LLMs can be said to have some understanding of chess? Can they consistently apply the rules when game states are given in PGN? No. _Very_ few humans without specialized training could either (without using a board as a tool to keep track of the implicit state). LLMs certainly "know" the rules (even if they can't apply them) in the sense that they will state them correctly if you ask them to.
I am not particularly interested in "the industry". It's obvious that if you want a system to play chess, you use a chess engine, not an LLM. But I am interested in what their chess abilities teach us about how LLMs build world models. E.g.:
Thanks for your thoughtful comment and refs to chase down.
> You just made up a definition of "understand". According to that definition, you are of course right. I just don't think it's a reasonable definition. ... Here are things people say:
Fine. As others have pointed out and I hinted at, debating terminology is kind of a dead end. I personally don't expect that "understanding chess" is the same as "understanding Picasso", or that those phrases would mean the same thing when applied to people as when applied to AI. I'm also not personally that interested in how performance stacks up compared to humans. Even if it were interesting, the topic of human-equivalent performance would not have static expectations either. For example, human-equivalent error rates in AI are much easier for me to expect and forgive in robotics than they are in axiomatic game-play.
> I am interested in what their chess abilities teach us about how LLMs build world models
Focusing on the single datapoint that TFA is establishing: some LLMs can play some chess with some amount of expertise, with some amount of errors. With no other information at all, this tells us that the LLM failed to model the rules, or failed to apply them, or both.
Based on that, some questions worth asking: Which one of these failure modes is really acceptable, and in which circumstances? Does this failure mode apply to domains other than chess? Does it help if we give it the model directly, say by explaining the rules in the prompt and also explicitly telling it not to make illegal moves? If it's failing to apply rules but excels as a model-maker, then perhaps it can spit out a model directly from examples, and then I can feed the model into a separate engine that makes correct, deterministic steps that actually honor the model?
Saying that LLMs do or don't understand chess is lazy, I guess. My basic point is that the questions above and their implications are so huge and sobering that I'm very uncomfortable with the premature congratulations and optimism that seem to be in vogue. Chess performance is ultimately irrelevant of course, as you say, but what sits under the concrete question is more abstract and very serious. Obviously it is dangerous to create tools/processes that work "most of the time", especially when we're inevitably going to be giving them tasks where we can't check or confirm "legal moves".
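To make that last question concrete, here is a minimal sketch of the "unreliable proposer plus deterministic rule engine" split. Everything in it is an assumption for illustration: tic-tac-toe stands in for chess to keep the rules short, `unreliable_proposer` is a hypothetical stand-in for an LLM call, and `verified_move` is just one way such a wrapper could look.

```python
import random
from typing import Callable, List, Optional

Board = List[Optional[str]]  # 9 cells; None means the cell is empty


def legal_moves(board: Board) -> List[int]:
    """Deterministic rule engine: the only source of truth for legality."""
    return [i for i, cell in enumerate(board) if cell is None]


def unreliable_proposer(board: Board, rng: random.Random) -> int:
    """Hypothetical stand-in for an LLM: picks any cell, occupied or not."""
    return rng.randrange(9)


def verified_move(
    board: Board,
    propose: Callable[[Board, random.Random], int],
    rng: random.Random,
    max_tries: int = 20,
) -> int:
    """Ask the proposer for moves, but only ever return a legal one."""
    legal = legal_moves(board)
    if not legal:
        raise ValueError("no legal moves left")
    for _ in range(max_tries):
        move = propose(board, rng)
        if move in legal:
            return move
    # The proposer never produced a legal move: fall back deterministically.
    return legal[0]


if __name__ == "__main__":
    board: Board = ["X", "O", None, "X", None, "O", None, None, "X"]
    rng = random.Random(0)
    print(verified_move(board, unreliable_proposer, rng))
```

The wrapper guarantees legality no matter how bad the proposer is, but it says nothing about move quality, and it only works in domains where a cheap, trusted `legal_moves` exists.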
> When we talk about understanding a simple axiomatic system, understanding means exactly that all of the axioms are modeled and applied correctly 100% of the time.
Yes, but then, when we talk about understanding in LLMs, we are talking about whether it exists at all, not necessarily about whether it is applied deterministically.
Remember that, while chess engines are (I guess?) deterministic systems, LLMs are usually run as randomized systems. Give one the same context and the same prompt multiple times with sampling enabled, and you can get a different response each time.
To me, this, together with the fact that you have an at least 1-in-10 chance of getting a good move (even for strange scenarios), means that understanding _does exist_ inside the LLM. The problem that follows from this is how to force the LLM to reliably choose the right "paths of thought" (sorry for that metaphor).
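As a rough back-of-the-envelope for what "reliably" would cost if you simply resample: assuming each sample is acceptable independently with probability p, and assuming you have a checker that recognizes an acceptable answer (cheap for legality, much harder for "good"), the chance that at least one of k samples passes is 1 - (1 - p)^k. The function names below are made up for this sketch.

```python
import math


def p_at_least_one(p: float, k: int) -> float:
    """Chance that at least one of k independent samples passes the checker."""
    if not 0.0 <= p <= 1.0 or k < 0:
        raise ValueError("need 0 <= p <= 1 and k >= 0")
    return 1.0 - (1.0 - p) ** k


def tries_needed(p: float, target: float) -> int:
    """Smallest k such that p_at_least_one(p, k) >= target."""
    if not 0.0 < p < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("need 0 < p < 1 and 0 < target < 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    # With the 1-in-10 figure from above and a 99% target:
    print(tries_needed(0.1, 0.99))  # 44
```

The independence assumption is doing a lot of work here. If the model fails on a position for a systematic reason, resampling the same prompt helps far less than this formula suggests.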