You just made up a definition of "understand". According to that definition, you are of course right. I just don't think it's a reasonable definition. It's also contradicted by the person I was replying to in the sibling comment, where they argue that Stockfish doesn't understand chess, despite Stockfish of course having the "axioms" modeled and applied correctly 100% of the time.
Here are things people say:
Magnus Carlsen has a better understanding of chess than I do. (Yet we both know the precise rules of the game.)
Grandmasters have a very deep understanding of Chess, despite occasionally making illegal moves that are not according to the rules (https://www.youtube.com/watch?v=m5WVJu154F0).
"If AlphaZero were a conventional engine its developers would be looking at the openings which it lost to Stockfish, because those indicate that there's something Stockfish understands better than AlphaZero." (https://chess.stackexchange.com/questions/23206/the-games-al...)
> Undergraduate CS homework for playing any game with any technique would probably have the stipulation that any illegal move disqualifies the submission completely. Whining that it works most of the time would just earn extra pity/contempt as well as an F on the project.
How exactly is this relevant to the question whether LLMs can be said to have some understanding of chess? Can they consistently apply the rules when game states are given in pgn? No. _Very_ few humans without specialized training could either (without using a board as a tool to keep track of the implicit state). They certainly "know" the rules (even if they can't apply them) in the sense that they will state them correctly if you ask them to.
I am not particularly interested in "the industry". It's obvious that if you want a system to play chess, you use a chess engine, not an LLM. But I am interested in what their chess abilities teaches us about how LLMs build world models. E.g.:
Thanks for your thoughtful comment and refs to chase down.
> You just made up a definition of "understand". According to that definition, you are of course right. I just don't think it's a reasonable definition. ... Here are things people say:
Fine. As others have pointed out and I hinted at.. debating terminology is kind of a dead end. I personally don't expect that "understanding chess" is the same as "understanding Picasso", or that those phrases would mean the same thing if they were applied to people vs for AI. Also.. I'm also not personally that interested in how performance stacks up compared to humans. Even if it were interesting, the topic of human-equivalent performance would not have static expectations either. For example human-equivalent error rates in AI are much easier for me to expect and forgive in robotics than they are in axiomatic game-play.
> I am interested in what their chess abilities teaches us about how LLMs build world models
Focusing on the single datapoint that TFA is establishing: some LLMs can play some chess with some amount of expertise, with some amount of errors. With no other information at all, this tells us that it failed to model the rules, or it failed in the application of those rules, or both.
Based on that, some questions worth asking: Which one of these failure modes is really acceptable and in which circumstances? Does this failure mode apply to domains other than chess? Does it help if we give it the model directly, say by explaining the rules directly in the prompt and also explicitly stating to not make illegal moves? If it's failing to apply rules, but excels as a model-maker.. then perhaps it can spit out a model directly from examples, and then I can feed the model into a separate engine that makes correct, deterministic steps that actually honor the model?
Saying that LLMs do or don't understand chess is lazy I guess. My basic point is that the questions above and their implications are so huge and sobering that I'm very uncomfortable with premature congratulations and optimism that seems to be in vogue. Chess performance is ultimately irrelevant of course, as you say, but what sits under the concrete question is more abstract but very serious. Obviously it is dangerous to create tools/processes that work "most of the time", especially when we're inevitably going to be giving them tasks where we can't check or confirm "legal moves".