sourcepluck 6 days ago

> For one, gpt-3.5-turbo-instruct rarely suggests illegal moves, even in the late game.

It's claimed that this model "understands" chess, and can "reason", and do "actual logic" (here in the comments).

I invite anyone making that claim to find me an "advanced amateur" (as the article says of the LLM's level) chess player who ever makes an illegal move. Anyone familiar with chess can confirm that it doesn't really happen.

Is there a link to the games where the illegal moves are made?

12
grumpopotamus 6 days ago

I am an expert-level chess player, and I have seen multiple people around my level play illegal moves in classical time control games over the board. I have also watched streamers at various levels above me try to play illegal moves repeatedly before realizing the UI was rejecting the move because it was illegal.

zoky 6 days ago

I’ve been to many USCF rated tournaments and have never once seen or even heard of anyone over the age of 8 try to play an illegal move. It may happen every now and then, but it’s exceedingly rare. LLMs, on the other hand, will gladly play the Siberian Swipe, and why not? There’s no consequence for doing so as far as they are concerned.

Dr_Birdbrain 6 days ago

There are illegal moves and there are illegal moves. There is trying to move your king five squares forward (which no amateur would ever do), and there is trying to move your king to a square controlled by an unseen piece, which can happen to somebody who is distracted or otherwise off their game.

Trying to castle through check is one that occasionally happens to me (I am rated 1800 on lichess).

CooCooCaCha 6 days ago

This is an important distinction. No one with chess experience would ever try to move their king five squares, but LLMs will do crazy things like that.

dgfitz 6 days ago

Moving your king to a square controlled by an unnoticed opponent piece is simply responded to with "check", no?

james_marks 6 days ago

No, that would break the rule that one cannot move into check.

dgfitz 6 days ago

Sorry yes, I meant the opponent would point it out. I’ve never played professional chess.

umanwizard 6 days ago

Sure, the opponent would point it out, just like they would presumably point it out if you played any illegal move. In serious tournament games they would probably also stop the clock, call over the arbiter, and inform him or her that you made an illegal move so you can be penalized (e.g. under FIDE rules if you make an illegal move your opponent gets 2 extra minutes on the clock).

That doesn't change that it's an illegal move.

dgfitz 6 days ago

For sure. I didn't realize moving into check counted as an illegal move, in the sense that I've only played casually, where the opponent (or I) just points it out.

umanwizard 6 days ago

Yeah, "illegal move" is just a synonym for "things you're not allowed to do". There's no separate category of moves that you aren't allowed to make, but that aren't considered illegal.

jeremyjh 6 days ago

I'm rated 1450 USCF and I think I've seen 3 attempts to play an illegal move across around 300 classical games OTB. Only one of them was me. In blitz it does happen more.

WhyOhWhyQ 6 days ago

Would you say the apparent contradiction between what you and other commenters are saying is partly explained by the high volume of games you're playing? Or do you think there is some other reason?

da_chicken 6 days ago

I wouldn't. I never progressed beyond chess clubs in public schools and I certainly remember people making illegal moves in tournaments. Like that's why they make you both record all the moves. Because people make mistakes. Though, honestly, I remember more notation errors than play errors.

Accidentally moving into check is probably the most common illegal move. Castling through check is surprisingly common, too. Actually moving a piece incorrectly is fairly rare, though. I remember one tournament where one of the matches ended in a DQ because one of the players had two white bishops.

ASUfool 6 days ago

Could one have two white bishops after promoting a pawn?

umanwizard 6 days ago

Yes, it's theoretically possible to have two light-squared bishops due to promotions, but it's so exceedingly rare that I think most professional chess players will go their whole careers without ever seeing it happen.

da_chicken 6 days ago

Outside of playing a game for piece value? No, not really.

In this case, of course, someone had moved their bishop from a dark square to a light square and their opponent didn't catch it until a while later.

IanCal 6 days ago

Promoting to anything other than a queen is rare, and I expect the next most common is to a knight. Promoting to a bishop, while possible, is going to be extremely rare.

nurettin 6 days ago

At what level are you considered an expert? IM? CM? 1900 ELO OTB?

umanwizard 6 days ago

In the US, at least, 2000 USCF is considered "expert".

rgoulter 6 days ago

> I invite anyone making that claim to find me an "advanced amateur" (as the article says of the LLM's level) chess player who ever makes an illegal move. Anyone familiar with chess can confirm that it doesn't really happen.

This is somewhat imprecise (or inaccurate).

A quick search on YouTube for "GM illegal moves" indicates that GMs have made illegal moves often enough for there to be compilations.

e.g. https://www.youtube.com/watch?v=m5WVJu154F0 -- The Vidit vs Hikaru one is perhaps the most striking, where Vidit uses his king to attack Hikaru's king.

banannaise 6 days ago

A bunch of these are just improper procedure: several players hit the clock before choosing a promotion piece, and one touches a piece that cannot be moved. Even the ones that aren't procedural look like rational chess moves; they just fail to notice a detail of the board state (with the possible exception of Vidit's very funny king attack, which actually might have been clock manipulation to buy time to think with 0:01 on the clock).

Whereas the LLM makes "moves" that clearly indicate no ability to play chess: moving pieces to squares well outside their legal moveset, moving pieces that aren't on the board, etc.

fl7305 6 days ago

Can a blind man sculpt?

What if he makes mistakes that a seeing person would never make?

Does that mean that the blind man is not capable of sculpting at all?

sixfiveotwo 6 days ago

> Whereas the LLM makes "moves" that clearly indicate no ability to play chess: moving pieces to squares well outside their legal moveset, moving pieces that aren't on the board, etc.

Do you have any evidence of that? TFA doesn't talk about the nature of these errors.

krainboltgreene 6 days ago

Yeah like several hundred "Chess IM/GMs react to ChatGPT playing chess" videos on youtube.

sixfiveotwo 6 days ago

Very strange; I cannot spot any specifically saying that ChatGPT cheated or played an illegal move. Can you help?

krainboltgreene 6 days ago

https://www.youtube.com/watch?v=iWhlrkfJrCQ He has quite a few of these.

sixfiveotwo 5 days ago

> Yeah like several hundred "Chess IM/GMs react to ChatGPT playing chess" videos on youtube.

If I were to take that sentence literally, I would ask for at least 199 other examples, but I imagine that it was just a figure of speech. Nevertheless, if that's only one player complaining (even several times), can we really conclude that ChatGPT cannot play? Is that enough evidence, or is there something else at work?

I suppose indeed one could, if one expected an LLM to be ready to play out of the box, and that would be a fair criticism.

krainboltgreene 5 days ago

I really wish I hadn't replied to you.

sixfiveotwo 3 days ago

I'm sorry if you feel that way.

I am in no way trying to judge you; rather, I'm trying to get closer to the truth in that matter, and your input is valuable, as it points out a discrepancy wrt TFA, but it is also subject to caution, since it reports the results of only one chess player (right?). Furthermore, both in the case of TFA and this youtuber, we don't have full access to their whole experiments, so we can't reproduce the results, nor can we try to understand why there is a difference.

I might very well be mistaken though, and I am open to criticisms and corrections, of course.

SonOfLilit 6 days ago

But clearly the author got his GPT to play orders of magnitude better than in those videos.

krainboltgreene 6 days ago

In no way is it clear; there's no evidence.

quuxplusone 6 days ago

"Most striking" in the sense of "most obviously not ever even remotely legal," yeah.

But the most interesting and thought-provoking one in there is [1] Carlsen v Inarkiev (2017). Carlsen puts Inarkiev in check. Inarkiev, instead of making a legal move to escape check, does something else. Carlsen then replies to that move. Inarkiev challenges: Carlsen's move was illegal, because the only legal "move" at that point in the game was to flag down an arbiter and claim victory, which Carlsen didn't!

[1] - https://www.youtube.com/watch?v=m5WVJu154F0&t=7m52s

The tournament rules at the time, apparently, fully covered the situation where the game state is legal but the move is illegal. They didn't cover the situation where the game state was actually illegal to begin with. I'm not a chess person, but it sounds like the tournament rules may have been amended after this incident to clarify what should happen in this kind of situation. (And Carlsen was still declared the winner of this game, after all.)

LLM-wise, you could spin this to say that the "rational grandmaster" is as fictional as the "rational consumer": Carlsen, from an actually invalid game state, played "a move that may or may not be illegal just because it sounds kinda “chessy”," as zoky commented below that an LLM would have done. He responded to the gestalt (king in check, move the king) rather than to the details (actually this board position is impossible, I should enter a special case).

OTOH, the real explanation could be that Carlsen was just looking ahead: surely he knew that after his last move, Inarkiev's only legal moves were harmless to him (or fatalistically bad for him? Rxb7 seems like Inarkiev's correct reply, doesn't it? Again I'm not a chess person) and so he could focus elsewhere on the board. He merely happened not to double-check that Inarkiev had actually played one of the legal continuations he'd already enumerated in his head. But in a game played by the rules, he shouldn't have to double-check that — it is already guaranteed by the rules!

Anyway, that's why Carlsen v Inarkiev struck me as the most thought-provoking illegal move, from a computer programmer's perspective.

zoky 6 days ago

It's exceedingly rare, though. There's a big difference between accidentally failing to notice that a move is illegal in a complicated situation, and playing a move that may or may not be illegal just because it sounds kinda "chessy", which is pretty much what LLMs do.

ifdefdebug 6 days ago

Yes, but LLM illegal moves often are not chessy at all. A chessy illegal move, for instance, would be trying to move a rook when you don't notice that it's pinned between your king and an attacking bishop. LLMs would often happily play Ba4 when there's no bishop anywhere near a square from which it could reach a4, or even no bishop left at all. That's not chessy, that's just weird.

I have to admit it's been a while since I played ChatGPT, so maybe it has improved.

tacitusarc 6 days ago

The one where Caruana improperly presses his clock and then claims he did not so as not to lose, and the judges believe him, is frustrating to watch.

_heimdall 6 days ago

This is the problem with LLM researchers all but giving up on inspecting how the LLM actually works internally.

As long as the LLM is a black box, it's entirely possible that (a) the LLM does reason through the rules and understands which moves are legal, or (b) it was trained on a large set of legal moves and therefore only learned to make legal moves. You can claim either case is the real truth, but we have absolutely no way to know, because we have absolutely no way to actually understand what the LLM was "thinking".

codeulike 6 days ago

Here's an article where they teach an LLM Othello and then probe its internal state to assess whether it is 'modelling' the Othello board internally:

https://thegradient.pub/othello/

Associated paper: https://arxiv.org/abs/2210.13382
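Roughly, the probing setup looks like this. A minimal sketch with stand-in arrays, not the paper's code; real experiments train one probe per board square on the transformer's actual hidden activations, and the paper's strongest results come from nonlinear rather than linear probes:

    # Sketch: can a simple probe read one square's state out of the activations?
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_positions, d_model = 10_000, 512                # hypothetical sizes
    hidden = rng.normal(size=(n_positions, d_model))  # stand-in for real activations
    square = rng.integers(0, 3, size=n_positions)     # empty/black/white, one square

    probe = LogisticRegression(max_iter=1000).fit(hidden[:8000], square[:8000])
    print(probe.score(hidden[8000:], square[8000:]))  # ~0.33 on noise (chance level)

On real activations, held-out accuracy well above chance is the evidence that the board state is decodable from the model's internals.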

mattmcknight 6 days ago

It's weird, because it is not a black box at the lowest level: we can see exactly what all of the weights are doing. It's just too complex for us to understand.

What is difficult is finding some intermediate pattern in there that we can label with an abstraction compatible with human understanding. It may not exist. For example, it may be more like how our brain works to produce language than like a logical rule-based system. We occasionally say the wrong word, skip a word, spell things wrong... violate the rules of grammar.

The inputs and outputs of the model are human language, so at least there we know the system can be characterized as a black box, if not understood.

_heimdall 6 days ago

> The inputs and outputs of the model are human language, so at least there we know the system can be characterized as a black box, if not understood.

This is actually where the AI safety debates tend to get lost. From where I sit, we can't characterize the black box itself; we can only characterize the outputs.

More specifically, we can decide what we think of the quality of the output for a given input, and we can attempt to infer what might have happened in between. We really have no idea what happened in between, and though many of the "doomers" raise concerns that seem far-fetched, we have absolutely no way of knowing whether they are completely off base or are raising concerns about a system that just hasn't shown problems in its input/output pairs yet.

lukeschlather 6 days ago

> (a) the LLM does reason through the rules and understands what moves are legal or (b) was trained on a large set of legal moves and therefore only learned to make legal moves.

How can you learn to make legal moves without understanding what moves are legal?

_heimdall 6 days ago

I'm spitballing here, so definitely take this with a grain of salt.

If I only see legal moves, I may not think outside the box and come up with moves other than what I already saw. Humans run into this all the time: we see things done a certain way, effectively learn that that's just how to do it, and don't innovate.

Said differently, if the generative AI isn't actually being generative at all, meaning it's just predicting based on the training set, it could be producing only legal moves without ever learning or understanding the rules of the game.

adelineJoOs 6 days ago

I am not an ML person, and I know there is a mathematical explanation for what I am about to write, but here comes my informal reasoning:

I fear this is not the case:

1) Either the LLM (or other form of deep neural network) can reproduce exactly what it saw, but nothing new (then it would only produce legal moves, if it was trained on only legal ones), or

2) the LLM can produce moves that it did not see exactly, by outputting the "most probable"-looking move in a situation it has never seen before. In effect, this combines different situations and their outputs into a new output. As a result of this "mixing", it might output an illegal move (the output move is illegal in the new situation), despite having been trained on only legal moves.
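A toy (non-LLM) illustration of point 2, sketched with the python-chess library; the "predictor" here is just a hard-coded pattern, which is exactly the point:

    # A predictor trained only on legal games can still emit an illegal move
    # in an unseen position, because it matches the surface pattern
    # (knights developed -> play Bb5), not the rules.
    import chess

    learned_reply = "Bb5"  # pattern absorbed from legal training games

    # An unseen position where that pattern is illegal: the bishop is already on b5.
    board = chess.Board()
    for san in ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6"]:
        board.push_san(san)

    try:
        board.parse_san(learned_reply)  # raises on illegal moves
        print(learned_reply, "is legal here")
    except ValueError:  # chess.IllegalMoveError is a ValueError
        print(learned_reply, "is illegal here, despite legal-only training data")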

In fact, I am not even sure the deep neural networks we use in practice can replicate their training data exactly; it seems to me that there is some kind of compression going on when knowledge is embedded into the network, which comes with a loss.

I am deeply convinced that LLMs will never be an exact technology (but LLMs + other technology like proof assistants or compilers might be).

_heimdall 6 days ago

Oh, I don't think there is any expectation for LLMs to reproduce any training data exactly. By design an LLM is a lossy compression algorithm; data can't be expected to be an exact reproduction.

The question I have is whether the LLM might be reproducing mostly legal moves only because it was trained on a set of data that itself only included legal moves. The training data would have only helped it predict legal moves, and any illegal moves it predicts may very well be because LLMs are designed with random variables as part of the prediction loop.

ramraj07 6 days ago

I think they'll acknowledge these models are truly intelligent only when the LLMs also irrationally go in circles around logic to insist LLMs are statistical parrots.

_heimdall 6 days ago

Acknowledging an LLM is intelligent requires a general agreement of what intelligence is and how to measure it. I'd also argue that it requires a way of understanding how an LLM comes to its answer rather than just inputs and outputs.

To me that doesn't seem unreasonable and has nothing to do with irrationally going in circles, curious if you disagree though.

Retric 6 days ago

Humans judge if other humans are intelligent without going into philosophical circles.

How well they learn completely novel tasks (fail in conversation, pass with training). How well they do complex tasks (debated; just look at this thread). How generally knowledgeable they are (pass). How often they do nonsensical things (fail).

So IMO it really comes down to whether you're judging by peak performance or minimum standards. If I had an employee who performed as well as an LLM, I'd call them an idiot, because they'd need constant supervision for even trivial tasks, but that's not the standard everyone is using.

_heimdall 6 days ago

> Humans judge if other humans are intelligent without going into philosophical circles

That's totally fair. I expect that to continue to work well when kept in the context of something/someone else that is roughly as intelligent as you are. Bonus points for the fact that one human understands what it means to be human and we all have roughly similar experiences of reality.

I'm not so sure that kind of judging intelligence by feel works when you are judging something that is (a) totally different from you or (b) massively more (or less) intelligent than you are.

For example, I could see something much smarter than me as acting irrationally, when in reality it may be working with a much larger or more complex set of facts and context that doesn't make sense to me.

raincole 6 days ago

> we have absolutely no way to know

To me, this means that it absolutely doesn't matter whether LLM does reason or not.

_heimdall 6 days ago

It might if AI/LLM safety is a concern. We can't begin to really judge safety without understanding how they work internally.

zarzavat 6 days ago

An LLM is essentially playing blindfold chess if it just gets the moves and not the position. You have to be fairly good to never make illegal moves in blindfold.

pera 6 days ago

A chat conversation where every single move is written down and accessible at any time is not the same as blindfold chess.

gwd 6 days ago

OK, but the LLM is still playing without a board to look at, except what's "in its head". How often would 1800 ELO chess players make illegal moves when playing only using chess notation over chat, with no board to look at?

What might be interesting is to see if there was some sort of prompt the LLM could use to help itself; e.g., "After repeating the entire game up until this point, describe relevant strategic and tactical aspects of the current board state, and then choose a move."
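Something like this, maybe. An untested sketch against the OpenAI completions API (gpt-3.5-turbo-instruct is a completion model, not a chat model); the prompt wording and the "move goes on the last line" convention are my own assumptions:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def choose_move(pgn_so_far: str) -> str:
        prompt = (
            "You are playing chess. The game so far, in PGN:\n"
            f"{pgn_so_far}\n\n"
            "Repeat the entire game up until this point, then describe the "
            "relevant strategic and tactical aspects of the current board "
            "state, and then choose a move. Put the move alone, in SAN, on "
            "the final line."
        )
        resp = client.completions.create(
            model="gpt-3.5-turbo-instruct",
            prompt=prompt,
            max_tokens=500,
            temperature=0.3,
        )
        return resp.choices[0].text.strip().splitlines()[-1].strip()

    # e.g. choose_move("1. e4 e5 2. Nf3")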

Another thing that's interesting is the 1800 ELO cut-off of the training data. If the cut-off were 2000, or 2200, would that improve the results?

Or, if you included training data but labeled it with the player's ELO, could you request play at a specific ELO? Being able to play against a 1400 ELO computer that made the kind of mistakes a 1400 ELO human would make would be amazing.

wingmanjd 6 days ago

MaiaChess [1] supposedly plays at a specific ELO, making the kinds of mistakes a human would make at that level.

It looks like they have 3 public bots on lichess.org: 1100, 1500, and 1900

[1] https://www.maiachess.com/

zbyforgotp 6 days ago

You can make it available to the player and I suspect it wouldn’t change the outcomes.

sebzim4500 6 days ago

Sure, but I'm better than 99% of people at chess, and if I were playing under those conditions there is a high chance I would make an illegal move.

lukeschlather 6 days ago

The LLM can't refer to notes; it is just relying on its memory of what input tokens it had.

fmbb 6 days ago

Does it not always have a list of all the moves in the game at hand in the prompt?

You have to give this human the same log of the game to refer to.

xg15 6 days ago

I think even then it would still be blindfold chess, because humans do a lot of "pattern matching" on the actual board state in front of them. If you only have the moves, you have to reconstruct this board state in your head.
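For what it's worth, the reconstruction itself is mechanical if you have a tool; the blindfold part is having to hold the result in your head. A quick python-chess sketch:

    import chess

    # Replay a move list to recover the board state it implies.
    board = chess.Board()
    for san in "e4 e5 Nf3 Nc6 Bb5 a6 Ba4 Nf6 O-O".split():
        board.push_san(san)
    print(board)  # ASCII diagram of the position after 5.O-O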

hamilyon2 6 days ago

The discussion in this thread is amazing. People, even renowned experts in their field, make mistakes, a lot of mistakes, sometimes very costly and very obvious in retrospect. In their craft.

Yet when an LLM, trained on a corpus of human stupidity no less, makes illegal moves in chess, our brain immediately goes: I don't make illegal moves in chess, so how can a computer play chess if it does?

Perfect examples of metacognitive bias and the fundamental attribution error, at least.

stonemetal12 6 days ago

It isn't a binary does/doesn't question. It is a question of the frequency and "quality" of the mistakes. If it is making illegal moves 0.1% of the time, then sure, everybody makes mistakes. If it is 30% of the time, then it isn't doing so well. If the illegal moves it tries to make are basic "pieces don't move like that" errors, then the next-token prediction isn't predicting so well. If the illegality of the moves is more subtle, then maybe it isn't too bad.
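Measuring that is straightforward, at least in sketch form (python-chess; `model_move` is a placeholder for whatever LLM call is under test, playing both sides here for simplicity):

    import chess

    def illegal_move_rate(model_move, n_games=100, max_plies=80):
        attempts = illegal = 0
        for _ in range(n_games):
            board = chess.Board()
            for _ in range(max_plies):
                if board.is_game_over():
                    break
                san = model_move(board)
                attempts += 1
                try:
                    board.push_san(san)  # raises ValueError on illegal SAN
                except ValueError:
                    illegal += 1
                    break  # or re-prompt, depending on the test protocol
        return illegal / attempts  # 0.1% and 30% tell very different stories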

But more than being able to make moves: if we claim it understands chess, shouldn't it be able to explain why it chose one move over another?

sourcepluck 6 days ago

You would be correct to be amazed if someone was arguing:

"Look! It made mistakes, therefore it's definitely not reasoning!"

That's certainly not what I'm saying, anyway. I was responding to the argument actually being made by many here, which is:

"Look! It plays pretty poorly, but not totally crap, and it wasn't trained for playing just-above-poor chess, therefore, it understands chess and definitely is reasoning!"

I find this - and much of the surrounding discussion - to be quite an amazing display of people's biases, myself. People want to believe LLMs are reasoning, and so we're treated to these merry-go-round "investigations".

stefan_ 6 days ago

No, my brain goes that the machine constantly suggesting "jump off now!" in between the occasional legal chess move probably isn't quite right in the head. And the people suggesting this is all perfectly fine, because we can post-hoc decide which moves are legal, who aren't even willing to entertain the notion that this invalidates their little experiment: well, maybe those aren't the ones we want deploying this kind of thing.

jeremyjh 6 days ago

Yes, I don't even know what it means to say it's 1800 strength and yet plays illegal moves frequently enough that you have to code retry logic into the test harness. Under FIDE rules, after two illegal moves the game is declared lost by the player who made them. I'm wondering what its rating would be if this rule were followed.
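A sketch of what scoring under that rule would look like, instead of silently retrying (python-chess; `model_move` and `opponent_move` are stand-ins for the two players):

    import chess

    def play_with_forfeit(model_move, opponent_move):
        board = chess.Board()
        illegal_attempts = 0
        while not board.is_game_over():
            san = model_move(board)
            try:
                board.push_san(san)
            except ValueError:
                illegal_attempts += 1
                if illegal_attempts >= 2:
                    return "0-1"  # second illegal move loses the game
                continue  # first offense: re-prompt and carry on
            if board.is_game_over():
                break
            board.push_san(opponent_move(board))
        return board.result()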

og_kalu 6 days ago

> Yes, I don't even know what it means to say it's 1800 strength and yet plays illegal moves frequently enough that you have to code retry logic into the test harness.

People are really misunderstanding things here. The one model that can actually play at lichess 1800 Elo does not need any of that and will play thousands of moves before a single illegal one. But he isn't just testing that one specific model. He is testing several models, some of which cannot reliably output legal moves (and as such, this logic is required).

alain94040 6 days ago

> find me an "advanced amateur" (as the article says of the LLM's level) chess player who ever makes an illegal move

Without a board to look at, just with the same linear text input given in the prompt? I bet a lot of amateurs would not give you legal moves. No drawing or side piece of paper allowed.

GaggiX 6 days ago

I can confirm that an advanced amateur can play illegal moves by playing blindfold chess as shown in this article.

chis 6 days ago

I agree with others that it's similar to blindfold chess, and would also add that the AI gets no time to "think", since it lacks chain of thought like the new o1 models have. So it's equivalent to an advanced player, blindfolded, making moves off pure intuition without System 2 thought.

fl7305 6 days ago

> It's claimed that this model "understands" chess, and can "reason", and do "actual logic" (here in the comments).

You can divide reasoning into three levels:

1) Can't reason - just regurgitates from memory

2) Can reason, but makes mistakes

3) Always reasons perfectly, never makes mistakes

If an LLM makes mistakes, you've proven that it doesn't reason perfectly.

You haven't proven that it can't reason.

eimrine 6 days ago

Do you know what it is to reason? LLMs can't do Socrates' method; are there any other ways to reason?

fl7305 6 days ago

Not sure what you mean by "LLMs can't do Socrates' method"?

But: Plenty of people struggle with playing along with Socrates' method. Can they not reason at all?

eimrine 5 days ago

> Not sure what you mean by "LLMs can't do Socrates' method"?

I hope you can translate the conversation between a philosopher and ChatGPT from Russian [1]. The philosopher's side of the conversation is built on Socrates' method, but ChatGPT cannot even react consistently.

> Plenty of people struggle with playing along with Socrates' method. Can they not reason at all?

I do not hold the opinion that ChatGPT "struggles" with Socrates' method; I am clearly seeing that it cannot use it at all, even from the answering side of a Socratic dialogue, which is not that hard. ChatGPT cannot use Socrates' method from the questioning side of the dialogue by design, because it never asks questions.

[1] https://hvylya.net/analytics/268340-dialogi-sergeya-dacyuka-...

mattmcknight 6 days ago

> I invite anyone making that claim to find me an "advanced amateur" (as the article says of the LLM's level) chess player who ever makes an illegal move.

I would say the analogy is more like someone saying chess moves aloud. So, just as we all misspeak or misspell things from time to time, the model output will have an error rate.

bjackman 6 days ago

So just because it has different failure modes, it doesn't count as reasoning? Is reasoning just "behaving exactly like a human"? In that case the statement "LLMs can't reason" is unfalsifiable and meaningless. (Which, yeah, maybe it is.)

The bizarre intellectual quadrilles people dance to sustain their denial of LLM capabilities will never cease to amaze me.