theptip 6 days ago

This is a crazy goalpost move. TFA is proving a positive capability and rejecting the null hypothesis that “LLMs can’t think, they just regurgitate”.

Making some illegal moves doesn’t invalidate the demonstrated situational logic intelligence required to play at Elo 1800.

(Another angle: a human on Chess.com also has any illegal move they try to make ignored.)
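For what it's worth, the harness being described is tiny. Here's a minimal sketch of the "ignore illegal attempts" loop, assuming python-chess and a hypothetical query_llm function that returns a move in SAN; this is an illustration, not the article's actual code:

```python
import chess

def next_move(board: chess.Board, query_llm, max_attempts: int = 3):
    """Ask the model for a move; ignore illegal suggestions, the way a UI would."""
    for _ in range(max_attempts):
        san = query_llm(board)           # hypothetical: prompt with the game so far
        try:
            return board.parse_san(san)  # raises if the move is illegal or unparsable
        except ValueError:
            continue                     # the illegal attempt is simply discarded
    return None                          # caller decides what repeated failure means
```

Whether those discarded attempts should count against the model is exactly what's being argued here.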

photonthug 6 days ago

> Making some illegal moves doesn’t invalidate the demonstrated situational logic intelligence

That’s exactly what it does. 1 illegal move in 1 million or 100 million or any other sample size you want to choose means it doesn’t understand chess.

People in this thread are really distracted by the medical analogy, so I’ll offer another: you’ve got a bridge that lets millions of vehicles cross, but it randomly collapses if you tickle it wrong, say with a car of a rare color. One key property of bridges is that they work reliably for any vehicle, and once they fail they don’t work for any vehicle. A bridge that sometimes fails and sometimes doesn’t isn’t a bridge so much as a death trap.

og_kalu 6 days ago

>1 illegal move in 1 million or 100 million or any other sample size you want to choose means it doesn’t understand chess

Highly rated chess players make illegal moves. It's rare, but it happens. They don't understand chess?

photonthug 6 days ago

> Then no human understands chess

Humans with correct models may nevertheless make errors in rule applications. Machines are good at applying rules, so when they fail to apply rules correctly, it means they have incorrect, incomplete, or totally absent models.

Without using a word like “understands”, it seems clear that the same apparent mistake has different causes, and model errors are very different from model-application errors. In a math or physics class, this is roughly the difference between a carry-the-one arithmetic error and using an equation from a completely wrong domain. The word “understands” is loaded in discussions of LLMs, but everyone knows which mistake is going to get partial credit vs zero credit on an exam.

og_kalu 6 days ago

>Humans with correct models may nevertheless make errors in rule applications.

Ok.

>Machines are good at applying rules, so when they fail to apply rules correctly, it means they have incorrect or incomplete models.

I don't know why people continue to force the wrong abstraction. LLMs do not work like 'machines'. They don't 'follow rules' the way we understand normal machines to 'follow rules'.

>so when they fail to apply rules correctly, it means they have incorrect or incomplete models.

Everyone has incomplete or incorrect models. It doesn't mean we always say they don't understand. Nobody says Newton didn't understand gravity.

>Without using a word like “understands”, it seems clear that the same apparent mistake has different causes, and model errors are very different from model-application errors.

It's not very apparent, no. You've just decided it has different causes because of preconceived notions about how you think all machines must operate in all configurations.

LLMs are not the logic automatons of science fiction. They don't behave or act like normal machines in any way. The internals run some computations to make predictions, but so does your nervous system. Computation is substrate-independent.

I don't even know how you can make this distinction without seeing what sort of illegal moves it makes. If it makes the sort that high-rated players make, then what?

photonthug 6 days ago

I can’t tell if you are saying the distinction between model errors and model-application errors doesn’t exist, doesn’t matter, or doesn’t apply here.

og_kalu 6 days ago

I'm saying:

- Generally, we do not say someone does not understand just because of a model error. The model error has to be sufficiently large, or the model sufficiently narrow. No one says Newton didn't understand gravity just because his model has an error in it, but we might say he didn't understand some aspects of it.

- You are saying the LLM is making a model error (rather than an application error) only because of preconceived notions of how 'machines' must behave, not on any rigorous examination.

photonthug 6 days ago

Suppose you're right: the internal model of the game rules is perfect, but the application of the model to picking the next move is imperfect. Unless we can actually separate the two, does it matter? Functionally, I mean, not philosophically. If the model were correct, maybe we could get a useful version of it out by asking it to write a chess engine instead of acting as one. But when the Prolog code for that is as incorrect as the illegal chess move was, will you say again that the model is correct and its usage merely resulted in minor errors?
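If someone actually wanted to separate the two, one crude probe (just a sketch, not what the article does; query_legal_moves and query_move stand in for hypothetical LLM prompts) is to ask the model to enumerate the legal moves in a position, separately ask it to play one, and score both against a real rules engine like python-chess:

```python
import chess

def probe(board: chess.Board, query_legal_moves, query_move):
    """Crude split: a wrong legal-move set suggests a model error; playing outside a
    correctly stated set suggests a model-application error."""
    truth = {board.san(m) for m in list(board.legal_moves)}
    claimed = set(query_legal_moves(board))  # hypothetical: "list every legal move here"
    played = query_move(board)               # hypothetical: "play your next move"

    model_error = claimed != truth
    application_error = claimed == truth and played not in truth
    return model_error, application_error
```

Even this is shaky, since enumerating the legal set is itself an application of the model, which is part of why I doubt the separation buys us much.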

> You are saying the LLM is making a model error (rather than an application error) only because of preconceived notions of how 'machines' must behave, not on any rigorous examination.

Here's an anecdotal examination. After much talk about LLMs and chess, and math, and formal logic, here's the state of the art, simplified from a dialogue with GPT today:

> blue is red and red is blue. what color is the sky? >> <blah blah, restates premise, correctly answers "red">

At this point fans rejoice, saying it understands hypotheticals and logic. The dialogue continues:

> name one red thing >> <blah blah, restates premise, incorrectly offers "strawberries are red">

At this point detractors rejoice and declare that it doesn't understand. Now the conversation devolves into semantics or technicalities about prompt hacks, training data, weights. Whatever. We don't need chess. Just look at it, it's broken as hell. Discussing whether the error is human-equivalent isn't the point either. It's broken! A partially broken process is no solid foundation to build others on. And while there are some exceptions, an unreliable tool/agent is often worse than none at all.

og_kalu 6 days ago

>It's broken! A partially broken process is no solid foundation to build others on. And while there are some exceptions, an unreliable tool/agent is often worse than none at all.

Are humans broken? Because our reasoning is a very broken process. You say it's no solid foundation? Take a look around you. This broken processor is the foundation of society and the conveniences you take for granted.

For the vast, vast majority of human history, there wasn't anything even remotely resembling a non-broken general reasoner. And you know the funny thing? There still isn't. When people like you say LLMs don't reason, they hold them to a standard that doesn't exist. Where is this non-broken general reasoner, anywhere outside fiction and your own imagination?

>And while there are some exceptions, an unreliable tool/agent is often worse than none at all.

Since you clearly take 'reliable' to mean 'makes no mistake/is not broken', no human is a reliable agent. Clearly, the real exception is when an unreliable agent is worse than nothing at all.

bawolff 6 days ago

This feels more like a metaphysical argument about what it means to "know" something, which is really irrelevant to what is interesting about the article.

sixfiveotwo 6 days ago

> Machines are good at applying rules, so when they fail to apply rules correctly, it means they have incorrect, incomplete, or totally absent models.

That's assuming that, somehow, an LLM is a machine. Why would you think that?

photonthug 6 days ago

Replace the word with one of your own choosing, if that will help us get to the part where you have a point to make?

I think we are discussing whether LLMs can emulate chess playing machines, regardless of whether they are actually, literally composed of a flock of stochastic parrots.

sixfiveotwo 6 days ago

That's simple logic. Quoting you again:

> Machines are good at applying rules, so when they fail to apply rules correctly, it means they have incorrect, incomplete, or totally absent models.

If this line of reasoning applies to machines, but LLMs aren't machines, how can you derive any of these claims?

"A implies B" may be right, but you must first demonstrate A before concluding B.

> I think we are discussing whether LLMs can emulate chess playing machines

That is incorrect. We're discussing whether LLMs can play chess. Unless you think that human players also emulate chess playing machines?

XenophileJKO 6 days ago

Engineers really have a hard time coming to terms with probabilistic systems.

benediktwerner 6 days ago

Try giving a random human the first 30 moves of a game and asking them to make a non-terrible legal move. Average humans quite often try to make illegal moves even when they can clearly see the board in front of them. There are plenty of cases where people reported a bug because a chess application didn't let them make an illegal move they thought was legal.

And the sudden comparison to something that's safety critical is extremely dumb. Nobody said we should tie the LLM to a nuclear bomb that explodes if it makes a single mistake in chess.

The point is that it plays at a level far, far above making random legal moves, or even above average humans. To say that that doesn't mean anything because it's not perfect is simply insane.
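For a sense of that baseline: a random-legal-move player, roughly the floor being compared against, is a few lines with python-chess. A minimal sketch, not anything the article ran:

```python
import random
import chess

def random_mover(board: chess.Board) -> chess.Move:
    """The weakest player that still follows the rules: uniform over legal moves."""
    return random.choice(list(board.legal_moves))

# Self-play demo: two random movers stumble their way to a finished game.
board = chess.Board()
while not board.is_game_over():
    board.push(random_mover(board))
print(board.result())
```

Playing at around 1800 is wildly beyond this floor, and well beyond most casual human players too.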

photonthug 6 days ago

> And the sudden comparison to something that's safety critical is extremely dumb. Nobody said we should tie the LLM to a nuclear bomb that explodes if it makes a single mistake in chess.

But it becomes safety-critical very quickly whenever you say something like “it works fine most of the time, so our plan going forward is to dismiss any discussion of when it breaks and why”.

A bridge failure feels like the right order of magnitude for the error rate and the misery that AI has already quietly caused with biased models, where one in a million resumes or loan applications gets thrown out. And a nuclear bomb would actually kill fewer people than a full-on economic meltdown. But I’m sure no one is using LLMs in finance at all, right?

It’s so arrogant and naive to ignore failure modes that we don’t even understand yet; at least bridges and steel have specs. Software “engineering” was always a very suspect name for the discipline, but whatever claim we had to it is now weaker than ever.

wavemode 6 days ago

It's not a goalpost move. As I've already said, I have the exact same problem with this article as I had with the previous one. My goalposts haven't moved, and my standards haven't changed. Just provide the data! How hard can it be? Why leave it out in the first place?