> It would be similar to if I claimed that an LLM is an expert doctor, but in my data I've filtered out all of the times it gave incorrect medical advice.
Computationally it's trivial to detect illegal moves, so it's nothing like filtering out incorrect medical advice.
> Computationally it's trivial to detect illegal moves
You're strictly correct, but the rules for chess are infamously hard to implement (as anyone who's tried to write a chess program will know), leading to minor bugs in a lot of chess programs.
For example, there's this old myth about vertical castling being allowed due to ambiguity in the ruleset: https://www.futilitycloset.com/2009/12/11/outside-the-box/ (Probably not historically accurate).
If you move beyond legal positions into who wins when one side flags, the rules state that the other side should be awarded a victory if checkmate was possible with any legal sequence of moves. This is so hard to check that no chess program tries to implement it, instead using simpler rules to achieve a very similar but slightly more conservative result.
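For what it's worth, here's a minimal sketch (Python, with made-up helper names, not taken from any real engine) of the kind of simpler, more conservative check a program might substitute for the "any legal sequence of moves" test: just look at the material left on the side that still has time.

```python
# Toy sketch, not any particular engine's implementation: decide the result
# when one side runs out of time, using a crude "insufficient mating material"
# heuristic instead of FIDE's "checkmate possible by any legal sequence" rule.

def can_possibly_mate(pieces: list[str]) -> bool:
    """Conservative material check for the side that still has time on the clock.

    `pieces` lists that side's pieces besides the king, e.g. ["N"] or ["R", "P"].
    A lone king, or king plus a single minor piece, is treated as unable to mate.
    (Stricter than the real rule: e.g. K+N vs K+P can still be helpmated.)
    """
    if not pieces:
        return False                      # bare king can never deliver mate
    if len(pieces) == 1 and pieces[0] in ("N", "B"):
        return False                      # single minor piece: treated as no mate
    return True                           # anything more: assume mate is possible


def result_on_flag(flagging_side: str, opponent_pieces: list[str]) -> str:
    """Result when `flagging_side` ('white' or 'black') runs out of time."""
    if can_possibly_mate(opponent_pieces):
        return "black wins" if flagging_side == "white" else "white wins"
    return "draw"                         # opponent cannot mate, so a flag is a draw


print(result_on_flag("white", ["R", "P"]))  # black wins
print(result_on_flag("white", ["B"]))       # draw under this heuristic
```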
That link was new to me, thanks! However: I wrote a chess program myself (nothing big, hobby level) and I would not call it hard to implement. Just harder than someone might assume initially. But in the end, it is one of the simpler simulations/algorithms I have done. It is just the state of the board, the state of the game (how many turns, castling rights, past positions for the repetition rule, ...) and picking one rule set if one really wants to be exact.
(thinking about which rule set is correct would not be meaningful in my opinion - chess is a social construct, with only parts of it being well defined. I would not bother about the rest, at least not when implementing it)
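As a rough illustration of how small that state actually is, here is a sketch in Python (the field names and layout are my own, not from any particular engine):

```python
# Rough sketch of the game state described above; names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class GameState:
    board: list[str]                       # 64 squares, e.g. "wP", "bK", "" for empty
    side_to_move: str = "w"
    castling_rights: str = "KQkq"          # which castlings are still available
    en_passant_square: int | None = None   # target square index, if any
    halfmove_clock: int = 0                # for the fifty-move rule
    fullmove_number: int = 1
    position_history: list[str] = field(default_factory=list)  # for threefold repetition
```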
By the way: I read "Computationally it's trivial" as more along the lines of "it has been done before, it is efficient to compute, one just has to do it" versus "this is new territory, one needs to come up with how to wire up the LLM output with an SMT solver, and we do not even know if/how it will work."
> You're strictly correct, but the rules for chess are infamously hard to implement
Come on. Yeah, they're not trivial, but they've been implemented numerous times. There have been chess programs for almost as long as there have been computers. Checking legal moves is a _solved problem_.
Detecting valid medical advice is not. The two are not even remotely comparable.
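To make the "solved problem" point concrete: with an off-the-shelf library like python-chess, a legality check is a couple of lines (sketch assuming the library is installed):

```python
# Minimal sketch using the python-chess library: check and play legal moves.
import chess

board = chess.Board()                      # standard starting position
move = chess.Move.from_uci("e2e4")

print(board.is_legal(move))                               # True
print(chess.Move.from_uci("e2e5") in board.legal_moves)   # False

board.push(move)                           # play the move and continue from there
```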
> Detecting valid medical advice is not. The two are not even remotely comparable.
Uh? Where exactly did I signal my support for LLMs giving medical advice?
We implemented a whole chess engine in Lisp during third year; implementing the legal move/state checking was actually really trivial.
I got a kick out of that link. Had certainly never heard of "vertical castling" previously.
As I wrote in another comment - you can write scripts that correct bad math, too. But we don't use that to claim that LLMs have a good understanding of math.
I'd say that's because we don't understand what we mean by "understand".
Hardware that accurately performs maths faster than all of humanity combined is so cheap as to be disposable, but I've yet to see anyone claim that a Pi Zero has "understanding" of anything.
An LLM can display the viva voce approach that Turing suggested[0], and do it well. Ironically for all those now talking about "stochastic parrots", the passage reads:
"""… The game (with the player B omitted) is frequently used in practice under the name of viva voce to discover whether some one really understands something or has ‘learnt it parrot fashion’. …"
Showing that not much has changed on the philosophy of this topic since it was invented.
[0] https://academic.oup.com/mind/article/LIX/236/433/986238
> I'd say that's because we don't understand what we mean by "understand".
I'll have a stab at it. The idea of LLMs 'understanding' maths is that, once having been trained on a set of maths-related material, the LLM will be able to generalise to solve other maths problems that it hasn't encountered before. If an LLM sees all the multiplication tables up to 10x10, and then is correctly able to compute 23x31, we might surmise that it 'understands' multiplication - i.e. that it has built some generalised internal representation of what multiplication is, rather than just memorising all possible answers. Obviously we don't expect generalisation from a Pi Zero without specifically being coded for it, because it's a fixed function piece of hardware.
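One way to make that testable: hold out products the model never saw and check whether it still gets them right. A toy sketch of such an eval loop (the `ask_model` function is a placeholder for however you query the LLM, not a real API):

```python
# Toy generalisation check: held-out multiplication problems beyond the 10x10 tables.
import random

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM call here")

seen = {(a, b) for a in range(1, 11) for b in range(1, 11)}   # the 10x10 tables
held_out = [(a, b) for a in range(11, 100) for b in range(11, 100)
            if (a, b) not in seen]

sample = random.sample(held_out, 50)
correct = 0
for a, b in sample:
    answer = ask_model(f"What is {a} times {b}? Reply with just the number.")
    if answer.strip() == str(a * b):
        correct += 1

print(f"held-out accuracy: {correct}/{len(sample)}")
```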
Personally I think this is highly unlikely given that maths and natural language are very different things, and being good at the latter does not bear any relationship to being good at the former (just ask anyone who struggles with maths - plenty of people do!). Not to mention that it's also much easier to test for understanding of maths because there is (usually!) a single correct answer regardless of how convoluted the problem - compared to natural language where imitation and understanding are much more difficult to tell apart.
I don't know. I have talked to a few math professors, and they think LLMs are as good as a lot of their peers when it comes to hallucinations and being able to discuss ideas on very niche topics, as long as the context is fed in. If Tao is calling some models "a mediocre, but not completely incompetent [...] graduate student", then it seems to me they understand math to some degree.
Tao said that about a model brainstorming ideas that might be useful, not explaining complex ideas or generating new ideas or selecting a correct idea from a list of brainstormed ideas. Not replacing a human.
> Not replacing a human.
Obviously not, but that is tangential to this discussion, I think. A hammer might be a useful tool in certain situations, and surely it does not replace a human (but it might make a human in those situations more productive, compared to a human without a hammer).
> generating new ideas
Is brainstorming not an instance of generating new ideas? I would strongly argue so. And whether the LLM does "understand" (or whatever ill-defined, ill-measurable concept one wants to use here) anything about the ideas it produces, and how they might be novel - that is not important either.
If we assume that Tao is adequately assessing the situation and truthfully reporting his findings, then LLMs can, at the current state, at least occasionally be useful in generating new ideas, at least in mathematics.
Being as good as a professor at confidently hallucinating nonsense when you don't know the answer is a very high level skill.
Actually, LLMs do call scripts that correct bad math, and have gotten progressively better because of it. It's another special case example.
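A toy sketch of that pattern (nothing vendor-specific; the `CALC(...)` convention and `ask_model` are invented for illustration): the model is asked to emit arithmetic as a marker, and a small script evaluates it and substitutes the exact result.

```python
# Toy illustration of the "LLM calls a script for the arithmetic" pattern.
import re

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM call here")

def run_with_calculator(prompt: str) -> str:
    draft = ask_model(prompt + "\nWrite any arithmetic as CALC(<expression>).")

    def evaluate(match: re.Match) -> str:
        expr = match.group(1)
        # Only digits, whitespace and basic operators allowed (no parentheses
        # inside expressions in this toy) before handing the string to eval.
        if not re.fullmatch(r"[\d\s+\-*/.]+", expr):
            return match.group(0)          # leave anything suspicious untouched
        return str(eval(expr))             # fine for a toy; don't do this in production

    return re.sub(r"CALC\(([^)]*)\)", evaluate, draft)
```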