He is testing several models, some of which cannot reliably output legal moves. That's different from saying that all models, including the one he thinks understands chess, can't generate a legal move in 10 tries.
3.5-turbo-instruct's illegal move rate is about 5 or fewer in 8205.
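For a sense of scale, here's a back-of-the-envelope sketch using that figure. It treats 5 in 8205 as the per-attempt illegal rate and assumes retries are independent, which is my simplification and almost certainly not true of a real model (a model that fails once on a position is more likely to fail again on it).

```python
# Figures quoted in the comment: about 5 illegal moves in 8205.
illegal = 5
total = 8205

# Per-attempt illegal-move rate.
rate = illegal / total
print(f"illegal move rate: {rate:.4%}")  # prints "illegal move rate: 0.0609%"

# ASSUMPTION: each retry is an independent draw with the same rate.
# Under that assumption, the chance that 10 tries in a row are all illegal:
tries = 10
p_ten_failures = rate ** tries
print(f"P(all {tries} tries illegal): {p_ten_failures:.2e}")
```

Even if the independence assumption is off by many orders of magnitude, 10 consecutive illegal moves from that model would be extraordinarily rare (roughly 7e-33 under these assumptions).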
I also wonder what kinds of invalid moves they are. There's "you can't move your knight to j9, that's off the board", "there's already a piece there" and "actually that would leave you in check".
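To make the distinction concrete, here's a rough sketch of how the first two kinds could be separated from the third. The function names and the `own_pieces` set are hypothetical, made up for illustration, and this is not how any of the tested setups actually validate moves. The "leaves you in check" case needs a real move generator, so the sketch only flags it as needing a deeper check.

```python
FILES = "abcdefgh"
RANKS = "12345678"

def on_board(square: str) -> bool:
    """True if `square` names one of the 64 squares, e.g. "e4"."""
    return len(square) == 2 and square[0] in FILES and square[1] in RANKS

def classify_destination(square: str, own_pieces: set) -> str:
    """Cheap checks first; anything that passes still needs full legality testing.

    `own_pieces` is a hypothetical set of squares occupied by the mover's own pieces.
    """
    if not on_board(square):
        return "off the board"
    if square in own_pieces:
        return "occupied by own piece"
    # Piece movement rules, pins, and "would leave you in check" are NOT handled here.
    return "needs full legality check"

print(classify_destination("j9", {"g1"}))  # off the board
print(classify_destination("e2", {"e2"}))  # occupied by own piece
print(classify_destination("f3", {"g1"}))  # needs full legality check
```

The first two categories are syntax-level or board-lookup mistakes, while the third requires actually understanding the position, so knowing the breakdown would say a lot about what the model is getting wrong.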
I think it would also be significantly harder to play chess if you heard a sequence of moves over the phone and had to reply with a follow-up move, with no space or time to think or talk through the moves.