>Theory 1: Large enough base models are good at chess, but this doesn’t persist through instruction tuning to chat models.
I lean mostly towards this, and I'd also point at the chess notation itself: I'm not sure it survives tokenization intact unless it's very carefully processed (see the sketch below).
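As a quick illustration (a minimal sketch, assuming the `tiktoken` package and GPT-4's `cl100k_base` encoding), you can dump how a PGN fragment gets split; token boundaries don't necessarily line up with move boundaries:

```python
import tiktoken

# Tokenizer used by GPT-4-family models.
enc = tiktoken.get_encoding("cl100k_base")

# A short PGN fragment in standard algebraic notation.
pgn = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6"

# Print each token id next to the exact bytes of text it covers,
# so you can see where the tokenizer cuts the move sequence.
for tok in enc.encode(pgn):
    print(tok, enc.decode_single_token_bytes(tok))
```

If a move like `Nf3` ends up split across multiple tokens while another move is a single token, the model sees an inconsistent view of the game, which could plausibly hurt play.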
It's like designing an LLM just for predicting protein sequences, where the exact ordering of the sequence matters. The base training data might contain plenty of chess games, but I don't think the intention is for that capability to carry through to the chat model.
This makes me wonder what scenarios would be unlocked if OpenAI gave access to gpt4-instruct.
I wonder if they avoid that because of the potential for negative press from the outputs of a more "raw" model.