Every test task, including the coding test, is a greenfield project. Everything I would consider using LLMs for is not. Like, I would always need it to do some change or fix on a (large) existing project. Hell, even the examples that were generated would likely need subsequent alterations (ten times more effort goes into maintaining a line of code than writing it).
So these tests are meaningless to me as a measure of how useful these models are. Great for comparing the models with each other, but it would be interesting to include some tests with more realistic work.
Indeed, I'm surprised to see that it has been in the top 10 on HN today. I thought everyone already realized that examples like "create a flappy bird game" are not realistic and do not reflect the actual usefulness of the model; very few professionals in the industry endlessly create flappy bird games for a living.