>> Particularly because it’s in a “tier 2” framework (gtkmm) in terms of amount of code available to train an LLM on.
I asked GPT-4 to write an empty GTK4 app in C++: a menu bar with File, Edit, and View at the top, and two GL drawing areas separated by a spacer. It produced what looked like usable code, with a couple of lines I suspected were out of place. I didn't try to compile it, so I don't know whether it was a hallucination, but it did seem to know about gtkmm 4.
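For the curious, here's a minimal gtkmm-4.0 sketch of roughly what that prompt describes. The class name and application ID are mine, and (like GPT-4's version) it's untested:

    #include <gtkmm.h>

    class AppWindow : public Gtk::ApplicationWindow {
    public:
      AppWindow() {
        set_title("Empty GTK4 App");
        set_default_size(800, 600);

        // GTK4 menu bars are driven by a Gio::MenuModel.
        auto menu = Gio::Menu::create();
        auto file_menu = Gio::Menu::create();
        file_menu->append("Quit", "app.quit"); // insensitive until the action is registered
        menu->append_submenu("File", file_menu);
        menu->append_submenu("Edit", Gio::Menu::create());
        menu->append_submenu("View", Gio::Menu::create());
        m_menubar.set_menu_model(menu);

        // Two GL drawing areas with a vertical separator as the spacer.
        m_gl_left.set_expand(true);
        m_gl_right.set_expand(true);
        m_hbox.set_expand(true);
        m_hbox.append(m_gl_left);
        m_hbox.append(m_separator);
        m_hbox.append(m_gl_right);

        m_vbox.append(m_menubar);
        m_vbox.append(m_hbox);
        set_child(m_vbox);
      }

    private:
      Gtk::Box m_vbox{Gtk::Orientation::VERTICAL};
      Gtk::Box m_hbox{Gtk::Orientation::HORIZONTAL};
      Gtk::PopoverMenuBar m_menubar;
      Gtk::Separator m_separator{Gtk::Orientation::VERTICAL};
      Gtk::GLArea m_gl_left, m_gl_right;
    };

    int main(int argc, char* argv[]) {
      auto app = Gtk::Application::create("org.example.emptyapp");
      return app->make_window_and_run<AppWindow>(argc, argv);
    }

(Build against pkg-config's gtkmm-4.0.)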
It definitely knows what GTK4 is. When it freaked out on me and lost the code, it was using gtkmm-4.0 headers throughout and had gotten the compiler error count down to 10 (most likely with tons of logic errors left, but who knows).
But LLM performance varies (and this is a huge critique!) not just with what they theoretically know, but with how, erm, cross-linked that knowledge is with everything else, and that requires lots of training data on the topic.
Metaphorically, I think this is a little like the difference, for humans doing math, between being able to list and define the techniques for solving integrals vs. being able to fluidly apply them without error.
I think a big and very valid critique of LLMs (compared to humans) is that they are stronger at "memory" than at reasoning. They use their vast memory as a crutch to hide the weaknesses in their reasoning. This makes tasks like "convert from gtkmm3 to gtkmm4" both challenging AND very good benchmarks of what real programmers are able to do.
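For anyone who hasn't touched gtkmm, here's the flavor of change such a port involves; a hypothetical, untested sketch with the old gtkmm3 spelling of each call in a comment:

    #include <gtkmm.h>

    class PortedWindow : public Gtk::Window {
    public:
      PortedWindow() {
        // gtkmm3: add(m_box);  GTK4 windows hold a single child.
        set_child(m_box);
        // gtkmm3: m_box.pack_start(m_button, Gtk::PACK_EXPAND_WIDGET);
        m_button.set_expand(true);
        m_box.append(m_button);
        // gtkmm3: show_all_children();  gone in GTK4, where widgets
        // are visible by default.
      }
    private:
      Gtk::Box m_box{Gtk::Orientation::VERTICAL};
      Gtk::Button m_button{"Click"};
    };

    int main(int argc, char* argv[]) {
      // gtkmm3: Gtk::Application::create(argc, argv, "org.example.port");
      auto app = Gtk::Application::create("org.example.port");
      // gtkmm3: PortedWindow win; return app->run(win);
      return app->make_window_and_run<PortedWindow>(argc, argv);
    }

None of these renames is hard on its own; the catch is that there are a lot of them, plus behavioral changes, so you can't pattern-match your way through without actually knowing both APIs.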
I suspect that if we gave it a similarly sized 2kloc conversion problem in a popular web framework in TS or JS, it would one-shot it. But again, it's "cheating" to do this: it's leveraging having read a zillion such conversions by humans and what they did.