>> This is the smoothest Tom Sawyer move I've ever seen IRL
That made me laugh. True, but not really the motivation. I honestly don't think LLMs can code significant real-world things yet, and I'm not sure how else to prove that, since they can code some interesting things. All the talk about putting programmers out of work has me calling BS, but also thinking "show me". This task seems like a good combination: simple requirements, not much documentation, a real-world existing problem, non-trivial code size, limited scope.
I agree. I tried something similar: a conversion of a simple PHP library from one system to another. It was only about 500 LOC, but Gemini 2.5 completely failed around line 300, and even then its output contained straight-up hallucinations, half-baked additions, wrong namespaces for dependencies, badly indented code, and other PSR style violations. Worse, it also changed working code and broke it.
Try asking it to generate a high-level plan of how it's going to do the conversion first, then have it generate function definitions for the new functions, then tests for those functions, and only then the actual implementations, feeding it the output of the tests as it goes.
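A minimal sketch of that staged workflow, in C++ for concreteness (the thread's example was PHP; `normalize_namespace` is a hypothetical function, not something from the actual library):

```cpp
#include <cassert>
#include <string>

// Step 2: declarations only -- the model commits to an API surface first.
std::string normalize_namespace(const std::string& fqcn);

// Step 3: tests are written against the declarations before any bodies
// exist, so each later implementation can be checked mechanically.
void test_normalize_namespace() {
  assert(normalize_namespace("\\Vendor\\Pkg\\Foo") == "Vendor\\Pkg\\Foo");
  assert(normalize_namespace("Vendor\\Pkg\\Foo") == "Vendor\\Pkg\\Foo");
}

// Step 4: only now is the body filled in, and the tests are re-run.
std::string normalize_namespace(const std::string& fqcn) {
  return fqcn.rfind('\\', 0) == 0 ? fqcn.substr(1) : fqcn;
}

int main() { test_normalize_namespace(); }
```

The point is that steps 2 and 3 pin down a contract before any implementation exists, so step 4 can be looped with the test output until everything passes.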
It's not like people just one-shot a whole module of code, why would LLMs?
> It's not like people just one-shot a whole module of code, why would LLMs?
For conversions between languages or libraries, you often do just one-shot it, writing or modifying code from start to end in order.
I remember, 15 years ago, taking a 10,000-line Java code base and porting it to JavaScript mostly like this, with only a few areas requiring more involved, non-sequential editing.
I think this shows how the approach LLMs take is wrong. For us it's easy because we simply iterate over every function with the simple goal of translating it, while staying careful enough to take notes of anything that might be relevant for a higher-level change later.
Maybe the mistake is treating LLMs as capable people instead of what they are: a simple but optimised neuron soup tuned for text.
So, you didn't test it until the end? Or did you have to build it in such a way that it was partially testable?
One of the nifty things about the target being JavaScript was that I didn't have to finish it before I could run it; it was the sort of big library where typical code wouldn't use most of the functionality. It was audio stuff, so there were a couple of core files that needed more careful porting (from whatever Java had to Mozilla's Audio Data API, which was a fairly good match), and the rest was fairly routine work that could be done gradually, as I needed each piece or just when I didn't have anything better to focus on. Honestly, one of the biggest problems was forgetting to prefix instance properties with `this.`
I know many people who can and will one-shot a rewrite of 500 LOC. In my world, 500 LOC is about the length of a single function. I don't understand why we should be talking about generating a high level plan with multiple tests etc. for a single function.
And I don't think this is uncommon. Just a random example from GitHub: this file is 1,800 LOC and 4 functions. It implements one very specific thing that's part of a broader library. (I have no affiliation with this code.)
https://github.com/elemental/Elemental/blob/master/src/optim...
> I don't understand why we should be talking about generating a high level plan with multiple tests etc. for a single function.
You don't have to; you can write it by hand. I thought we were talking about how we can make computers write code instead of humans, but it seems we're instead trying to prove that LLMs aren't useful.
No, it's simply being demonstrated that they're not as useful as some claim.
By saying "why do I have to use a specific technique, instead of naively, to get what I want"?
"Why do I have to put in more work to use this tool vs. not using it?"
Which is exactly what I said here:
If we have to break the problem into tiny pieces that can be individually tested in order for LLMs to be useful, I think it clearly limits LLM usability to a particular niche of programming.
> If we have to break the problem into tiny pieces that can be individually tested
Isn't this something we should have been doing for decades of our own volition?
Separation of concerns, the single-responsibility principle, all that talk and the trend of TDD, or at the very least having good test coverage, or writing code that can at least be debugged without going insane (no Heisenbugs; maybe some intermediate variables to stop on in a debugger instead of endless chained streams, though opinions are split; at the very least, code that is readable and not three pages per function).
Because when I see long bits of code that I have to change without breaking anything around them, I don't feel confident doing that even in a codebase I'm familiar with, much less trusting an AI with it (at that point it might be a "Hail Mary", a last-ditch effort hoping the AI can find method in the madness before I have to get my own hands dirty and make my hair grayer).
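As a tiny illustration of the "intermediate variables to stop on" point, here are two equivalent C++ functions (names and example invented):

```cpp
#include <algorithm>
#include <functional>
#include <iterator>
#include <numeric>
#include <vector>

// Chained: compact, but there is nothing to break on mid-computation.
int sum_even_squares_chained(const std::vector<int>& xs) {
  return std::transform_reduce(xs.begin(), xs.end(), 0, std::plus<>{},
                               [](int x) { return x % 2 == 0 ? x * x : 0; });
}

// Stepwise: same result, but every intermediate is a named variable a
// debugger can stop on and inspect.
int sum_even_squares_stepwise(const std::vector<int>& xs) {
  std::vector<int> evens;
  std::copy_if(xs.begin(), xs.end(), std::back_inserter(evens),
               [](int x) { return x % 2 == 0; });

  std::vector<int> squares(evens.size());
  std::transform(evens.begin(), evens.end(), squares.begin(),
                 [](int x) { return x * x; });

  return std::accumulate(squares.begin(), squares.end(), 0);
}
```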
Did you paste it into the chat or did you use it with a coding agent like Cline?
I am majorly impressed with the combination of VSCode + Cline + Gemini.
Today I had it duplicate an ESP32 program, converting it from UDP communication to TCP.
It first copied the file (funnily enough, by writing it out again instead of just running a straight `cp`). Then it changed all the headers and declarations. Then, in a third step, it changed one bigger function. And in the last step it changed some smaller functions.
And it reasoned exactly that way ("Let's start with this first...", "Let's now do this...") until it was done.
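For flavor, a hedged sketch of the kind of UDP-to-TCP change described, using Arduino-style ESP32 APIs (the actual program isn't shown in this thread; the host and port here are invented):

```cpp
#include <WiFi.h>
#include <WiFiUdp.h>

// Before: connectionless UDP send.
WiFiUDP udp;
void send_udp(const uint8_t* buf, size_t len) {
  udp.beginPacket("192.168.1.10", 4210);  // invented host/port
  udp.write(buf, len);
  udp.endPacket();
}

// After: connection-oriented TCP send, reconnecting if the link dropped.
WiFiClient tcp;
void send_tcp(const uint8_t* buf, size_t len) {
  if (!tcp.connected()) tcp.connect("192.168.1.10", 4210);
  tcp.write(buf, len);
}
```

The "headers and declarations" step the agent did first maps to swapping `WiFiUDP` for `WiFiClient`; the "bigger function" step maps to reworking the send/receive paths like this.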
I've just moved from the expensive Claude Code to Cursor and Gemini. What are your thoughts on Cursor vs. Cline?
Thank you
Programmers who code interesting things likely shouldn’t worry. The legions who code voluminous but shallow corporate apps and glue might be more concerned.
> I honestly don't think LLMs can code significant real-world things yet and I'm not sure how else to prove that since they can code some interesting things
In my experience, it seems to depend on what they've been trained on.
They can do some pretty amazing stuff in Python, but fail at even the most basic things in ARM64 assembly.
These models have probably not seen a lot of GTK3/4 code, and maybe not even a single example of porting between the two versions.
I wonder if fine-tuning could help with that.
Yes, very much agree, an interesting benchmark. Particularly because it's in a “tier 2” framework (gtkmm) in terms of the amount of code available to train an LLM on. That tests the LLM's ability to plan and problem-solve, compared with, say, “convert to the latest version of React”, where the LLM has access to tens of thousands (more?) of similar ports in its training dataset and can lean more on pattern matching.
>> Particularly because it’s in a “tier 2” framework (gtkmm) in terms of amount of code available to train an LLM on.
I asked GPT-4 to write an empty GTK4 app in C++. I asked for a menu bar with File, Edit, View at the top and two GL drawing areas separated by a spacer. It produced what looked like usable code, with a couple of lines I suspected were out of place. I did not try to compile it, so I don't know whether it was a hallucination, but it did seem to know about gtkmm 4.
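For reference, a rough sketch of what such a gtkmm-4.0 skeleton might look like (not GPT-4's actual output, and, like the original, untested; `DemoWindow` and the app ID are made up, and a `Gtk::Paned` stands in for the spacer):

```cpp
#include <gtkmm.h>

class DemoWindow : public Gtk::ApplicationWindow {
public:
  DemoWindow() {
    set_title("Demo");
    set_default_size(800, 600);
    set_show_menubar(true); // shows the menu model attached to the app

    // Two GL drawing areas separated by a draggable divider.
    m_paned.set_start_child(m_left);
    m_paned.set_end_child(m_right);
    set_child(m_paned);
  }

private:
  Gtk::Paned m_paned{Gtk::Orientation::HORIZONTAL};
  Gtk::GLArea m_left, m_right;
};

int main(int argc, char* argv[]) {
  auto app = Gtk::Application::create("org.example.demo");

  // The menu bar is a Gio::Menu model set on the application at startup.
  app->signal_startup().connect([app] {
    auto bar = Gio::Menu::create();
    bar->append_submenu("File", Gio::Menu::create());
    bar->append_submenu("Edit", Gio::Menu::create());
    bar->append_submenu("View", Gio::Menu::create());
    app->set_menubar(bar);
  });

  return app->make_window_and_run<DemoWindow>(argc, argv);
}
```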
It definitely knows what GTK4 is: when it freaked out on me and lost the code, it was using all gtkmm-4.0 headers and had the compiler error count down to 10 (most likely with tons of logic errors, but who knows).
But LLM performance varies (and this is a huge critique!) not just with what they theoretically know, but with how, erm, cross-linked that knowledge is with everything else, and that requires lots of training data on the topic.
Metaphorically, I think this is a little like the difference, for humans doing math, between being able to list and define the techniques for solving integrals and being able to fluidly apply them without error.
I think a big and very valid critique of LLMs (compared to humans) is that they are stronger at "memory" than at reasoning. They use their vast memory as a crutch to hide the weaknesses in their reasoning. This makes a task like "convert from gtkmm3 to gtkmm4" both challenging AND a very good benchmark of what real programmers are able to do.
I suspect that if we gave it a similarly sized 2 kLOC conversion problem with a popular web framework in TS or JS, it would one-shot it. But again, it's "cheating" to do this: it's leveraging having read a zillion such conversions by humans and what they did in them.
>All the talk about putting programmers out of work
I keep thinking it may be specifically web programmers, given that a lot of the web is essentially CRUD / has the same functionality.