Here is a real coding problem that I might be willing to make a cash-prize contest for. We'd need to nail down some rules. I'd be shocked if any LLM can do this:
https://github.com/solvespace/solvespace/issues/1414
Make a GTK 4 version of Solvespace. We have a single C++ file for each platform - Windows, Mac, and Linux-GTK3. There is also a QT version on an unmerged branch for reference. The GTK3 file is under 2KLOC. You do not need to create a new version, just rewrite the GTK3 Linux version to GTK4. You may either ask it to port what's there or create the new one from scratch.
If you want to do this for free to prove how great the AI is, please document the entire session. Heck, make a YouTube video of it. The final test is whether I accept the PR or not - and I WANT this ticket done.
I'm not going to hold my breath.
This is the smoothest Tom Sawyer move I've ever seen IRL. I wonder how many people are now grinding out your GTK4 port with their favorite LLM/system to see if it can. It'll be interesting to see if anyone gets something working with current-gen LLMs.
UPDATE: naive (just fed it your description verbatim) Cline + Claude 3.7 was a total wipeout. It looked like it was making progress, then freaked out, deleted 3/4 of its port, and never recovered.
>> This is the smoothest tom sawyer move I've ever seen IRL
That made me laugh. True, but not really the motivation. I honestly don't think LLMs can code significant real-world things yet and I'm not sure how else to prove that since they can code some interesting things. All the talk about putting programmers out of work has me calling BS but also thinking "show me". This task seems like a good combination of simple requirements, not much documentation, real world existing problem, non-trivial code size, limited scope.
I agree. I tried something similar: a conversion of a simple PHP library from one system to another. It was only about 500 LOC, but Gemini 2.5 completely failed around line 300, and even then its output contained straight-up hallucinations, half-baked additions, wrong namespaces for dependencies, badly indented code, and other PSR style violations. Worse, it also changed working code and broke it.
Try asking it to generate a high-level plan of how it's going to do the conversion first, then to generate function definitions for the new functions, then have it generate tests for the new functions, then actually write them, while giving it the output of the tests.
It's not like people just one-shot a whole module of code, why would LLMs?
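A minimal sketch of that staged workflow as a driver loop. Here `ask_llm` and `run_tests` are hypothetical stand-ins for whatever model API and test harness you use; nothing below is a real library call:

```python
def convert_module(ask_llm, run_tests, source, max_rounds=5):
    """Staged conversion: plan, then stubs, then tests, then implementation,
    feeding test output back to the model until the tests pass."""
    plan = ask_llm("Outline a high-level plan for converting this module:\n" + source)
    stubs = ask_llm("Following this plan, emit only the new function signatures:\n" + plan)
    tests = ask_llm("Write tests for these signatures:\n" + stubs)
    code = ask_llm("Implement these signatures:\n" + stubs)
    for _ in range(max_rounds):  # bounded retry loop: don't let it thrash forever
        ok, report = run_tests(code, tests)
        if ok:
            return code
        code = ask_llm("These tests failed:\n" + report + "\nFix the code:\n" + code)
    raise RuntimeError("conversion did not converge")
```

The point is that the model never sees "convert this whole module" as a single step; each prompt has one job, and the only feedback channel is concrete test output.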
> It's not like people just one-shot a whole module of code, why would LLMs?
For conversions between languages or libraries, you often do just one-shot it, writing or modifying code from start to end in order.
I remember 15 years ago taking a 10,000 line Java code base and porting it to JavaScript mostly like this, with only a few areas requiring a bit more involved and non-sequential editing.
I think this shows how the approach LLMs take is wrong. For us it's easy because we simply iterate over every function with a simple translation prompt, while staying careful enough to take notes of whatever may be relevant for a higher-level change if necessary.
Maybe the mistake is treating LLMs as capable people instead of as a simple but optimised neuron soup tuned for text.
So, you didn't test it until the end? Or did you have to build it in such a way that it was partially testable?
One of the nifty things about the target being JavaScript was that I didn’t have to finish it before I could run it—it was the sort of big library where typical code wouldn’t use most of the functionality. It was audio stuff, so there were a couple of core files that needed more careful porting (from whatever in Java to Mozilla’s Audio Data API, which was a fairly good match), and then the rest was fairly routine that could be done gradually, as I needed them or just when I didn’t have anything better to focus on. Honestly, one of the biggest problems was forgetting to prefix instance properties with `this.`
I know many people who can and will one-shot a rewrite of 500 LOC. In my world, 500 LOC is about the length of a single function. I don't understand why we should be talking about generating a high level plan with multiple tests etc. for a single function.
And I don't think this is uncommon. Just a random example from Github, this file is 1800 LOC and 4 functions. It implements one very specific thing that's part of a broader library. (I have no affiliation with this code.)
https://github.com/elemental/Elemental/blob/master/src/optim...
> I don't understand why we should be talking about generating a high level plan with multiple tests etc. for a single function.
You don't have to, you can write it by hand. I thought we were talking about how we can make computers write code, instead of humans, but it seems that we're trying to prove that LLMs aren't useful instead.
No, it's simply being demonstrated that they're not as useful as some claim.
By saying "why do I have to use a specific technique, instead of naively, to get what I want"?
"Why do I have to put in more work to use this tool vs. not using it?"
Which is exactly what I said here:
If we have to break the problem into tiny pieces that can be individually tested in order for LLMs to be useful, I think it clearly limits LLM usability to a particular niche of programming.
> If we have to break the problem into tiny pieces that can be individually tested
Isn't this something we should have been doing for decades of our own volition?
Separation of concerns, the single responsibility principle, all of that talk and the trend of TDD, or at the very least having good test coverage, or writing code that can at least be debugged without going insane (no Heisenbugs; maybe some intermediate variables to stop on in a debugger instead of just endless chained streams, though opinions are split; at least code that is readable and not three pages per function).
Because when I see long bits of code that I have to change without breaking anything surrounding them, I don't feel confident in doing that even if it's a codebase I'm familiar with, much less trust an AI on it (at that point it might be a "Hail Mary", a last ditch effort in hoping that at least the AI can find method in the madness before I have to get my own hands dirty and make my hair more gray).
Did you paste it into the chat or did you use it with a coding agent like Cline?
I am majorly impressed with the combination VSCode + Cline + Gemini
Today I had it duplicate an ESP32 program, converting it from UDP communication to TCP.
It first copied the file (funnily enough, by writing it again instead of just a straight cp). Then it changed all the headers and declarations. Then, in a third step, it changed one bigger function. And in the last step it changed some smaller functions.
And it reasoned exactly that way - "Let's start with this first... Let's now do this..." - until it was done.
I've just moved from the expensive Claude Code to Cursor and Gemini - what are your thoughts on Cursor vs Cline?
Thank you
Programmers who code interesting things likely shouldn’t worry. The legions who code voluminous but shallow corporate apps and glue might be more concerned.
> I honestly don't think LLMs can code significant real-world things yet and I'm not sure how else to prove that since they can code some interesting things
In my experience it seems like it depends on what they’ve been trained on
They can do some pretty amazing stuff in python, but fail even at the most basic things in arm64 assembly
These models have probably not seen a lot of GTK3/4 code and maybe not even a single example of porting between the two versions
I wonder if finetuning could help with that
Yes, very much agreed - an interesting benchmark. Particularly because it's in a "tier 2" framework (gtkmm) in terms of the amount of code available to train an LLM on. That tests the LLM's ability to plan and problem-solve, compared with, say, "convert to the latest version of React", where the LLM has access to tens of thousands (more?) of similar ports in its training dataset and merely has to pattern-match.
>> Particularly because it’s in a “tier 2” framework (gtkmm) in terms of amount of code available to train an LLM on.
I asked GPT-4 to write an empty GTK4 app in C++. I asked for a menu bar with File, Edit, View at the top and two GL drawing areas separated by a spacer. It produced what looked like usable code, with a couple of lines I suspected were out of place. I did not try to compile it, so I don't know if it was a hallucination, but it did seem to know about gtkmm 4.
It definitely knows what GTK4 is, when it freaked out on me and lost the code, it was using all gtkmm-4.0 headers, and had the compiler error count down to 10 (most likely with tons of logic errors, but who knows).
But LLM performance varies (and this is a huge critique!) not just with what they theoretically know, but with how, erm, cross-linked it is with everything else, and that requires lots of training data on the topic.
Metaphorically, I think this is a little like the difference for humans in math between being able to list+define techniques to solve integrals vs being able to fluidly apply them without error.
I think a big and very valid critique of LLMs (compared to humans) is that they are stronger at "memory" than reasoning. They use their vast memory as a crutch to hide the weaknesses in their reasoning. This makes benchmarks like "convert from gtkmm3 to gtkmm4" both challenging AND very good benchmarks of what real programmers are able to do.
I suspect if we gave it a similarly sized 2 KLOC conversion problem with a popular web framework in TS or JS, it would one-shot it. But again, it's "cheating" to do this - it's leveraging having read a zillion conversions by humans and what they did.
>All the talk about putting programmers out of work
I keep thinking maybe specifically web programmers, given that a lot of the web is essentially CRUD / has the same function.
I suspect it probably won't work, although that's not necessarily because an LLM architecture could never perform this type of work, but rather because it works best when the training set contains an inordinate amount of sample data. I'm actually quite shocked at what they can do in TypeScript and JavaScript, but they're definitely a bit less "sharp" when it comes to stuff outside of that zone in my experience.
The ridiculous amount of data required to get here hints that there is something wrong in my opinion.
I'm not sure if we're totally on the same page, but I understand where you're coming from here. Everyone keeps talking about how transformational these models are, but when push comes to shove, the cynicism isn't out of fear or panic; it's disappointment over and over and over. Like, if we had an army of virtual programmers fixing serious problems for open source projects, I'd be more excited about the possibilities than worried about the fact that I just lost my job. Honest to God. But the thing is, if that really were happening, we'd see it. And it wouldn't have to be forced and exaggerated all the time; it would be plainly obvious, the way AI art has absolutely flooded the Internet... except I don't give a damn if code is soulless as long as it's good, so it would possibly be more welcome. (The only issue is that it would most likely actually suck when that happens, and just be functional enough to get away with, but I like to try to be optimistic once in a while.)
You really make me want to try this, though. Imagine if it worked!
Someone will probably beat me to it if it can be done, though.
Imo they are still extremely limited compared to a senior coder. Take Python: most top-ranking models still struggle with our codebase. Every now and then I test a few, and handing them a complex part of the codebase to produce coherent features still fails. They require heavy handholding from our senior devs, who I am sure use AI as assistants.
> the cynicism isn't out of fear or panic, its disappointment over and over and over
Very much this. When you criticize LLM marketing, people will say you're a Luddite.
I'd bet that no one actually likes to write code, as in typing into an editor. We know how to do it, and it's easy enough to enter in a flow state while doing it. But everyone is trying to write less code by themselves with the proliferation of reusable code, libraries, framework, code generators, metaprogramming,...
I'd be glad if I could have a DAW or CAD like interface with very short feedback (the closest is live programming with Smalltalk). So that I don't have to keep visualizing the whole project (it's mentally taxing).
> no one actually likes to write code
between this and..
> But everyone is trying to write less code by themselves with the proliferation of reusable code, libraries, framework, code generators, metaprogramming
.. this, is a massive gap. Personally speaking, I hate writing boilerplate code - y'know, old-school Java with design patterns, getters/setters, redundant multi-layer catch blocks, stateful for loops, etc. That gets on my nerves because it increases my work for little benefit. Cue modern coding practices, and I'm almost exclusively thinking about how to design a solution to the problem at hand, and almost all the code I write is business logic.
This is where a lot of LLMs just fail. Handholding them all the way to correct solution feels like writing boilerplate again, except worse because I don't know when I'll be done. It doesn't help that most code available for LLMs is JS/TS/Java where boilerplate galore, but somehow I doubt giving them exclusively good codebases will help.
I like writing code. It's a fun and creative endeavor to figure out how to write as little as possible.
>I'd bet that no one actually likes to write code
And you'd be wrong. I, for one, enjoy the process of handcrafting the individual mechanisms of the systems I create.
Do you like writing all the if, def, public void, import keywords? That is what I'm talking about. I prefer an IDE for Java and other verbose languages because of the code generation. And I configure my editors with templates and snippets because I don't like wasting time entering every single character (and I learned vim because I can act on bigger units: words, lines, whole blocks).
I like programming, I do not like coding.
I'm not bothered by if or def. public void can be annoying, but it's also fast to type and doesn't bother me. For import I always try my best to have some kind of auto-import. I too use vim and use macros for many things.
To be honest, I'm more annoyed by having to repeat parameters three times in class constructors (args, member declaration, and assignment), and I have a macro for it.
The thing is, most of the time I know what I want to write before I start writing. At that point, writing the code is usually the fastest way to the result I want.
Using LLMs usually requires more writing and iterations; plus waiting for whatever it generates, reading it, understanding it and deciding if that's what I wanted; and then it suddenly goes crazy half way through a session and I have to start over...
> if that really were happening, we'd see it.
You're right, instead what we see is the emergence of "vibe coding", which I can best describe as a summoning ritual for technical debt and vulnerabilities.
The TypeScript and JavaScript business, though - the AIs definitely trained on old, old JavaScript.
I kinda think "JavaScript: The Good Parts" should be part of the prompt for generating TS and JS. I've seen too much of the AI writing the sketchy bad parts.
So yesterday I wanted to convert a color palette I had in Lua, stored as three RGB ints per color, to JavaScript 0x000000 notation. I sighed, rolled my eyes, but before I started this incredibly boring, mindless task, I asked Gemini if it would just do it for me. It worked, I was happy, and I moved on.
Something is happening, its just not exciting as some people make it sound.
Be a bit more careful with that particular use case. It usually works, but depending on circumstances, LLMs have a relatively high tendency to start making the wrong correlations and give you results that are not actually accurate. (Colorspace conversions make it more obvious, but I think even simpler problems can get screwed up.)
Of course, for that use case, you can _probably_ do a bit of text processing in your text processing tools of choice to do it without LLMs. (Or have LLMs write the text processing pipeline to do it.)
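As a concrete sketch of that text-processing route (assuming the palette is written as Lua tables of three ints, e.g. `{240, 113, 120}`; the exact formatting in a real file will vary):

```python
import re

def lua_rgb_to_js_hex(lua_text):
    """Turn Lua entries like {240, 113, 120} into JS-style 0xF07178 literals."""
    out = []
    for r, g, b in re.findall(r"\{\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\}", lua_text):
        # Each channel is zero-padded to two uppercase hex digits.
        out.append("0x{:02X}{:02X}{:02X}".format(int(r), int(g), int(b)))
    return out
```

For example, `lua_rgb_to_js_hex("{255, 0, 128}")` yields `["0xFF0080"]`. Unlike the LLM route, this is deterministic: either the regex matches your palette format or it visibly doesn't, but it never silently swaps a channel.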
Convert the GTK 3 and GTK 4 API documentation into a single `.txt` file each.
Upload one of your platform-specific C++ file's source, along with the doc `.txt` into your LLM of choice.
Either ask it for a conversion function-by-function, or separate it some other way logically such that the output doesn't get truncated.
Would be surprised if this didn't work, to be honest.
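The "separate it so the output doesn't get truncated" step can be done mechanically, e.g. by batching the source at blank-line boundaries under a size budget before sending each chunk to the model. A rough sketch (real C++ function boundaries would need an actual parser; this just assumes functions are separated by blank lines):

```python
def chunk_source(text, max_chars=4000):
    """Group blank-line-separated blocks into chunks of at most max_chars each.
    A single block larger than max_chars becomes its own oversized chunk."""
    blocks = text.split("\n\n")
    chunks, current = [], ""
    for block in blocks:
        candidate = (current + "\n\n" + block) if current else block
        if len(candidate) > max_chars and current:
            chunks.append(current)  # flush the current chunk and start a new one
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then gets its own "convert this to GTK4, using the attached docs" prompt, which keeps every response comfortably inside the output limit.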
Do you really need to provide the docs? I would have imagined that those docs are included in their training sets. There is even a guide on how to migrate from GTK3 to GTK4, so this seems to be a low-hanging fruit job for an LLM iff they are okay for coding.
Feeding them the docs makes a huge difference in my experience. The docs might be somewhere in the training set, but telling the LLM explicitly "use these docs before anything else" solves a lot of problems, like the LLM mixing up different versions of a library or confusing two different libraries with a similar API.
LLMs are not data archives. They are god-awful at storing data, and even calling them a lossy compression tool is a stretch, because it implies they are a compression tool for data.
LLMs will always benefit from in-context learning because they don't have a huge archive of data to draw on (and even when they do, they are not the best at selecting which data to incorporate).
You might not need to, but LLMs don't have perfect recall -- they're (variably) lossy by nature. Providing documentation is a pretty much universally accepted way to drastically improve their output.
It moves the model from 'sorta-kinda-maybe-know-something-about-this' to being grounded in the context itself. Huge difference for anything underrepresented (not only obscure packages and not-Python not-JS languages).
Docs make them hallucinate a lot less. Unfortunately, all those docs will eat up the context window. Claude has "projects" for uploading them, and Gemini 2.5+ just has a very large window, so maybe that's OK.
In my experience even feeding it the docs probably won't get it there, but it usually helps. It actually seems to work better if the document you're feeding it is also in the training data, but I'm not an expert.
The training set is huge and the model "forgets" some of the stuff, so providing docs in context makes sense; plus, the docs could be more up to date than the training set.
My coding challenges are all variations on "start with this 1.5M line Spring project, full of multi-thousand-line files..."
To do the challenge one would just need to understand the platform abstraction layer which is pretty small, and write 1K to 2K LOC. We don't even use much of the GUI toolkit functionality. I certainly don't need to understand the majority of a codebase to make meaningful contributions in specific areas.
But you are aware that their limited context length just won't be able to deal with this?
That's like saying that you're judging a sedan by its capability of performing the job of a truck.
Wait, you were being sarcastic?
I am indeed saying that a sedan is incapable of handling my gigantic open-pit superfund site.
But I'll go a little farther - most meaningful, long-lived, financially lucrative software applications are metaphorically closer to the open-pit mine than the adorable backyard garden that AI tools can currently handle.
FWIW, what I want most in Solvespace is a way to do chamfers and fillets.
And a way to define parameters (not sure if that's already possible).
>> FWIW, what I want most in Solvespace is a way to do chamfers and fillets.
I've outlined a function for that and started to write the code. At a high level it's straightforward, but the details are complex. It'll probably be a year before it's done.
>> And a way to define parameters (not sure if that's already possible).
This is an active work in progress. A demo was made years ago, but it's buggy and incomplete. We've been working out the details on how to make it work. I hope to get the units issue dealt with this week. Then the relation constraints can be re-integrated on top - that's the feature where you can type arbitrary equations on the sketch using named parameters (variables). I'd like that to be done this year if not this summer.
While I second the same request, I'm also incredibly grateful for Solvespace as a tool. It's my favorite MCAD program, and I always reach for it before any others. Thank you for your work on it!
Sounds great, thanks for all the good work!
By the way, if this would make things simpler, perhaps you can implement chamfering as a post-processing step. This makes it maybe less general, but it would still be super useful.
> I'm not going to hold my breath.
The snark and pessimism nerd-sniped me :)
I've used AI heavily to maintain a cross-platform wrapper around llama.cpp. I figure its worth a shot.
I took a look and wanted to try but hit several hard blocks right away.
- There is no gtk-4 branch :o (presuming branch = git branch... Perhaps this is some project-specific terminology for a set of flags or something, and that's why I can't find it?)
- There's some indicators it is blocked by wxWidgets requiring GTK-4 support, which sounds much larger scope than advertised -- am I misunderstanding?
You guys really need a Docker build. This dependency chain with submodules is a nightmare.
I'm a hater of complexity and build systems in general. Following the instructions for building solvespace on Linux worked for me out of the box with zero issues and is not difficult. Just copy some commands:
https://github.com/solvespace/solvespace?tab=readme-ov-file#...
>I'm a hater of complexity and build systems in general.
But you already have a complex CMake build system in place. Adding a standard Docker image with all the deps for devs to compile on would do nothing but make contributing easier, and would not affect your CI/CD/testing pipeline at all. I followed the readme and spent half an hour trying to get this to build for macOS before giving up.
If building your project for all supported environments requires anything more than a single one-line command, you're doing it wrong.
>> But you already have a complex cmake build system in place.
I didn't build it :-(
>> Adding a standard Docker image with all the deps for devs to compile on would do nothing but make contributing easier, and would not affect your CI/CD/testing pipeline at all.
I understand, but to me that's just more stuff to maintain and learn. Everyone wants to push their build setup upstream - snap packages, flatpak, now we need docker... And then you and I complain that the build system is complex, partly because it supports so many options. But it looks like the person taking up the AI challenge here is using Docker, so maybe we'll get that as a side effect :-)
I'm sympathetic in general, but in this case:
"You will need git, XCode tools, CMake and libomp. Git, CMake and libomp can be installed via Homebrew"
That really doesn't seem like much. Was there more to it than this?
Edit: I tried it myself and the cmake configure failed until I ran `brew link --force libomp`, after which it could start to build, but then failed again at:
[ 55%] Building CXX object src/CMakeFiles/solvespace-core.dir/bsp.cpp.o
c++: error: unknown argument: '-Xclang -fopenmp'
Alternative perspective: you kids with your Docker builds need to roll up your sleeves and learn how to actually compile a semi-complicated project if you expect to be able to contribute back to said project.
If your project is hard to build, that's your problem, not mine. I'll simply spend my time working on projects that respect it.
I can see both perspectives! But honestly, making a project easier to build is almost always a good use of time if you'd like new people to contribute.
>"Alternative perspective: you kids with your Docker builds need to roll up your sleeves and learn how to actually compile a semi-complicated project if you expect to be able to contribute back to said project."
Well, that attitude is probably why the issue has been open for 2 years.
Send the whole repo to AI Studio using my vibe coded tool `llm_globber` and let Gemini chew on it. You can get this done in a few hours.
Curious if you’ve tried this yourself yet? I’d love to see side by side of a human solo vs a human with copilot for something like this. AI will surely make mistakes so who will be faster / have better code in the end?
>> Curious if you’ve tried this yourself yet?
Yes. I did a lot of the 3->4 prep work. But there were so many API changes... I attempted to do it by commenting out anything that wouldn't build and then bringing it back incrementally by doing it the GTK4 way. So much got commented out that it was just a big mess of stubs with dead code inside.
I suspect the right way to do it is from scratch as a new platform. People have done this, but it requires more understanding of the platform abstraction and how it's supposed to work (it's not my area of the code). I just wanted to "convert" what was there, and failed.
Break it down into smaller problems.
What’s the point of a one-to-one GTK3 → GTK4 rewrite when the user experience doesn’t improve at all?
Why not modularize the backend and build a better UI with tech that’s actually relevant in 2025?
I'm not the person you are asking but the point of this whole thing seems to be as a test for how possible it is for an LLM to 'vibe code' a port of this nature and not really because they care that much about a port existing.
The fact that they haven't done the port in the normal way suggests they basically agree with what you said here (not worth the ROI), but hey if you can get the latest AI code editor to spit out a perfectly working port in minutes, why not?
FWIW, my assessment of LLMs is the same as theirs. The hype is far greater than the practical usefulness, and I say this as someone who is using LLMs pretty regularly now.
They aren't useless, but the idea that they will be writing 90% of our code soon is just completely at odds with my day to day experience getting them to do actual specific tasks rather than telling them to "write Tetris for XYZ" and blog about how great they are because it produced something roughly what I asked for without much specificity.
> Why not modularize the backend and build a better UI with tech that’s actually relevant in 2025?
Doing the second part is to my understanding actually the purpose of the stated task.
Why are you calling GTK4 irrelevant? Large swaths of Linux run on it and GTK3
Might be someone implying that electron is a superior (modern) solution. Which, if so, I whole heartedly disagree with.
> Why are you calling GTK4 irrelevant?
Quite the opposite: Gtk4 is relevant, and porting Solvespace to this relevant toolkit is the central part of the stated task.
>> What’s the point of a one-to-one GTK3 → GTK4 rewrite when the user experience doesn’t improve at all?
I'd like to use the same UI on all platforms so that we can do some things better (like localization in the text window and resizable text) and my preference for that is GTK. I tried doing it myself, got frustrated, and stopped because there are more important things to work on.
It's not AI, but I have good news for you, though: what you seek already exists!
This does not look like a Gtk4 port of Solvespace, but like another independent CAD application that uses Gtk4 for its GUI on GNU/Linux.
Yes, we are all well aware of Dune3d. I'm a big fan of Lukas K's work. In fact I wish he had done our GTK port first, and then forked Solvespace to use Open Cascade to solve the problems he needed to address. That would have given me this task for free ;-) We are not currently planning to incorporate OCCT but to simply extend and fix the small NURBS kernel that Solvespace already has.
Can you comment on the business case here? I think there was a Blender add on that uses Solvespace under the hood to give it CAD-like functionality.
I don’t know any pros using Solvespace by itself, and my own opinion is that CAD is the wrong paradigm for most of the things it’s used for anyway (like highway design).
GTK is an abomination of a UI framework. You should be looking for another way to manage your UI entirely, not trying to keep up with the joneses, who will no doubt release something new in short order and set yet another hoop to jump through, without providing any benefit to you at all.
It's openly hostile to not consider the upgrade path of existing users, and make things so difficult that it requires huge lifts just to upgrade versions of something like a UI framework.
>> GTK is an abomination of a UI framework.
I respectfully disagree with that. I think it's a solid UI framework, but...
>> It's openly hostile to not consider the upgrade path of existing users, and make things so difficult that it requires huge lifts just to upgrade versions of something like a UI framework.
I completely agree with you on that. We barely use any UI widgets so you'd think the port would be easy enough. I went through most of the checklist for changes you can make while still using GTK3 in prep for 4. "Don't access event structure members directly, use accessor functions." OK I made that change which made the code a little more verbose. But then they changed a lot of the accessor functions going from 3 to 4. Like WTF? I'm just trying to create a menu but menus don't exist any more - you make them out of something else. Oh and they're not windows they are surfaces. Like why?
I hope with some of the big architectural changes out of the way they can stabilize and become a nice boring piece of infrastructure. The talk of regular API changes every 3-5 years has me concerned. There's no reason for that.