I'd like to see an honest attempt by someone to use one of these SOTA models to code an entire non-trivial app. Not a "vibe coding" flappy bird clone or minimal iOS app (call API to count calories in photo), but something real - say 10K LOC type of complexity, using best practices to give the AI all the context and guidance necessary. I'm not expecting the AI to replace the programmer - just to be a useful productivity tool when we move past demos and function writing to tackling real world projects.
It seems to me that where we are today, AI is only useful for very localized coding tasks, and even there mostly where the task is commonplace and the user knows enough to guide the AI when it's failing. I'm not at all convinced it's going to get much better until we have models that can actually learn (vs pre-trained) and are motivated to do so.
I use Cursor agent mode with Claude on my NextJS frontend and TypeScript GraphQL backend. It's a real, reasonably sized, production app that's a few years old (pre-ChatGPT).
I vibe code the vast majority of features nowadays. I generally don't need to write a single line of code. It often makes some mistakes, but the agent figures out that the tests fail, or it doesn't build, fixes it, and basically "one shots" it after doing its thing.
Only occasionally do I need to write a few lines of code or give it a hint when it gets stuck. But 99% of the code is written by Cursor.
When you say "vibe code" do you mean the true definition of that term, which is to blindly accept any code generated by the AI, see if it works (maybe agent mode does this) and move on to the next feature? Or do you mean prompt driven development, where although you are basically writing none of the code, you are still reading every line and maintain high involvement in the code base?
Kind of in between. I accept a lot of code without ever seeing it, but I check the critical stuff that could cause trouble. Or stuff that I know the AI is likely to mess up.
Specifically for the front end I mostly vibe code, and for the backend I review a lot of the code.
I will often follow up with prompts asking it to extract something to a function, or to not hardcode something.
That's pretty impressive - a genuine real-world use case where the AI is doing the vast majority of the work.
I made this NES emulator with Claude last week [0]. I'd say it was a pretty non-trivial task. It involved throwing a lot of NESDev docs, Disch mapper docs, and test rom output + assembly source code to the model to figure out.
I am considering training a custom LoRA on Atari ROMs to see if I could get a working game out of it with the LoRA's help. The thinking here is that Atari, NES, SNES, etc. ROMs are a lot smaller in size than a program that runs natively on whatever OS. Fewer lines of code for the LLM to write means less chance of a screw-up. Take the ROM, convert it to assembly, write very detailed captions for it, and train. If this works, it would enable anyone to create games with one prompt that are a lot higher quality than the stuff being made now, and with less complexity. If you made an emulator with the help of an LLM, that means it understands assembly well enough, so I think there might be hope for this idea.
Well, the assembly I put into it was written by humans intending it to be well understood by anyone reading it. By contrast, many NES games abuse quirks specific to the NES that you can't translate to any system outside of it. Understanding what that assembly code is doing also requires a complete understanding of those quirks, which LLMs don't seem to have yet (my Mapper 4 implementation still has some bugs because my IRQ handling isn't perfect, and many games rely on precise IRQ timing).
How would you characterize the overall structural complexity of the project, and degree of novelty compared to other NES emulators Claude may have seen during training?
I'd be a bit suspicious of an LLM getting an emulator right when all it has to go on is docs and no ability to test (since the pass criterion is "behaves the same as something you don't have access to")... Did you check to see the degree to which it may have been copying other NES emulators?
> How would you characterize the overall structural complexity of the project, and degree of novelty compared to other NES emulators Claude may have seen during training?
Highly complex, fairly novel.
Emulators themselves, for any chipset or system, have a very learnable structure: there are some modules, each having their own registers and ways of moving data between those registers, and perhaps ways to send interrupts between those modules. That's oversimplifying a bit, but if you've built an emulator once, you generally won't be blindsided when it comes to building another one. The bulk of the work lies in dissecting the hardware, which has already been done for the NES, and more open architectures typically have their entire pinouts and processes available online. All that to say - I don't think Claude would have difficulty implementing most emulators - it's good enough at programming and parsing assembly that as long as the underlying microprocessor architecture is known, it can implement it.
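To make that "learnable structure" concrete, here's a minimal sketch of the modules-with-registers-and-interrupt-lines shape described above. All names and details are illustrative assumptions, not code from the actual project:

```typescript
// Hypothetical sketch of the common emulator shape: each module owns its
// registers, advances on its own clock, and can signal other modules via
// interrupt lines. Not taken from any real NES emulator.

interface Module {
  step(cycles: number): void; // advance this module's clock
}

class Cpu implements Module {
  // a few 6502-style registers
  a = 0; x = 0; y = 0; pc = 0x8000; sp = 0xfd;
  irqPending = false;

  step(cycles: number): void {
    for (let i = 0; i < cycles; i++) {
      if (this.irqPending) {
        this.irqPending = false; // a real CPU would jump to the IRQ vector here
      }
      // fetch/decode/execute one instruction would go here
      this.pc = (this.pc + 1) & 0xffff;
    }
  }

  irq = (): void => { this.irqPending = true; };
}

class Ppu implements Module {
  dot = 0;
  scanline = 0;
  constructor(private cpuIrq: () => void) {}

  step(cycles: number): void {
    for (let i = 0; i < cycles; i++) {
      this.dot++;
      if (this.dot === 341) { // 341 dots per NES scanline
        this.dot = 0;
        this.scanline++;
        // a mapper like MMC3 would clock its IRQ counter around here
        // and eventually call this.cpuIrq()
      }
    }
  }
}
```

The point is just that once you've internalized this module/register/interrupt skeleton, most of the remaining work on any emulator is hardware dissection, which for the NES is already thoroughly documented.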
As far as other NES emulators go, this project does many things in non-standard ways; for instance, I use per-pixel rendering whereas many emulators use scanline rendering. I use an AudioWorklet with various mixing effects for audio, whereas other emulators use something much simpler or don't even bother fully implementing the APU. I can comfortably say there's no NES emulator out there written the way this one is written.
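For readers unfamiliar with the distinction, here is a rough illustration of per-pixel vs. scanline rendering. This is hypothetical code under assumed names, not the emulator's actual implementation:

```typescript
// Illustrative only: per-pixel rendering emits one pixel per PPU dot, so
// mid-scanline register writes (a common NES trick) take effect immediately;
// scanline rendering computes a whole row at once from state sampled at the
// start of the line.

const WIDTH = 256; // NES visible pixels per scanline

function renderPerPixel(
  framebuffer: Uint8Array,
  scanline: number,
  colorAt: (x: number, y: number) => number, // reads live PPU state each dot
): void {
  for (let x = 0; x < WIDTH; x++) {
    // state can change between dots; each pixel sees the latest state
    framebuffer[scanline * WIDTH + x] = colorAt(x, scanline);
  }
}

function renderScanline(
  framebuffer: Uint8Array,
  scanline: number,
  rowColors: ArrayLike<number>, // precomputed once per line
): void {
  framebuffer.set(rowColors, scanline * WIDTH);
}
```

The per-pixel approach costs more per frame but captures timing-sensitive effects that a line-at-a-time renderer can only approximate.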
> I'd be a bit suspicious of an LLM getting an emulator right when all it has to go on is docs and no ability to test (since the pass criterion is "behaves the same as something you don't have access to")... Did you check to see the degree to which it may have been copying other NES emulators?
Purely javascript-based NES emulators are few in number, and those that implement all aspects of the system even fewer, so I can comfortably say it doesn't copy any of the ones I've seen. I would be surprised if it did, since I came up with most of the abstractions myself and guided Claude heavily. While Claude can't get docs on its own, I can. I put all the relevant documentation in the context window myself, along with the test rom output and source code. I'm still commanding the LLM myself; it's not like I told Claude to build an emulator and left it alone for 3 days.
Interesting - thanks!
Even with your own expert guidance, it does seem impressive that Claude was able to complete a project like this without getting bogged down in the complexity.
I dunno what you would consider non-trivial. I am building a diffing plugin for neovim. The experience is... mixed. The fast progression at the start was impressive, but now, as the code base has grown, the issues show up. The code is a mess. Adding one feature breaks another, and so on. I have no problem using the agent on code that I know very well, because I can steer it in the exact direction I want. But vibe coding something I don't fully understand is a pain.
I've been using Claude 3.7 for various things, including game development tasks. The generated code usually requires editing, and it can't autonomously handle more than a few functions at once, but it's a fairly useful tool in terms of productivity. The logic part is also quite good: it can design out various ideas/algorithms and suggest some optimisations.
Tech stack is nothing fancy/rare but not the usual ReactJS slop either - it's C# with OpenGL.
I can't comment about the best practices though because my codebase follows none of them.
Yes, the user has to know enough to guide the AI when it's failing. So it can't exactly replace the programmer as it is now.
It really can't do niche stuff, however - like SIMD. Maybe it would be better if I compiled a cheat sheet of .NET SIMD snippets and how-tos, because this stuff isn't really on the internet in a coherent form at all. So it's highly unlikely that it was trained on it.
Interesting - thanks! This isn't the type of tech stack where I'd have expected it to do very well, so the fact that you're at least finding it productive is encouraging. The (only) function-level competency is similar to what I've experienced, though - not enough to encourage me to try anything more complex.
I know they are capable of more, but I also tire of people being so enamored with "bootstrap a brand new app" type AI coding - is that even a big part of our job? In 25 years of dev work, I've needed to do that for a commercial production app... twice? Three times? Help me deal with existing apps and codebases, please.
I'm at 3k LOC on a current Rust project I'm mostly vibe coding with my very limited free time. Will share when I hit 10k :)
Would you mind sharing what the project is, and which AI you are using? No sign so far of AI's usefulness slowing down as the complexity increases?
>Would you mind sharing what the project is
A Rust + WASM simulation of organisms in an ecosystem, with evolving neural networks and genes. Super fun to build and watch.
>which AI you are using?
Using ChatGPT/Claude/Gemini with a custom tool I built, similar to aider / Claude Code, except it's very interactive - like chatting with the AI as it suggests changes that I approve/decline.
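The approve/decline workflow described above can be sketched roughly as follows. This is purely a hypothetical shape for such a tool; the names and types are my own assumptions, not the actual tool's code:

```typescript
// Hypothetical core loop of an interactive aider-style tool: the model
// proposes diffs, the user approves or declines each one, and only
// approved diffs are applied.

interface ProposedChange {
  file: string;
  diff: string; // unified diff suggested by the model
}

type Decision = "approve" | "decline";

function applyDecisions(
  changes: ProposedChange[],
  ask: (change: ProposedChange) => Decision, // e.g. prompt in the terminal
  apply: (change: ProposedChange) => void,   // e.g. patch the file on disk
): number {
  let applied = 0;
  for (const change of changes) {
    if (ask(change) === "approve") {
      apply(change);
      applied++;
    }
  }
  return applied; // how many suggestions made it into the codebase
}
```

The interesting design choice is keeping the human in the loop per-diff rather than per-session, which trades speed for the kind of involvement described earlier in the thread.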
>No sign so far of AI's usefulness slowing down as the complexity increases?
The AI is not perfect; there are some cases where it is unable to solve a challenging issue and I must step in. This usually happens with big sweeping changes that touch code all over the codebase. It introduces bugs, but it can also debug them easily, especially with Rust's extensive compile-time checking. Runtime bugs are harder, because I have to tell the AI the behavior I observe. Iterating on UI design is clumsy, and it's often faster for me to just make the changes myself.
Thanks - sounds like a fun project!
Given that you've built your own coding tool, I assume this is as much about testing what AI can do as it is about the project itself? Is it a clear win as far as productivity goes?
I'm most interested in building cool projects, and I have found AI to be a major multiplier to that effort. One of those cool projects was a custom coding tool, which I now use with all my projects, and continue to polish as I use it.
As far as productivity, it's hard for me to quantify, but most of these projects would not be feasible for me to pursue with my limited free time without the force multiplier of AI.
No one links their AI code - have you noticed?