4o and 4.1 are not very good at coding
My best results are usually with o4-mini-high; o3 is sometimes pretty good
I personally don’t like the canvas. I prefer the output in the chat
And a lot of times I say: provide the full code for this file, or provide a drop-in replacement (when I don’t want to deal with all the diffs). But at around 300-400 lines of code it usually starts getting bad, and then I need to refactor to break things up into multiple files (unless I can focus on just one method inside a file)
o3 is shockingly good, actually. I can’t use it often due to rate limiting, so I save it for the odd occasion. Today I asked it how I could integrate a tree of Swift binary packages within an SDK and detect internal version clashes, and it gave a very well-researched and sensible overview. And gave me a new idea that I’ll try.
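For what it's worth, the clash-detection half of that can be done mechanically by walking the resolved dependency graph. Here's a minimal sketch, assuming you pipe in the JSON from `swift package show-dependencies --format json`; the `name`/`version`/`dependencies` field names are an assumption based on current SwiftPM output, so adjust them to whatever your tooling actually emits:

```swift
// Sketch: flag packages resolved at more than one version in a SwiftPM
// dependency tree. Reads the JSON of `swift package show-dependencies
// --format json` from stdin. Field names are assumed, not guaranteed.
import Foundation

struct Node: Decodable {
    let name: String
    let version: String   // the root node reports "unspecified"
    let dependencies: [Node]
}

// Recursively collect every (package name -> set of versions) pair.
func collect(_ node: Node, into versions: inout [String: Set<String>]) {
    versions[node.name, default: []].insert(node.version)
    for dep in node.dependencies { collect(dep, into: &versions) }
}

let data = FileHandle.standardInput.readDataToEndOfFile()
guard let root = try? JSONDecoder().decode(Node.self, from: data) else {
    fatalError("could not parse dependency JSON")
}

var versions: [String: Set<String>] = [:]
collect(root, into: &versions)

// Any package that shows up at more than one version is a clash candidate.
for (name, vs) in versions.sorted(by: { $0.key < $1.key }) where vs.count > 1 {
    print("clash: \(name) -> \(vs.sorted().joined(separator: ", "))")
}
```

Running it against each package in the tree gives a quick report of anything resolved at conflicting versions, though binary (XCFramework) clashes may still need manual inspection since they can hide outside the SwiftPM graph.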
I use o3 for anything math or coding related. 4o is good for things like "my knee hurts when I do this and that -- what might it be?"
In ChatGPT, at this point I use 4o pretty much only for image generation; it's the one feature that's unique to it and is mind-blowingly good. For everything else, I default to o3.
For coding, I stick to Claude 3.5 / 3.7 and recently Gemini 2.5 Pro. I sometimes use o3 in ChatGPT when I can't be arsed to fire up Aider, or really need to use its search features to figure out how to do something (e.g. pinouts for some old TFT screens for ESP32 and Raspberry Pi, most recently).
Drop-in replacement files per update are best done with the heavy test-time-compute models.
o1-pro and o1-preview can generate updated full-file responses into the 1k LOC range.
It's something about their internal verification methods that makes it an actually viable development method.
True. Also, the APIs don't care too much about restricting output length; they might actually be more verbose to charge more
It's interesting how the same model, served through different interfaces (chat vs API), can behave differently based on the economic incentives of the provider