GPT-4.1 is probably a distilled version of GPT-4.5
I don't understand the constant complaining about naming conventions. The number system differentiates the models by capability; any other method would not do that. After ten models with random names like "gemini" or "nebula" you would have no idea which is which. It's a low-IQ take. You don't name new versions of software as completely different software
Also, yesterday, using v0, I replicated a full Next.js UI copying a major SaaS player. No backend integration, but the design and UX were stunning, and better than I could do if I tried. I have 15 years of backend experience at FAANG. Software will get automated, and it already is; people just haven't figured it out yet
> The number system differentiates the models based on capability, any other method would not do that.
Please rank GPT-4, GPT-4 Turbo, GPT-4o, GPT-4.1-nano, GPT-4.1-mini, GPT-4.1, GPT-4.5, o1-mini, o1, o1 pro, o3-mini, o3-mini-high, o3, and o4-mini in terms of capability without consulting any documentation.
Btw, as someone who agrees with your point, what’s the actual answer to this?
Of these, some are mostly obsolete: GPT-4 and GPT-4 Turbo are worse than GPT-4o in both speed and capabilities. o1 is worse than o3-mini-high in most aspects.
Then, some are not available yet: o3 and o4-mini. GPT-4.1 I haven't played with enough to give you my opinion on.
Among the rest, it depends on what you're looking for:
Multi-modal: GPT-4o > everything else
Reasoning: o1-pro > o3-mini-high > o3-mini
Speed: GPT-4o > o3-mini > o3-mini-high > o1-pro
(My personal favorite is o3-mini-high for most things, as it has a good tradeoff between speed and reasoning. Although I use 4o for simpler queries.)
So where was o1-pro in the comparisons in OpenAI's article? I just don't trust any of these first party benchmarks any more.
It depends on how you define "capability," since that's different for reasoning and non-reasoning models.
What's the problem? For the layman it doesn't actually matter, and for the experts it's usually very obvious which model to use.
LLMs fundamentally have the same constraints no matter how much juice you give them or how much you toy with the models.
That’s not true. I’m a layman and 4.5 is obviously better than 4o for me, definitely enough to matter.
You are definitely not a layman if you know the difference between 4.5 and 4o. The average user thinks ai = openai = chatgpt.
Well, okay, but I'm certainly not an expert who knows the fine differences between all the models available on chat.com. So I'm somewhere between your definition of "layman" and your definition of "expert" (as are, I suspect, most people on this forum).
If you know the difference between 4.5 and 4o, it'll take you 20 minutes max to figure out the theoretical differences between the other models, which is not bad for a highly technical emerging field.
There's no single ordering -- it really depends on what you're trying to do, how long you're willing to wait, and what kinds of modalities you're interested in.
I recognize this is a somewhat rhetorical question and your point is well taken. But something that maps well is car makes and models:
- Is Ford Better than Chevy? (Comparison across providers) It depends on what you value, but I guarantee there's tribes that are sure there's only one answer.
- Is the 6th gen 2025 4Runner better than 5th gen 2024 4Runner? (Comparison of same model across new releases) It depends on what you value. It is a clear iteration on the technology, but there will probably be more plastic parts that will annoy you as well.
- Is the 2025 BMW M3 base model better than the 2022 M3 Competition (Comparing across years and trims)? Starts to depend even more on what you value.
Providers need to delineate between releases, and years, models, and trims help do this. There are companies that will try to eschew this and go the Tesla route without model years, but even they can't get away from it entirely. To a certain person, every character in "2025 M3 Competition xDrive Sedan" matters immensely; to another person it's just gibberish.
But a pure ranking isn't the point.
Yes, point taken.
However, it's still not as bad as Intel CPU naming in some generations or USB naming (until very recently). I know, that's a very low bar... :-)
Very easy with the naming system?
I meant this is actually straightforward if you've been paying even the remotest of attention.
Chronologically:
GPT-4, GPT-4 Turbo, GPT-4o, o1-preview/o1-mini, o1/o3-mini/o3-mini-high/o1-pro, gpt-4.5, gpt-4.1
Model iterations, by training paradigm:
SGD pretraining with RLHF: GPT-4 -> turbo -> 4o
SGD pretraining w/ RL on verifiable tasks to improve reasoning ability: o1-preview/o1-mini -> o1/o3-mini/o3-mini-high (technically the same product with a higher reasoning token budget) -> o3/o4-mini (not yet released)
reasoning model with some sort of Monte Carlo Search algorithm on top of reasoning traces: o1-pro
Some sort of training pipeline that does well with sparser data, but doesn't incorporate reasoning (I'm positing here, training and architecture paradigms are not that clear for this generation): gpt-4.5, gpt-4.1 (likely fine-tuned on 4.5)
By performance: hard to tell! Depends on what your task is, just like with humans. There are plenty of benchmarks. Roughly, for me, the top 3 by task are:
Creative Writing: gpt-4.5 -> gpt-4o
Business Comms: o1-pro -> o1 -> o3-mini
Coding: o1-pro -> o3-mini (high) -> o1 -> o3-mini (low) -> o1-mini-preview
Shooting the shit: gpt-4o -> o1
It's not to dismiss that their marketing nomenclature is bad, just to point out that it's not that confusing for people who are actively working with these models and have a reasonable memory of the past two years.
> You don't name new versions of software as completely different software
macOS releases would like a word with you.
https://en.wikipedia.org/wiki/MacOS#Timeline_of_releases
Technically they still have numbers, but Apple hides them in marketing copy.
Though they still have “macOS” in the name. I’m being tongue-in-cheek.
> Yesterday, using v0, I replicated a full nextjs UI copying a major saas player. No backend integration, but the design and UX were stunning, and better than I could do if I tried.
Exactly. Those who do frontend or focus on pretty much anything JavaScript are, how should I say it? Cooked?
> Software will get automated
The first to go are those that use JavaScript / TypeScript. Those engineers have already been automated out of a job. It is all over for them.
Yeah, it's over for them. Complicated business logic and sprawling systems are what are keeping backend safe for now. But the big front-end code bases, where individual files (like React components) are largely decoupled from the rest of the code base, are why front end is completely cooked
I have a medium-sized TypeScript personal project I work on. It probably has 20k LOC of well-organized TypeScript (React frontend, Express backend). I also have somewhat comprehensive docs and Cursor project rules.
In general I use Cursor in manual mode asking it to make very well scoped small changes (e.g. “write this function that does this in this exact spot”). Yesterday I needed to make a largely mechanical change (change a concept in the front end, make updates to the corresponding endpoints, update the data access methods, update the database schema).
This is something very easy I would expect a junior developer to be able to accomplish. It is simple, largely mechanical, but touches a lot of files. Cursor agent mode puked all over itself using Gemini 2.5. It could summarize what changes would need to be made, but it was totally incapable of making the changes. It would add weird hard coded conditions, define new unrelated files, not follow the conventions of the surrounding code at all.
TL;DR: I think LLMs right now are good for greenfield development (create this front end from scratch following common patterns) and for small, scoped changes to a few files. If you have any kind of medium-sized refactor on an existing code base, forget about it.
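To make the shape of that "simple but touches a lot of files" change concrete, here's a hypothetical TypeScript sketch (all names are invented for illustration, not from the commenter's actual project): renaming a domain concept ripples through the shared type, the data-access layer, and the endpoint handler, and each step is trivial on its own.

```typescript
// Hypothetical example: the concept "Tag" has been renamed to "Label".
// The rename must be applied consistently across every layer.

// 1. Shared type (consumed by the React frontend) -- was: interface Tag
interface Label {
  id: number;
  name: string;
}

// 2. Data-access layer -- was: getTagsForUser
const db: Record<number, Label[]> = {
  1: [{ id: 10, name: "urgent" }],
};

function getLabelsForUser(userId: number): Label[] {
  return db[userId] ?? [];
}

// 3. Express-style endpoint handler -- was: GET /api/tags
function handleGetLabels(userId: number): { labels: Label[] } {
  return { labels: getLabelsForUser(userId) };
}

console.log(handleGetLabels(1).labels[0].name); // prints "urgent"
```

No single step requires judgment, which is why it feels like junior-developer work; the difficulty for the agent is only in applying the same mechanical change coherently across many files.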
> Cursor agent mode puked all over itself using Gemini 2.5. It could summarize what changes would need to be made, but it was totally incapable of making the changes.
Gemini 2.5 is currently broken with the Cursor agent; it doesn't seem to be able to issue tool calls correctly. I've been using Gemini to write plans, which Claude then executes, and this seems to work well as a workaround. Still unfortunate that it's like this, though.
Interesting, I’ve found Gemini better than Claude so I defaulted to that. I’ll try another refactor in agent mode with Claude.
My personal opinion is that leveraging LLMs on a large code base requires skill. How you construct the prompt, what you keep in context, and which model you use all have a large effect on the output. If you just put it into Cursor and throw your hands up, you probably didn't do it right
I gave it a list of the changes I needed and pointed it to the areas of the different files that needed updating. I also have comprehensive Cursor project rules. If I needed to hand-hold any more than that, it would take considerably less time to just make the changes myself.
> Software will get automated, and it already is, people just havent figured it out yet
To be honest, I think this is most AI labs' (particularly the American ones') not-so-secret goal now, for a number of strong reasons. You can see it in this announcement, Anthropic's recent Claude 3.7 announcement, OpenAI's first planned agent (SWE-Agent), etc. They have to justify their worth somehow, and they see this as a potential path to do that. Remains to be seen how far they will get; I hope I'm wrong.
The reasons however for picking this path IMO are:
- Their usage statistics show coding as the main use: Anthropic recently released their stats. Coding has become the main usage of these models, with other usages at best being novelties or conveniences for people, in relative size. Without this market, IMO the hype would have already fizzled a while ago, with the product at best a novelty for the rest of the user base.
- They "smell blood" to disrupt, and fear is very effective for promoting their product: This IMO is the biggest one. Disrupting software looks to be an achievable goal, but it is also a goal with high engagement compared to other use cases. No point solving something awesome if people don't care, or only care for a while (e.g. meme image generation). You can see the developers on this site and elsewhere in fear. Fear is the best marketing tool ever, and the engagement can last years. It keeps people engaged and wanting to know more, talking about how "they are cooked" almost to the exclusion of everything else (i.e. focusing on the threat). Nothing motivates you to learn a product like the prospect of not being able to provide for yourself, your family, etc., to the point that most other tech topics/innovations are being drowned out by AI announcements.
- Many of them are losing money and need a market to disrupt: Currently the existing use cases of a chat bot are not yet impressive enough (or haven't been until very recently) to justify the massive valuations of these companies. It's coding that is allowing them to bootstrap into other domains.
- It is a domain they understand: AI devs know models, and they understand the software process. It may be a complex domain requiring constant study, but they know it back to front. This makes it a good first case for disruption, where the data and the know-how are already with the teams.
TL;DR: They are coming after you because it is a big fruit that is easier for them to pick than other domains. It's also one that people will notice, either out of excitement (CEOs, VCs, management, etc.) or out of fear (tech workers, academics, other intellectual workers).
> I don't understand the constant complaining about naming conventions.
Oh man. Unfolding my lawn chair and grabbing a bucket of popcorn for this discussion.
> using v0, I replicated a full nextjs UI copying a major saas player. No backend integration, but the design and UX were stunning
AI is amazing, now all you need to create a stunning UI is for someone else to make it first so an AI can rip it off. Not beating the "plagiarism machine" allegations here.
Here's a secret: most of the highest-funded VC-backed software companies are just copying a competitor with a slight product spin / different pricing model
> Jim Barksdale, used to say there’s only two ways to make money in business: One is to bundle; the other is unbundle
https://a16z.com/the-future-of-work-cars-and-the-wisdom-in-s...