> The number system differentiates the models based on capability; any other method would not do that.
Please rank GPT-4, GPT-4 Turbo, GPT-4o, GPT-4.1-nano, GPT-4.1-mini, GPT-4.1, GPT-4.5, o1-mini, o1, o1 pro, o3-mini, o3-mini-high, o3, and o4-mini in terms of capability without consulting any documentation.
Btw, as someone who agrees with your point, what’s the actual answer to this?
Of these, some are mostly obsolete: GPT-4 and GPT-4 Turbo are worse than GPT-4o in both speed and capabilities. o1 is worse than o3-mini-high in most aspects.
Then, some are not available yet: o3 and o4-mini. GPT-4.1 I haven't played with enough to give you my opinion on.
Among the rest, it depends on what you're looking for:
Multi-modal: GPT-4o > everything else
Reasoning: o1-pro > o3-mini-high > o3-mini
Speed: GPT-4o > o3-mini > o3-mini-high > o1-pro
(My personal favorite is o3-mini-high for most things, as it has a good tradeoff between speed and reasoning. Although I use 4o for simpler queries.)
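If you want to encode that kind of tradeoff in code rather than in your head, here's a minimal sketch using the official `openai` Python SDK. The task-to-model mapping and the `ask` helper are my own invention for illustration, not anything OpenAI ships, and the model choices just mirror the ranking above:

```python
# Minimal sketch: route a query to a model based on task type.
# Assumes the official `openai` SDK (pip install openai) and
# OPENAI_API_KEY set in the environment. The mapping is illustrative.
from openai import OpenAI

client = OpenAI()

# Task -> model, per the tradeoffs above. Note: o1-pro is a ChatGPT Pro
# tier; via the standard chat completions endpoint you'd use o1 instead.
MODEL_BY_TASK = {
    "multimodal": "gpt-4o",  # best multi-modal support
    "reasoning": "o1",       # slowest, strongest reasoning
    "default": "o3-mini",    # good speed/reasoning tradeoff
    "simple": "gpt-4o",      # quick answers for simpler queries
}

def ask(task: str, prompt: str) -> str:
    """Send `prompt` to whichever model the task calls for."""
    model = MODEL_BY_TASK.get(task, MODEL_BY_TASK["default"])
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("simple", "Summarize this thread in one sentence."))
```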
So where was o1-pro in the comparisons in OpenAI's article? I just don't trust any of these first party benchmarks any more.
It depends on how you define "capability", since that's different for reasoning and non-reasoning models.
What's the problem? For the layman it doesn't actually matter, and for the experts, it's usually very obvious which model to use.
LLMs fundamentally have the same constraints no matter how much juice you give them or how much you toy with the models.
That’s not true. I’m a layman and 4.5 is obviously better than 4o for me, definitely enough to matter.
You are definitely not a layman if you know the difference between 4.5 and 4o. The average user thinks ai = openai = chatgpt.
Well, okay, but I'm certainly not an expert who knows the fine differences between all the models available on chat.com. So I'm somewhere between your definition of "layman" and your definition of "expert" (as are, I suspect, most people on this forum).
If you know the difference between 4.5 and 4o, it'll take you 20 minutes max to figure out the theoretical differences between the other models, which is not bad for a highly technical emerging field.
There's no single ordering -- it really depends on what you're trying to do, how long you're willing to wait, and what kinds of modalities you're interested in.
I recognize this is a somewhat rhetorical question and your point is well taken. But something that maps well is car makes and models:
- Is Ford Better than Chevy? (Comparison across providers) It depends on what you value, but I guarantee there's tribes that are sure there's only one answer.
- Is the 6th gen 2025 4Runner better than 5th gen 2024 4Runner? (Comparison of same model across new releases) It depends on what you value. It is a clear iteration on the technology, but there will probably be more plastic parts that will annoy you as well.
- Is the 2025 BMW M3 base model better than the 2022 M3 Competition (Comparing across years and trims)? Starts to depend even more on what you value.
Providers need to delineate between releases, and years, models, and trims help do this. There are companies that will try to eschew this and go the Tesla route without model years, but still can't get away from it entirely. To a certain person, every character in "2025 M3 Competition xDrive Sedan" matters immensely; to another person it's just gibberish.
But a pure ranking isn't the point.
Yes, point taken.
However, it's still not as bad as Intel CPU naming in some generations or USB naming (until very recently). I know, that's a very low bar... :-)
Very easy with the naming system?
I meant this is actually straightforward if you've been paying even the remotest of attention.
Chronologically:
GPT-4, GPT-4 Turbo, GPT-4o, o1-preview/o1-mini, o1/o3-mini/o3-mini-high/o1-pro, GPT-4.5, GPT-4.1
Model iterations, by training paradigm:
SGD pretraining with RLHF: GPT-4 -> turbo -> 4o
SGD pretraining w/ RL on verifiable tasks to improve reasoning ability: o1-preview/o1-mini -> o1/o3-mini/o3-mini-high (technically the same product with a higher reasoning token budget; see the sketch after this list) -> o3/o4-mini (not yet released)
reasoning model with some sort of Monte Carlo search algorithm on top of reasoning traces: o1-pro
Some sort of training pipeline that does well with sparser data, but doesn't incorporate reasoning (I'm positing here; the training and architecture paradigms aren't that clear for this generation): gpt-4.5, gpt-4.1 (likely fine-tuned on 4.5)
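On the "same product with a higher reasoning token budget" point: as far as I can tell there's no separate o3-mini-high model in the API; the high/medium/low distinction is just the reasoning_effort parameter on o3-mini (hedging here, check the current API docs). Roughly:

```python
# Sketch: "o3-mini" vs "o3-mini-high" as one model with different
# reasoning budgets. Assumes the official `openai` SDK and that chat
# completions accepts `reasoning_effort` for o-series models.
from openai import OpenAI

client = OpenAI()

def solve(prompt: str, effort: str = "medium") -> str:
    """effort: "low" | "medium" | "high" -- more effort spends more
    reasoning tokens before answering (slower, usually better)."""
    resp = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort=effort,  # "high" ~= what ChatGPT calls o3-mini-high
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

quick = solve("Is 2^31 - 1 prime?", effort="low")
careful = solve("Is 2^31 - 1 prime?", effort="high")
```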
By performance: hard to tell! Depends on what your task is, just like with humans. There are plenty of benchmarks. Roughly, for me, the top 3 by task are:
Creative Writing: gpt-4.5 -> gpt-4o
Business Comms: o1-pro -> o1 -> o3-mini
Coding: o1-pro -> o3-mini (high) -> o1 -> o3-mini (low) -> o1-mini-preview
Shooting the shit: gpt-4o -> o1
This isn't to dismiss that their marketing nomenclature is bad, just to point out that it's not that confusing for people who are actively working with these models and have a reasonable memory of the past two years.