From a data center perspective:
The article says this is about "sustained output token generation". For sustained usage, power is a huge factor in "real world performance". The H100 has a peak power draw of 700W, while each RTX 5090 has a peak power draw of 575W, for a total of 1150W.
According to the article it is 78 tokens per second for the H100 and 80 tokens per second for the dual RTX 5090. So you draw roughly 450W more power in exchange for only two extra tokens per second.
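Back-of-the-envelope on those numbers (assuming peak power roughly reflects sustained draw, which is generous to both setups):

    # Rough perf-per-watt comparison from the figures above.
    # Assumes peak power ~= sustained draw, which is optimistic for both cards.
    configs = {
        "H100":        {"watts": 700,  "tok_per_s": 78},
        "2x RTX 5090": {"watts": 1150, "tok_per_s": 80},
    }

    for name, c in configs.items():
        print(f"{name}: {c['watts'] / c['tok_per_s']:.1f} W per token/s, "
              f"{c['tok_per_s'] / c['watts'] * 1000:.0f} tokens/s per kW")
    # H100: ~9 W per token/s (~111 tokens/s per kW)
    # 2x RTX 5090: ~14 W per token/s (~70 tokens/s per kW)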
Long story short, there is a reason why data centers aren't using dual RTX 5090s over H100s. For sustained usage, you will pay for it in electricity, in the extra infrastructure to support that increased power draw, and in the extra heat generation and cooling.
Might make sense for a local personal hobby setup though.
> Might make sense for
Individual or small team use, personal or professional, when 32GB of VRAM (or 32+32) is sufficient, at a cost of $3.5k (or $3.5k + $3.5k) instead of $25k.
The RTX 5090 has a VRAM cost of about $110/GB; the H100, about $310/GB.
(And even professional use in a small team will probably not have the card run at full throttle and peak consumption all day, outside NN training projects.)
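Same back-of-the-envelope math on the per-GB figures (list prices assumed as above, H100 taken as the 80GB variant):

    # Rough VRAM cost comparison; prices and memory sizes as assumed above.
    rtx_5090 = 3_500 / 32    # ~ $110/GB for 32 GB
    h100     = 25_000 / 80   # ~ $310/GB for the 80 GB variant
    print(f"RTX 5090: ${rtx_5090:.0f}/GB, H100: ${h100:.0f}/GB")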
I would be surprised if the 5090 uses more power per FLOP. Peak power isn't always representative. After all, almost the entirety of the power goes into matrix multiplication, and that depends mostly on the process node and the architecture generation, and the 5090 is ahead on both.
The benchmark this article relies on is here[1], and is focused on a single-user case and uses QwQ-32B-AWQ with a context length of 4096 as far as I can see.
Which, from what I can gather, is quite an unrealistic setup for someone seriously considering buying an H100 for the home lab.
That said, the tensor parallelism[2] of vLLM is interesting.
[1]: https://github.com/DeutscheKI/llm-performance-tests#vllm-per...
[2]: https://rocm.blogs.amd.com/artificial-intelligence/tensor-pa...
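For context, a minimal sketch of roughly what that benchmark configuration looks like through vLLM's Python API. The exact model repo name and sampling settings are my guesses, not taken from the benchmark scripts:

    # Minimal vLLM offline-inference sketch: an AWQ-quantized 32B model split
    # across two GPUs via tensor parallelism, with the 4096-token context from [1].
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/QwQ-32B-AWQ",   # assumed HF repo name for the AWQ quant
        quantization="awq",
        tensor_parallel_size=2,      # shard the weights across both GPUs
        max_model_len=4096,          # context length used in the benchmark
    )

    params = SamplingParams(temperature=0.6, max_tokens=1024)
    out = llm.generate(["Refactor this function to remove duplication: ..."], params)
    print(out[0].outputs[0].text)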
I'm the one who did the original benchmarks. My goal is to have a coding assistant that runs fully offline in my basement. A context length of 4096 tokens is actually long enough for this model to generate and refactor a reasonably sized code file. I verified that with a prototype JetBrains plugin before I started the benchmarking. And keep in mind that you don't need to reserve context space for the code that you send in, because vLLM will handle that with its KV cache.
But yes, this is purely a single-user setup. And since the H100 is optimized for cost-efficient hosting of many concurrent users, it's kind of obvious that it's not a good choice here.
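For the curious, the plugin side is little more than pointing an OpenAI-compatible client at the local vLLM server; a rough sketch (port and model name are just whatever my setup happens to use):

    # Client-side sketch: vLLM's server speaks the OpenAI API,
    # so the plugin just talks to localhost instead of a cloud endpoint.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    resp = client.chat.completions.create(
        model="Qwen/QwQ-32B-AWQ",  # must match the model the server was started with
        messages=[{"role": "user", "content": "Refactor this method: ..."}],
        max_tokens=1024,
    )
    print(resp.choices[0].message.content)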
Yeah, I was thinking more that if you're seriously considering an H100 in your basement, it's either for training or for running such medium-sized models with very long contexts, for RAG or something.
Not that I'm an expert.
This is not a real-world goal, though, because a real-world user would just pay for GitHub Copilot or something similar.
Unless, of course, NDAs prevent you from using outside cloud compute. I'm benchmarking possible on-prem solutions because I know that I'll need an on-prem AI. But yeah, I'm not a normal user.
Prices keep going up. If you're getting value from LLMs, being able to self-host is valuable expertise. Right now everything is VC-subsidized, so the cost-benefit analysis is warped.
Don't forget about inference versus training workloads... some people are making hardware for it, e.g. www.cohere.com
I would immediately disqualify them: "The all-in-one platform for private and secure AI", yet no pricing for purchasing hardware; only API keys and per-token prices are mentioned.
That's basically "trust me, bro" and certainly not something I'd stake NDA compliance on.
Actually yes, that website is quite different from talking to the engineers in person. Isn't that a tip-of-the-iceberg indicator of the difference between engineering and senior management! To be clear, I don't have any non-public details about that company.