The benchmark this article relies on is here[1]; as far as I can tell, it focuses on a single-user case and uses QwQ-32B-AWQ with a context length of 4096.
Which, from what I can gather, is quite an unrealistic setup for someone seriously considering buying an H100 for their home lab.
That said, the tensor parallelism[2] of vLLM is interesting (rough sketch of the idea after the links).
[1]: https://github.com/DeutscheKI/llm-performance-tests#vllm-per...
[2]: https://rocm.blogs.amd.com/artificial-intelligence/tensor-pa...
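For anyone unfamiliar with the idea: tensor parallelism shards the weight matrices themselves across GPUs, so every device works on every token. A toy numpy sketch of the column-parallel case (shapes are arbitrary; in practice this is fused GPU kernels plus an all-gather, not numpy):

    # Toy illustration of tensor parallelism: the weight matrix is split
    # column-wise across two "devices", each computes its shard of the
    # matmul, and the shards are gathered back together.
    import numpy as np

    x = np.random.randn(1, 4096)         # activations
    W = np.random.randn(4096, 8192)      # full weight matrix
    W0, W1 = np.split(W, 2, axis=1)      # shard across 2 GPUs

    y0 = x @ W0                          # computed on GPU 0
    y1 = x @ W1                          # computed on GPU 1
    y = np.concatenate([y0, y1], axis=1) # the all-gather step

    assert np.allclose(y, x @ W)         # same result as the unsharded op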
I'm the one who did the original benchmarks. My goal is a coding assistant that runs fully offline in my basement. A context length of 4096 tokens is actually long enough for this model to generate and refactor a reasonably sized code file; I verified that with a prototype JetBrains plugin before I started benchmarking. And keep in mind that you don't need to reserve context space for the code you send in, because vLLM handles that with its KV cache.
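For concreteness, a minimal sketch of that kind of single-user setup with vLLM's offline Python API; the model id and flags here are illustrative, not necessarily the exact benchmark config:

    # Sketch of a single-user vLLM setup with a 4096-token context and
    # automatic prefix caching, so resent code is served from the KV cache.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/QwQ-32B-AWQ",
        quantization="awq",
        max_model_len=4096,          # enough for one reasonably sized file
        enable_prefix_caching=True,  # reuse KV cache for repeated prefixes
    )
    prompt = "Refactor this function to remove the nested loops:\n..."
    result = llm.generate([prompt], SamplingParams(max_tokens=1024))
    print(result[0].outputs[0].text)

Prefix caching is what keeps the repeated "here is my file" prompts cheap: an identical prompt prefix hits the KV cache instead of being re-encoded.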
But yes, this is purely a single-user setup. And since H100s are optimized for cost-efficient hosting of many concurrent users, it's kind of obvious that they are not a good choice here.
Yeah, I was thinking more that if you're seriously considering an H100 in your basement, it's either for training or for running such medium-sized models with very long contexts for RAG or something.
Not that I'm an expert.
This is not a real-world goal, though, because a real-world user would just rent GitHub Copilot or similar.
Unless, of course, NDAs prevent you from using outside cloud compute. I'm benchmarking possible on-prem solutions because I know that I'll need an on-prem AI. But yeah, I'm not a normal user.
Prices keep going up; if you're getting value from LLMs, being able to self-host is valuable expertise. Right now everything is VC-subsidized, so any cost-benefit analysis is warped.
Don't forget about inference versus training workloads. Some people are making hardware, e.g. www.cohere.com
I would immediately disqualify them: "The all-in-one platform for private and secure AI", and no pricing for purchasing hardware; only API keys and per-token prices are mentioned.
That's basically "trust me, bro" and certainly not something I'd stake NDA compliance on.
Actually, yes, that website is quite different from talking to the engineers in person. Isn't that a tip-of-the-iceberg indicator of the difference between engineering and senior management? To be clear, I don't have any non-public details about that company.