fxtentacle 5 days ago

I'm the one who did the original benchmarks. My goal is to have a coding assistant that runs fully offline in my basement. A context length of 4096 tokens is actually long enough for this model to generate and refactor a reasonably sized code file. I verified that with a prototype JetBrains plugin before I started the benchmarking. And keep in mind that you don't need to reserve context space for the code you send in, because vLLM will handle that with its KV cache.
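
Roughly, a minimal single-user setup looks like this; the model path and sampling settings are placeholder assumptions, and enable_prefix_caching is the vLLM option that reuses KV-cache entries for a repeated prompt prefix:

    from vllm import LLM, SamplingParams

    # Hypothetical local model path; swap in whatever model you're benchmarking.
    llm = LLM(
        model="/models/coding-model",
        max_model_len=4096,          # the 4096-token context discussed above
        enable_prefix_caching=True,  # reuse KV cache for repeated prompt prefixes
    )

    source_code = "def add(a, b):\n    return a+b\n"
    prompt = "Refactor the following function for readability:\n" + source_code

    params = SamplingParams(temperature=0.2, max_tokens=512)
    out = llm.generate([prompt], params)
    print(out[0].outputs[0].text)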

But yes, this is purely for a single-user setup. And since H100s are optimized for cost-efficient hosting of multiple concurrent users, it's kind of obvious that they're not a good choice here.

magicalhippo 5 days ago

Yeah, I was thinking more that if you're seriously considering an H100 in your basement, it's either for training or for running medium-sized models with very long contexts for RAG or something.

Not that I'm an expert.

koakuma-chan 5 days ago

This is not a real-world goal, though, because a real-world user would just pay for GitHub Copilot or similar.

fxtentacle 5 days ago

Unless, of course, NDAs prevent you from using outside cloud compute. I'm benchmarking possible on-prem solutions because I know that I'll need an on-prem AI. But yeah, I'm not a normal user.

tmpz22 5 days ago

Prices keep going up; if you're getting value from LLMs, being able to self-host is valuable expertise. Right now everything is VC-subsidized, so cost-benefit analysis is warped.