simonw 6 days ago

32B is one of my favourite model sizes at this point - large enough to be extremely capable (generally equivalent to GPT-4 March 2023 level performance, which is when LLMs first got really useful) but small enough you can run them on a single GPU or a reasonably well specced Mac laptop (32GB or more).

faizshah 6 days ago

I just started self-hosting as well on my local machine; I've been using https://lmstudio.ai/ locally for now.

I think the 32b models are actually good enough that I might stop paying for ChatGPT plus and Claude.

I get around 20 tok/second on my M3, and I can get 100 tok/second on smaller or quantized models. 80-100 tok/second is ideal for interactive usage; if you go above that you basically can't read as fast as it generates.

I also really like the QwQ reasoning model. I haven't gotten around to trying locally hosted models for agents and RAG yet; coding agents especially are what I'm interested in. I feel like 20 tok/second is fine if it's just running in the background.

Anyways, I would love to know others' experiences; that was mine this weekend. The way it's going I really don't see a point in paying. I think on-device is the near future, and they should just charge a licensing fee for enterprise support and updates, like DB providers do.

If you were paying $20/mo for ChatGPT a year ago, the 32b models are basically at that level: slightly slower and slightly lower quality, but useful enough to consider cancelling your subscriptions at this point.

wetwater 6 days ago

Are there any good sources I can read up on for estimating the hardware specs required to run 7B, 13B, 32B, etc. models locally? I am a grad student on a budget, but I want to host one locally and am trying to build a PC that could run one of these models.

coder543 6 days ago

"B" just means "billion". A 7B model has 7 billion parameters. Most models are trained in fp16, so each parameter takes two bytes at full precision. Therefore, 7B = 14GB of memory. You can easily quantize models to 8 bits per parameter with very little quality loss, so then 7B = 7GB of memory. With more quality loss (making the model dumber), you can quantize to 4 bits per parameter, so 7B = 3.5GB of memory. There are ways to quantize at other levels too, anywhere from under 2 bits per parameter up to 6 bits per parameter are common.

There is additional memory used for context / KV cache. So, if you use a large context window for a model, you will need to factor in several additional gigabytes for that, but it is much harder to provide a rule of thumb for that overhead. Most of the time, the overhead is significantly less than the size of the model, so not 2x or anything. (The size of the context window is related to the amount of text/images that you can have in a conversation before the LLM begins forgetting the earlier parts of the conversation.)
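
If you want a ballpark for that overhead, the KV cache is roughly 2 (K and V) x layers x KV heads x head dim x context length x bytes per element. A quick sketch, where the layer/head/dim numbers are assumptions in the ballpark of a modern 32B-class model rather than exact figures for any specific one:

    # Rough KV-cache estimate; the architecture numbers are illustrative assumptions.
    layers, kv_heads, head_dim = 64, 8, 128   # assumed 32B-class model with GQA
    context_len = 32_768
    bytes_per_elem = 2                        # fp16 cache

    kv_bytes = 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem
    print(f"~{kv_bytes / 1e9:.1f} GB of KV cache at {context_len} tokens")  # ~8.6 GB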

The most important thing for local LLM performance is typically memory bandwidth. This is why GPUs are so much faster for LLM inference than CPUs, since GPU VRAM is many times the speed of CPU RAM. Apple Silicon offers rather decent memory bandwidth, which makes the performance fit somewhere between a typical Intel/AMD CPU and a typical GPU. Apple Silicon is definitely not as fast as a discrete GPU with the same amount of VRAM.
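
A back-of-the-envelope way to see why bandwidth dominates: during single-stream generation, every weight gets read roughly once per token, so memory bandwidth divided by the model's size in bytes gives an upper bound on tokens/second. A sketch with illustrative bandwidth figures (real throughput will be lower):

    # Upper bound on decode speed: tokens/sec <= bandwidth / bytes of weights read per token
    # (ignores KV-cache reads and compute; the bandwidth figures below are rough assumptions).
    model_gb = 20  # e.g. a 32B model at ~5 bits per parameter
    for name, bw_gb_s in [("dual-channel DDR5 CPU", 90),
                          ("higher-end Apple Silicon", 400),
                          ("high-end discrete GPU", 1000)]:
        print(f"{name}: <= {bw_gb_s / model_gb:.0f} tok/s")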

That's about all you need to know to get started. There are obviously nuances and exceptions that apply in certain situations.

A 32B model at 5 bits per parameter will comfortably fit onto a 24GB GPU and provide decent speed, as long as the context window isn't set to a huge value.

wruza 6 days ago

Oh, I have a question, maybe you know.

Assuming the same model size in gigabytes, which should I choose: higher-B lower-bit or lower-B higher-bit? Is there a silver bullet, like "yeah, always take 4-bit 13B over 8-bit 7B"?

Or are same-sized models basically equal in this regard?

anon373839 5 days ago

I would say 9 times out of 10, you will get better results from a Q4 model that’s a size class larger than a smaller model at Q8. But it’s best not to go below Q4.
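
The reason that comparison is fair is that a size class up at Q4 lands in roughly the same memory footprint as the smaller model at Q8 (rough numbers; GGUF Q4/Q8 variants effectively use a bit more than 4 and 8 bits per weight):

    # A size class up at Q4 vs. staying smaller at Q8: similar weight footprint.
    # (~4.5 and ~8.5 effective bits per weight are rough figures for Q4/Q8 GGUF quants.)
    print(f"14B at ~4.5 bits: ~{14 * 4.5 / 8:.1f} GB")  # ~7.9 GB
    print(f" 7B at ~8.5 bits: ~{7 * 8.5 / 8:.1f} GB")   # ~7.4 GB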

nenaoki 5 days ago

My understanding is that models are currently undertrained and not very "dense", so Q4 doesn't hurt very much now, but it may hurt more with future, denser models.

anon373839 5 days ago

That may well be true. I know that earlier models like Llama 1 65B could tolerate more aggressive quantization, which supports that idea.

epolanski 6 days ago

So, in essence, all AMD has to do to launch a successful GPU in the inference space is load it with RAM?

TrueDuality 6 days ago

AMD's limitation is more of a software problem than a hardware problem at this point.

AuryGlenz 6 days ago

But it's still surprising they haven't. People would be motivated as hell if they launched GPUs with twice the amount of VRAM. It's not as simple as just soldering some more in, but still.

wruza 6 days ago

AMD “just” has to write something like CUDA overnight. Imagine you’re in 1995 and have to ship Kubuntu 24.04 LTS this summer running on your S3 Virge.

mirekrusin 5 days ago

They don't need to do anything software-wise; inference is a solved problem for AMD.

thomastjeffery 5 days ago

They sort of have. I'm using a 7900 XTX, which has 24GB of VRAM. The next competitor would be a 4090, which would cost more than double today; granted, it would also be much faster.

Technically there is also the 3090, which is more comparable price-wise. I don't know about its performance, though.

VRAM is supply-limited enough that going bigger isn't as easy as it sounds. AMD can probably sell as much as they can get their hands on, so they may as well sell more GPUs, too.

regularfry 5 days ago

Funnily enough, you can buy GPUs where someone has done exactly that: soldered extra VRAM onto a stock model.

yencabulator 5 days ago

Or let go of the traditional definition of a GPU, and go integrated. AMD Ryzen AI Max+ 395 with 128GB RAM is a promising start.

faizshah 6 days ago

Go to r/LocalLLaMA; they have the most info. There are also lots of good YouTube channels that have done benchmarks on Mac Minis for this (another good-value option with the student discount).

Since you’re a student most of the providers/clouds offer student credits and you can also get loads of credits from hackathons.

disgruntledphd2 6 days ago

A MacBook with 64GB of RAM will probably be the easiest. As a bonus, you can train PyTorch models on the built-in GPU.

It's really frustrating that I can't just write off Apple as evil monopolists when they put out hardware like this.

p_l 6 days ago

Generally, unquantized: double the number and that's the amount of VRAM in GB you need, plus some extra, because most models use fp16 weights, so it's 2 bytes per parameter -> 32B parameters = 64GB.

Typical quantization to 4-bit will cut a 32B model down to 16GB of weights, plus some runtime data, which makes it possibly usable (if slow) on a 16GB GPU. You can sometimes viably use smaller quantizations, which reduce memory use even more.

regularfry 5 days ago

You always want a bit of headroom for context. It's a problem I keep bumping into with 32B models on a 24GB card: the decent quants fit, but the context you have available on the card isn't quite as much as I'd like.

randomNumber7 6 days ago

Yes. You multiply the number of parameters by the number of bytes per parameter and compare it with the amount of GPU memory (or CPU RAM) you have.

regularfry 5 days ago

Qwq:32b + qwen2.5-coder:32b is a nice combination for aider, running locally on a 4090. It has to swap models between architect and edit steps so it's not especially fast, but it's capable enough to be useful. qwen2.5-coder does screw up the edit format sometimes though, which is a pain.

pixelHD 6 days ago

What spec is your local Mac?

wetwater 6 days ago

I've only recently started looking into running these models locally on my system. I have limited knowledge regarding LLMs and even more limited when it comes to building my own PC.

Are there any good sources I can read up on for estimating the hardware specs required to run 7B, 13B, 32B, etc. models locally?

TechDebtDevin 6 days ago

VRAM Required = Number of Parameters (in billions) × Number of Bytes per Parameter × Overhead[0].

[0]: https://twm.me/posts/calculate-vram-requirements-local-llms/
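
A minimal sketch of that formula; the 1.2x overhead factor here is just an illustrative assumption (the real overhead depends on the runtime and how much context you allocate):

    # VRAM (GB) ~= parameters (billions) x bytes per parameter x overhead factor.
    def vram_gb(params_billion, bytes_per_param, overhead=1.2):  # overhead is an assumption
        return params_billion * bytes_per_param * overhead

    print(f"32B at 4-bit: ~{vram_gb(32, 0.5):.0f} GB")  # ~19 GB
    print(f"32B at 8-bit: ~{vram_gb(32, 1.0):.0f} GB")  # ~38 GB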

manmal 6 days ago

Don’t forget to add a lot of extra space if you want a usable context size.

TechDebtDevin 6 days ago

Wouldn't that be your overhead var?

wetwater 6 days ago

That's neat! Thanks.

YetAnotherNick 6 days ago

I don't think these models are GPT-4 level. Yes, they seem to be on benchmarks, but it's well known that models increasingly use A/B testing in dataset curation and synthesis (using GPT-4-level models) to optimize not just the benchmarks themselves but also things that could be benchmarked, like academic subjects.

simonw 6 days ago

I'm not talking about GPT-4o here - every benchmark I've seen has had the new models from the past ~12 months out-perform the March 2023 GPT-4 model.

To pick just the most popular one, https://lmarena.ai/?leaderboard= has GPT-4-0314 ranked 83rd now.

th0ma5 6 days ago

How have you been able to tie benchmark results to better results?

simonw 6 days ago

Vibes and intuition. Not much more than that.

th0ma5 5 days ago

Don't you think that presenting this as learning or knowledge is unethical?

tosh 5 days ago

Also "GPT-4 level" is a bit loaded. One way to think about it that I found helpful is to split how good a model is into "capability" and "knowledge/hallucination".

Many benchmarks test "capability" more than "knowledge". There are many use cases where the model gets all the necessary context in the prompt. There, a model with good capability for the use case will do fine (e.g. as good as GPT-4).

That same model might hallucinate when you ask about the plot of a movie while a larger model like GPT-4 might be able to recall better what the movie is about.

Tepix 5 days ago

32B is also great for two 24GB GPUs if you want a nice context size and/or Q8 quantization which is usually very good.

int_19h 5 days ago

I don't think there's any local model other than full-sized DeepSeek (not distillations!) that is on the level of the original GPT-4, at least not in reasoning tasks. Scoreboards lie.

That aside, QwQ-32 is amazingly smart for its size.

clear_view 6 days ago

32B models don't fully fit in 16GB of VRAM. Still fine for higher-quality answers, and worth the extra wait in some cases.

abraxas 6 days ago

Would a 40GB A6000 fully accommodate a 32B model? I assume an fp16 quantization is still necessary?

manmal 6 days ago

At FP16 you'd need 64GB just for the weights, and it'd be 2x as slow as a Q8 version, likely with little improvement. You'll also need space for attention and context etc., so 80-100GB (or even more) of VRAM would be better.

Many people "just" use 4x consumer GPUs like the 3090 (24GB each), which scales well. They'd probably buy a mining rig, an EPYC CPU, a mainboard with sufficient PCIe lanes, PCIe risers, a 1600W PSU (might need to limit the GPUs to 300W), and 128GB of RAM. Depending on what you pay for the GPUs, that'll be 3.5-4.5k.

postalrat 6 days ago

I haven't found a good case/risers/etc. that I really like. Most of the miner stuff wasn't made for PCIe x16.

manmal 6 days ago

Is that a problem? According to this, the GPUs don’t communicate that much once the weights are loaded: https://github.com/turboderp/exllama/discussions/16#discussi...

> So at FP16 precision that's a grand total of 16 kB you're transmitting over the PCIe bus, once per token. If you multiply by, say, 20 tokens per second, then you're still only using like 0.1% of your PCIe bandwidth.

Intra-GPU memory bandwidth is very important, but I've seen lots of people use just an x4 link and they didn't complain much.

abraxas 6 days ago

Would it be better for energy efficiency and overall performance to use workstation cards like the A5000 or A4000? Those can be found on eBay.

manmal 6 days ago

Looks like the A4000 has low memory bandwidth (50% of a 4090?), which is usually the limiting factor for inference. But they are efficient - if you can get them cheap, probably a good entry setup? If you like running models that need a lot of VRAM, you'll likely run out of PCIe slots before you are done upgrading.

elorant 6 days ago

You don't need 16-bit quantization. The difference in accuracy from 8-bit in most models is less than 5%.

int_19h 5 days ago

Even 4-bit is fine.

To be more precise, it's not that there's no decrease in quality, it's that with the RAM savings you can fit a much better model. E.g. with LLaMA, if you start with 70b and increasingly quantize, you'll still get considerably better performance at 3-bit than LLaMA 33b running at 8-bit.

elorant 5 days ago

True. The only problem with lower-bit quantization, though, is that the model fails to understand long prompts.

buyucu 5 days ago

I prefer 24b because it's the largest model I can run on a 16GB laptop :)

redrove 6 days ago

Or quantized on a 4090!

osti 6 days ago

Are 5090's able to run 32B models?

regularfry 5 days ago

The 4090 can run 32B models in Q4_K_M, so yes, on that measure. Not unquantised though, nothing bigger than Q8 would fit. On a 32GB card you'll have more choices to trade off quantisation against context.