Would a 48GB A6000 fully accommodate a 32B model? I assume quantizing below FP16 is still necessary?
At FP16 you'd need 64GB just for the weights, and it'd be about 2x as slow as a Q8 version, likely with little quality improvement. You'll also need room for the KV cache, context, etc., so 80-100GB (or even more) of VRAM would be better.
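For a rough sense of those numbers, here's a back-of-envelope sketch in Python (weight memory only, straight from parameter count and bit width; KV cache and runtime overhead come on top):

    # Weight-only memory estimate for a 32B-parameter dense model.
    # KV cache, activations and runtime overhead are not included.
    PARAMS = 32e9  # parameter count

    for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
        gb = PARAMS * bits / 8 / 1e9
        print(f"{name:4s} ~{gb:.0f} GB for weights alone")

    # FP16 ~64 GB, Q8 ~32 GB, Q4 ~16 GB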
Many people "just" use 4x consumer GPUs like the 3090 (24GB each), which scales well. They'd probably buy a mining rig frame, an EPYC CPU, a motherboard with enough PCIe lanes, PCIe risers, a 1600W PSU (you might need to power-limit the GPUs to 300W), and 128GB of RAM. Depending on what you pay for the GPUs, that'll run about $3.5-4.5k.
I haven't found a good case/riser/etc. combo I really like. Most of the mining gear wasn't made for PCIe x16.
Is that a problem? According to this, the GPUs don’t communicate that much once the weights are loaded: https://github.com/turboderp/exllama/discussions/16#discussi...
> So at FP16 precision that's a grand total of 16 kB you're transmitting over the PCIe bus, once per token. If you multiply by, say, 20 tokens per second, then you're still only using like 0.1% of your PCIe bandwidth.
Intra-GPU memory bandwidth is very important, but I've seen lots of people use just an x4 link and they didn't complain much.
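As a quick sanity check of the quoted estimate, here's a minimal sketch of that arithmetic, assuming a layer split where only the hidden state crosses the bus once per token (the hidden size, token rate and PCIe figure are illustrative assumptions, not measurements):

    # Per-token PCIe traffic for a model split across GPUs by layers.
    hidden_size = 8192        # hidden state width, e.g. a 70B-class model (assumption)
    bytes_per_value = 2       # FP16 activations
    tokens_per_second = 20    # generation speed (assumption)

    per_token = hidden_size * bytes_per_value       # bytes over the bus per token
    per_second = per_token * tokens_per_second
    pcie_x4_gen3 = 3.9e9      # ~3.9 GB/s usable on a PCIe 3.0 x4 link (rough figure)

    print(f"{per_token / 1024:.0f} kB/token, {per_second / 1024:.0f} kB/s, "
          f"{per_second / pcie_x4_gen3:.4%} of an x4 Gen3 link")
    # -> 16 kB/token, 320 kB/s, a vanishingly small share of even an x4 link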
Would it be better for energy efficiency and overall performance to use workstation cards like the A5000 or A4000? Those can be found on eBay.
Looks like the A4000 has low memory bandwidth (about 50% of a 4090?), which is usually the limiting factor for inference. But they are efficient; if you can get them cheap, they're probably a good entry setup. If you like running models that need a lot of VRAM, you'll likely run out of PCIe slots before you're done upgrading.
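Rough math on why bandwidth is the ceiling for single-stream decoding: each generated token has to stream (roughly) all the weights through the GPU once, so tokens/s is bounded by bandwidth divided by model size. The bandwidth numbers below are nominal spec-sheet figures and the model size is an assumption:

    # Upper bound on single-stream generation speed: bandwidth / weight bytes.
    model_bytes = 16e9   # e.g. a 32B model at ~4-bit (assumption)

    for gpu, bw in [("RTX A4000", 448e9), ("RTX 3090", 936e9), ("RTX 4090", 1008e9)]:
        print(f"{gpu}: ~{bw / model_bytes:.0f} tokens/s upper bound")
    # A4000 ~28, 3090 ~58, 4090 ~63 tokens/s (real throughput will be lower)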
You don't need 16-bit precision. The difference in accuracy from 8-bit is less than 5% in most models.
Even 4-bit is fine.
To be more precise, it's not that there's no decrease in quality; it's that with the memory savings you can fit a much better model. E.g. with LLaMA, if you start with 70B and quantize increasingly aggressively, you'll still get considerably better performance at 3-bit than LLaMA 33B running at 8-bit.
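The weight-only math behind that comparison (ignoring the small overhead of quantization formats like group scales):

    # Weight memory: heavily quantized 70B vs lightly quantized 33B.
    def weight_gb(params_billion, bits):
        return params_billion * bits / 8  # billions of params * bits / 8 bits per byte = GB

    print(f"LLaMA 70B @ 3-bit: ~{weight_gb(70, 3):.0f} GB")  # ~26 GB
    print(f"LLaMA 33B @ 8-bit: ~{weight_gb(33, 8):.0f} GB")  # ~33 GB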
True. The one problem with heavier quantization, though, is that the model fails to understand long prompts.