emmelaich 4 days ago

Generally speaking, how can you tell how much VRAM a model will take? It seems to be a valuable bit of data that's missing from downloadable model (GGUF) files.

omneity 4 days ago

Very roughly, you can treat the B (billions of parameters) of a model as GB of memory, scaled by the quantization level. Say for an 8B model:

- FP16: 2x 8GB = 16GB

- Q8: 1x 8GB

- Q4: 0.5x 8GB = 4GB

It doesn't map 100% neatly like this, but it gives you a rough measure. On top of this you need some more memory depending on the context length and some other overhead.

Rationale for the calculation above: a model is basically billions of variables, each holding a floating-point value. So the size of a model roughly maps to the number of variables (weights) x the precision of each variable (4, 8, 16 bits...).
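A minimal sketch of that back-of-the-envelope calculation; the overhead figure here is just an illustrative guess for context/runtime buffers, not something read out of the GGUF:

```python
# Rough VRAM estimate: parameter count x bytes per weight, plus a
# flat overhead guess for KV cache and runtime buffers (illustrative).

def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead_gb: float = 1.5) -> float:
    """Weights-only size plus an assumed flat overhead."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb

# An 8B model at different precisions:
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{estimate_vram_gb(8, bits):.1f} GB")
# 16-bit: ~17.5 GB, 8-bit: ~9.5 GB, 4-bit: ~5.5 GB
```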

You don't have to quantize all layers to the same precision, which is why you sometimes see fractional quantizations like 1.58 bits.

rhdunn 4 days ago

The 1.58-bit quantization uses 3 values: -1, 0, 1. The bits number comes from log_2(3) = 1.58...

At that level you can pack 4 weights into a byte using 2 bits per weight. However, one of the four 2-bit configurations goes unused.
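A toy sketch of that 2-bits-per-weight packing; the particular mapping of values to bit patterns is an arbitrary choice for illustration:

```python
# Pack 4 ternary weights (-1, 0, 1) into one byte, 2 bits each.
# One of the four 2-bit patterns (0b11) is never used.

ENCODE = {-1: 0b00, 0: 0b01, 1: 0b10}
DECODE = {v: k for k, v in ENCODE.items()}

def pack4(weights):
    """Pack exactly 4 ternary weights into one byte."""
    byte = 0
    for i, w in enumerate(weights):
        byte |= ENCODE[w] << (2 * i)
    return byte

def unpack4(byte):
    return [DECODE[(byte >> (2 * i)) & 0b11] for i in range(4)]

assert unpack4(pack4([-1, 0, 1, 1])) == [-1, 0, 1, 1]
```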

More complex packing arrangements group weights together (e.g. a group of 3) and assign a bit pattern to each combination of values via a lookup table. This allows greater compression, closer to the theoretical 1.58-bit value (a group of 3 fits in 5 bits, i.e. about 1.67 bits per weight).
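A sketch of that grouped packing, treating each group as a single base-3 index (which stands in for the lookup table); larger groups get closer to the 1.58 bits/weight floor:

```python
import math

# Treat a group of n ternary weights as one base-3 number and store
# its index in ceil(log2(3^n)) bits.

def pack_group(weights):
    """Map ternary weights (-1, 0, 1) to a single integer index."""
    index = 0
    for w in weights:
        index = index * 3 + (w + 1)   # shift values to 0..2
    return index

def unpack_group(index, n):
    weights = []
    for _ in range(n):
        weights.append(index % 3 - 1)
        index //= 3
    return list(reversed(weights))

assert unpack_group(pack_group([-1, 0, 1]), 3) == [-1, 0, 1]

for n in (1, 3, 5):
    bits = math.ceil(math.log2(3 ** n))
    print(f"group of {n}: {bits} bits -> {bits / n:.2f} bits/weight")
# group of 1: 2.00, group of 3: 1.67, group of 5: 1.60 bits/weight
```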

fennecfoxy 4 days ago

Depends on quantization etc. But there are good calculators that will also account for your KV cache and so on: https://apxml.com/tools/vram-calculator.
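For a rough sense of the KV-cache part those calculators handle, a small sketch; the shape numbers below are assumed for illustration (roughly a Llama-3-8B-style config with grouped-query attention), not taken from any particular file:

```python
# KV cache size: K and V tensors for every layer, for every token
# in the context, at the cache's element precision.

def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len,
                bytes_per_elem=2, batch=1):
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_elem * batch) / 1e9

# Assumed example shape: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
print(f"~{kv_cache_gb(32, 8, 128, 8192):.2f} GB at 8k context")
# ~1.07 GB, on top of the weights themselves
```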