You don't need 16-bit precision. The accuracy drop from quantizing most models to 8-bit is less than 5%.
Even 4-bit is fine.
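If you want to check the quality hit yourself, one common route (not the only one) is 4-bit NF4 loading through bitsandbytes in Hugging Face transformers. A minimal sketch, assuming you have transformers, accelerate, and bitsandbytes installed and enough VRAM; the model id is just an example, swap in whatever you actually want to test:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-7b-hf"  # example id; any causal LM works

    # 4-bit NF4 weights; matmuls still run in 16-bit compute.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )

    inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))

Run the same prompts at 16-bit and 4-bit and compare; for most chat-style use the outputs are hard to tell apart.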
To be more precise, it's not that there's no decrease in quality; it's that the RAM savings let you fit a much better model. E.g. with LLaMA, if you start with the 70B and quantize it more and more aggressively, it still performs considerably better at 3-bit than LLaMA 33B running at 8-bit.
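A back-of-the-envelope weight-memory estimate (params × bits ÷ 8, ignoring KV cache, activations, and quantization-group overhead) shows why the trade works out: the 3-bit 70B actually needs less RAM than the 8-bit 33B. A quick sketch:

    # Rough weight memory in GB: params * bits-per-weight / 8 bytes.
    # Ignores KV cache, activations, and per-group quantization overhead.
    def weight_gb(n_params_billion: float, bits_per_weight: float) -> float:
        return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

    print(f"70B @ 16-bit: {weight_gb(70, 16):.1f} GB")  # ~140 GB
    print(f"33B @  8-bit: {weight_gb(33, 8):.1f} GB")   # ~33 GB
    print(f"70B @  3-bit: {weight_gb(70, 3):.1f} GB")   # ~26 GB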
True. The one problem with lower-bit quantization, though, is that the model starts failing to follow long prompts.