Oh, I have a question, maybe you know.
Assuming the same model size in gigabytes, which one should you choose: a higher-B (parameter count) model at a lower bit-width, or a lower-B model at a higher bit-width? Is there a silver bullet, like "yeah, always take 4-bit 13B over 8-bit 7B"?
Or are same-sized models basically equal in this regard?
I would say 9 times out of 10, you will get better results from a Q4 model that’s a size class larger than a smaller model at Q8. But it’s best not to go below Q4.
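For intuition on why those two land in the same size class, here's a back-of-envelope sketch of on-disk size from parameter count and bits per weight. It deliberately ignores real-world overhead (quantization scales, embeddings, metadata in formats like GGUF), so actual files run somewhat larger; the function name and numbers are illustrative, not from any specific tool.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough on-disk size: params * bits / 8, in GB (1 GB = 1e9 bytes).

    Ignores quantization block overhead and non-weight tensors,
    so real files are a bit bigger than this estimate.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


# 8-bit 7B vs 4-bit 13B: roughly the same size class on disk
print(model_size_gb(7, 8))    # ~7.0 GB
print(model_size_gb(13, 4))   # ~6.5 GB
```

So at comparable footprints, the 13B Q4 gives you almost twice the parameters for the same gigabytes, which is where the "take the bigger model at Q4" rule of thumb comes from.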
My understanding is that current models are undertrained and not very "dense", so Q4 doesn't hurt very much now, but it may hurt future, denser models more.
That may well be true. I know that earlier models like Llama 1 65B could tolerate more aggressive quantization, which supports that idea.