NVIDIA is showing training at 4 bits (NVFP4), and 4 bit quants have been standard for running LLMs at home for quite a while because the performance is good enough.
I mean, GPT-OSS is delivered as a 4 bit model, and apparently they even trained it at 4 bits. Most models are trained at 16 bits because it improves the stability of gradient descent, but there are methods that make training at lower precision work efficiently too.
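To make the idea concrete, here's a toy sketch of what squeezing weights into 4 bits looks like. This is my own illustration using plain int4 with per-group scales; the real formats (MXFP4, NVFP4, the GGUF Q4 variants) use shared exponents or FP4 values and differ in detail:

  import numpy as np

  def quantize_4bit(w, group_size=32):
      # Toy symmetric int4 quantization: one scale per group of 32 weights.
      # Not the actual MXFP4/NVFP4 formats, just an illustration of the precision loss.
      w = w.reshape(-1, group_size)
      scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map each group into roughly [-7, 7]
      q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
      return q, scale

  def dequantize(q, scale):
      return (q * scale).ravel()

  w = np.random.randn(4096 * 32).astype(np.float32)
  q, scale = quantize_4bit(w)
  print("mean abs error:", np.abs(dequantize(q, scale) - w).mean())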
There was a paper I had been looking at that demonstrated what I mentioned: it showed only imperceptible changes down to 6 bit quants, then performance decreasing more and more rapidly until, at 1 bit, it crossed over the next smaller model. Unfortunately, I can't seem to find it again.
There's this article from Unsloth, where they show MMLU scores for quantized Llama 4 models. The comparison is against an 8 bit base model, so it's not quite the same as comparing against 16 bits, but you see no reduction in score at 6 bits, and it starts falling below that. https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs/uns...
Anyhow, like anything in machine learning, if you want to be certain you probably need to run your own evals. But when researching, I found enough evidence that down to 6 bit quants you lose very little performance, and that even at much smaller quants, all the way down to 2 bits, the number of parameters tends to matter more than the quantization, so it works as a good rule of thumb. I'll generally grab a 6 to 8 bit quant to save on RAM without really thinking about it, and I'll try models down to 2 bits if that's what it takes to fit them on my system.
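For what it's worth, the back of the envelope math I use when picking a quant is just parameters times bits per weight. The function name and the 5% overhead for scales here are my own rough guesses, and it ignores KV cache and activations:

  def quant_size_gb(params_billion, bits_per_weight, overhead=1.05):
      # Weight memory only; KV cache and activations come on top of this.
      return params_billion * 1e9 * bits_per_weight / 8 / 1e9 * overhead

  for bits in (16, 8, 6, 4, 2):
      print(f"{bits:>2} bit: ~{quant_size_gb(70, bits):.0f} GB for a 70B model")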
This isn't the paper I was thinking of, but it shows a similar trend to the one I was looking at. In this particular case, even 5 bits showed no measurable reduction in performance (actually a slight increase, but that probably just means you're within the noise of what the test can distinguish), and then performance drops off rapidly across the various 3 bit quants: https://arxiv.org/pdf/2601.14277
There was another paper that ran a similar test, but across several models in a family and all the way down to 1 bit; only at 1 bit did a model cross over to performing worse than the next smaller model. But yeah, I'm having a hard time finding that paper again.
Why do you think ChatGPT doesn't use a quant? GPT-OSS, which OpenAI released as open weights, uses a 4 bit quant, which is in some ways a sweet spot: it loses a small amount of performance in exchange for a very large reduction in memory usage compared to something like fp16. I think it's perfectly reasonable to expect that ChatGPT uses the same technique, but we don't know, because their SOTA models aren't open.
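To put rough numbers on that: for a hypothetical 100B parameter model, fp16 weights are 100e9 x 2 bytes, about 200 GB, while a 4 bit quant is around 100e9 x 0.5 bytes, about 50 GB plus a little overhead for the scales. That's roughly a 4x reduction in weight memory.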
...this can't be literally true or no one (including e.g. OpenAI) would use > 6 bits, right?