Quantisation converts model weights from higher precision (FP16, BF16) to lower precision (INT8, INT4). Q4_K_M, a 4-bit llama.cpp k-quant that keeps the most sensitive tensors at higher precision, is the dominant scheme for local inference in 2026. A Llama 3.3 70B model that needs roughly 140 GB of weight memory at FP16 fits in 35-40 GB at Q4. Quality loss is real but typically manageable for agent workloads.
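To make the arithmetic concrete, here is a minimal Python sketch (using NumPy) of generic blockwise symmetric 4-bit quantisation, plus the raw-footprint calculation behind the 140 GB and 35-40 GB figures. The function names, the group size of 32, and the ~4.5 effective bits per weight are illustrative assumptions, not the actual Q4_K_M layout, which packs weights into super-blocks with quantised scales and mixes in higher-precision tensors.

```python
import numpy as np

def quantise_q4_symmetric(w: np.ndarray, group_size: int = 32):
    """Generic blockwise symmetric 4-bit quantisation (not the GGUF layout):
    each group shares one FP16 scale; weights become integers in [-8, 7]."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0  # avoid division by zero on all-zero groups
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantise(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct approximate FP32 weights from 4-bit codes and scales."""
    return (q.astype(np.float32) * scale.astype(np.float32)).ravel()

def footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring format metadata."""
    return n_params * bits_per_weight / 8 / 1e9

# Round-trip error on random weights
rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, s = quantise_q4_symmetric(w)
print(f"mean abs round-trip error: {np.abs(w - dequantise(q, s)).mean():.4f}")

# The footprint arithmetic: 70e9 params (assumed) at various bit widths
for label, bpw in [("FP16", 16.0), ("INT8", 8.0), ("Q4 (~4.5 bpw w/ scales)", 4.5)]:
    print(f"{label:>24}: {footprint_gb(70e9, bpw):6.1f} GB")
```

The scales are why a "4-bit" model lands above the naive 35 GB: each group of 32 weights carries an extra FP16 scale, which alone adds 0.5 bits per weight before any other format overhead.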