PocketClaw · vol. 1 · 2026

Quantisation

Reducing the numerical precision of model weights to shrink memory footprint and speed up inference, at some cost to output quality.

Quantisation converts model weights from a higher-precision format (FP16, BF16) to a lower-precision one (INT8, INT4). Q4_K_M, a 4-bit scheme that mixes precisions across tensors, is the dominant format for local inference in 2026. A Llama 3.3 70B model that needs 140 GB at FP16 fits in 35-40 GB at Q4. The quality loss is real but typically manageable for agent workloads.
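
To make the mechanics concrete, here is a minimal sketch of block-wise 4-bit quantisation in Python with NumPy. It is not the actual Q4_K_M layout (which uses super-blocks with quantised scales and per-block minimums, and packs two 4-bit values per byte); the function names, the 32-element block size, and the symmetric rounding scheme are illustrative assumptions showing the core idea: each small block of weights shares one scale factor, and each weight is rounded to a 4-bit integer.

```python
import numpy as np

BLOCK = 32  # illustrative block size; one shared scale per block

def quantise_q4(weights: np.ndarray):
    """Block-wise symmetric 4-bit quantisation: one FP16 scale per block."""
    blocks = weights.reshape(-1, BLOCK)
    # Map the largest magnitude in each block onto the int4 range [-8, 7]
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantise_q4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate FP32 weights from 4-bit values and block scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, s = quantise_q4(w)
w_hat = dequantise_q4(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```

The storage arithmetic follows directly: 4 bits per weight plus one FP16 scale per 32 weights is roughly 4.5 bits per weight, so 70B parameters come out near 40 GB, matching the Q4 figure above.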

Related terms

Local LLM · Ollama

Found a definition that's wrong, dated, or that could be sharper? Email us — we update with attribution unless you'd rather we didn't.