The Quantize Command
# Quantize FP16 → Q4_K_M
$ ./build/bin/llama-quantize \
model-f16.gguf \
model-Q4_K_M.gguf \
Q4_K_M
# Other common targets:
$ ./build/bin/llama-quantize \
model-f16.gguf model-Q5_K_M.gguf Q5_K_M
$ ./build/bin/llama-quantize \
model-f16.gguf model-Q8_0.gguf Q8_0
# Takes 2-10 minutes depending on
# model size and your CPU speed.
Available Quantization Types
Type     Bits  Use Case
Q2_K     2.6   Extreme compression (severe quality loss)
Q3_K_S   3.4   Very small, low quality
Q3_K_M   3.9   Small, acceptable quality
Q4_K_S   4.3   Good balance, smaller
Q4_K_M   4.5   ← Recommended default
Q5_K_S   4.9   Quality-focused, slightly smaller
Q5_K_M   5.1   ← Quality-focused
Q6_K     6.6   High quality
Q8_0     8.5   ← Near-lossless
F16      16.0  Half precision (large)
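The "Bits" column lets you estimate output file size before quantizing: parameters × bits-per-weight ÷ 8. A minimal sketch, using the effective bits-per-weight values from the table above (an approximation: real GGUF files also carry metadata and per-block scales, so expect a few percent of variance):

```python
# Back-of-the-envelope GGUF size estimate: parameters * bits-per-weight / 8.
# BPW values are the effective averages from the table above.
BPW = {"Q4_K_M": 4.5, "Q5_K_M": 5.1, "Q8_0": 8.5, "F16": 16.0}

def est_size_gb(n_params: float, qtype: str) -> float:
    """Estimated file size in GB for a model with n_params weights."""
    return n_params * BPW[qtype] / 8 / 1e9

for q, bpw in BPW.items():
    print(f"7B @ {q:6s} ({bpw:4.1f} bpw): ~{est_size_gb(7e9, q):.1f} GB")
```

For a 7B model this works out to roughly 3.9 GB at Q4_K_M versus 14 GB at F16, which is why Q4_K_M is the usual default for consumer hardware.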
Importance Matrix (imatrix)
Advanced: for Q3 and Q4 quantization, you can generate an “importance matrix” from a calibration dataset. It tells the quantizer which weights matter most so they are kept at higher precision, which improves quality at low bit depths.
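As a sketch of that workflow, assuming a recent llama.cpp build that ships the llama-imatrix tool and the --imatrix flag on llama-quantize (check --help on your build; the calibration filename here is illustrative):

```shell
# 1. Generate the importance matrix from a calibration text file
$ ./build/bin/llama-imatrix \
    -m model-f16.gguf \
    -f calibration.txt \
    -o imatrix.dat

# 2. Quantize, letting the matrix guide which weights keep precision
$ ./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```

The calibration text should resemble the prompts you plan to run; a few hundred KB of representative text is typically enough.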
Key insight: Quantization is a one-time process. Convert once, use forever. The output GGUF file can be used with Ollama (ollama create with a Modelfile pointing to it), llama.cpp directly, LM Studio, or any GGUF-compatible tool. One file, runs everywhere.
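For instance, a minimal Ollama setup might look like this (the model name my-model is illustrative; the FROM path must point at your quantized file):

```shell
# Create a Modelfile that points at the quantized GGUF
$ echo 'FROM ./model-Q4_K_M.gguf' > Modelfile

# Register it with Ollama and run it
$ ollama create my-model -f Modelfile
$ ollama run my-model
```

The same .gguf file can be loaded unchanged by llama.cpp's own binaries or LM Studio; no per-tool conversion is needed.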