Ch 6 — llama.cpp & GGUF Deep Dive

The C++ engine under the hood — convert, quantize, serve, and tune
Hands-On
What Is It → Convert → Quantize → CLI Run → Server Mode → Perf Tuning → Engines → When to Use
What Is llama.cpp?
The C++ inference engine that powers almost all local AI
The Origin Story
In March 2023, Georgi Gerganov created llama.cpp — a pure C/C++ implementation of Meta’s LLaMA model inference. The goal: run LLMs on consumer hardware without Python, PyTorch, or CUDA dependencies.

It worked. Within weeks, people were running 7B models on MacBook Airs. The project exploded — it now has 75,000+ GitHub stars and supports virtually every open-weight model architecture.
Why It Matters
llama.cpp powers:
✓ Ollama (Ch 5)  ✓ LM Studio  ✓ GPT4All
✓ Jan  ✓ Kobold.cpp  ✓ Text Generation WebUI

Key innovations:
✓ GGUF format (Ch 3)
✓ K-quant quantization methods
✓ Metal (Apple GPU) acceleration
✓ CUDA, Vulkan, SYCL GPU support
✓ Flash attention implementation
✓ Speculative decoding support

Almost every local AI tool is either built on llama.cpp or inspired by it.
Key insight: llama.cpp is the foundation of the local AI ecosystem. Ollama is a user-friendly wrapper around it. When you need more control than Ollama provides — custom quantization, specific GPU layer allocation, server tuning — you go directly to llama.cpp.
Converting Models to GGUF
Taking a Hugging Face model and making it llama.cpp-compatible
When You Need to Convert
Most popular models already have GGUF versions on Hugging Face (uploaded by the community, especially by users like “bartowski” and “TheBloke”). You only need to convert yourself when:

• You fine-tuned a model and want to run it locally
• A new model just released and no GGUF exists yet
• You want a specific quantization not available
The Conversion Pipeline
# 1. Clone llama.cpp
$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp

# 2. Build
$ cmake -B build
$ cmake --build build --config Release

# 3. Install Python dependencies
$ pip install -r requirements.txt

# 4. Convert HF model to GGUF (FP16)
$ python convert_hf_to_gguf.py \
    /path/to/hf-model/ \
    --outfile model-f16.gguf \
    --outtype f16
What the Converter Does
Input: Hugging Face model directory
├── config.json
├── tokenizer.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
└── ...

Output: Single GGUF file
model-f16.gguf
├── Architecture metadata
├── Tokenizer (embedded)
├── Chat template (embedded)
└── All weights (FP16)

The converter reads the HF format, maps tensor names to llama.cpp's internal naming, embeds the tokenizer, and writes everything to one file.
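The single-file layout is easy to verify by hand: per the published GGUF specification, every file opens with a small fixed header — the magic bytes "GGUF", a format version, a tensor count, and a metadata key-value count, all little-endian. A minimal sketch of a header parser (the example bytes at the bottom are fabricated for illustration, not from a real model):

```python
import struct

def read_gguf_header(data: bytes):
    """Parse the fixed GGUF header that precedes metadata and tensors.

    Layout (little-endian): 4-byte magic "GGUF", uint32 version,
    uint64 tensor count, uint64 metadata key-value count.
    """
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return version, n_tensors, n_kv

# Fabricated example header: version 3, 291 tensors, 24 metadata keys
header = b"GGUF" + struct.pack("<IQQ", 3, 291, 24)
print(read_gguf_header(header))  # → (3, 291, 24)
```

The metadata key-value section that follows this header is where the architecture, tokenizer, and chat template live — which is why one GGUF file is self-contained.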
Key insight: The conversion step produces an FP16 GGUF — still large but now in the right format. The next step (quantization) is where the real size reduction happens. Think of conversion as “translating the language” and quantization as “compressing the file.”
Quantizing Your Own Models
From FP16 GGUF to Q4_K_M — the hands-on process
The Quantize Command
# Quantize FP16 → Q4_K_M
$ ./build/bin/llama-quantize \
    model-f16.gguf \
    model-Q4_K_M.gguf \
    Q4_K_M

# Other common targets:
$ ./build/bin/llama-quantize \
    model-f16.gguf model-Q5_K_M.gguf Q5_K_M
$ ./build/bin/llama-quantize \
    model-f16.gguf model-Q8_0.gguf Q8_0

# Takes 2-10 minutes depending on
# model size and your CPU speed.
Available Quantization Types
Type     Bits  Use Case
Q2_K     2.6   Extreme compression (bad)
Q3_K_S   3.4   Very small, low quality
Q3_K_M   3.9   Small, acceptable
Q4_K_S   4.3   Good balance, smaller
Q4_K_M   4.5   ← Recommended default
Q5_K_S   4.9   Quality, slightly smaller
Q5_K_M   5.1   ← Quality-focused
Q6_K     6.6   High quality
Q8_0     8.5   ← Near-lossless
F16      16.0  Half precision (large)
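The "Bits" column is average bits per weight, which makes file sizes easy to estimate: parameters × bits-per-weight ÷ 8. A quick sketch — the results are approximations that ignore the small metadata overhead of a real GGUF file:

```python
def gguf_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size in GB: params x bits / 8 bits-per-byte.

    The 1e9 params-per-billion and 1e9 bytes-per-GB factors cancel out.
    """
    return n_params_billion * bits_per_weight / 8

# A 7B model at the quantization levels above (approximate):
for name, bpw in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.5), ("Q2_K", 2.6)]:
    print(f"{name:7s} ~{gguf_size_gb(7, bpw):.1f} GB")
```

Running this shows why Q4_K_M is the sweet spot: a 7B model drops from ~14 GB at F16 to under 4 GB, fitting comfortably in consumer VRAM.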
Importance Matrix (imatrix)
Advanced: For Q3 and Q4, you can generate an “importance matrix” from a calibration dataset. This tells the quantizer which weights are most important, preserving them at higher precision. Improves quality at low bit depths.
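In llama.cpp this is a two-step process: measure importance with the llama-imatrix tool over a calibration text file, then feed the result to llama-quantize. A sketch — the file names are placeholders, and exact flags can vary between llama.cpp versions:

```shell
# 1. Measure weight importance over a calibration dataset
./build/bin/llama-imatrix \
    -m model-f16.gguf \
    -f calibration.txt \
    -o imatrix.dat

# 2. Quantize with the importance matrix applied
./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```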
Key insight: Quantization is a one-time process. Convert once, use forever. The output GGUF file can be used with Ollama (ollama create with a Modelfile pointing to it), llama.cpp directly, LM Studio, or any GGUF-compatible tool. One file, runs everywhere.
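For example, wiring a freshly quantized file into Ollama takes a one-line Modelfile — the FROM directive points at the local GGUF (the file path and model name here are placeholders):

```shell
# Minimal Modelfile: FROM is the only required directive
printf 'FROM ./model-Q4_K_M.gguf\n' > Modelfile

# Register the file under a name of your choosing, then run it
ollama create my-model -f Modelfile
ollama run my-model
```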
Running Models with llama-cli
Direct inference without Ollama — maximum control
Interactive Chat
# Interactive chat mode
$ ./build/bin/llama-cli \
    -m model-Q4_K_M.gguf \
    -c 4096 \
    -ngl 99 \
    --chat-template chatml \
    -cnv

# Flags:
#   -m     model file path
#   -c     context size (tokens)
#   -ngl   GPU layers to offload (99 = all)
#   -cnv   conversation mode
#   --chat-template   format for chat
Single Prompt
# One-shot generation
$ ./build/bin/llama-cli \
    -m model-Q4_K_M.gguf \
    -p "Explain quantization in 2 sentences" \
    -n 100 \
    -ngl 99

#   -p   prompt text
#   -n   max tokens to generate
Key Parameters
-c 4096               Context window size
-ngl 99               GPU layers (99 = all)
-t 8                  CPU threads
-b 512                Batch size (prompt processing)
-ub 512               Micro-batch size
--temp 0.7            Temperature
--top-p 0.9           Nucleus sampling
--top-k 40            Top-K sampling
--repeat-penalty 1.1  Repetition penalty
-fa                   Flash attention (faster)
--mlock               Lock model in RAM
Key insight: The -ngl flag is the most important performance parameter. It controls how many transformer layers run on GPU vs CPU. Set it to 99 to offload everything to GPU. If you run out of VRAM, reduce it — llama.cpp will automatically split between GPU and CPU.
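A back-of-the-envelope way to pick -ngl when the model does not fully fit: divide free VRAM (minus headroom for the KV cache and compute buffers) by the per-layer size. The defaults below are rough assumptions for a 7B model (~200 MB/layer, 32 layers), not measurements — treat this as a starting point, then adjust empirically:

```python
def suggest_ngl(vram_gb: float, n_layers: int = 32,
                layer_mb: float = 200.0, headroom_gb: float = 1.0) -> int:
    """Rough -ngl suggestion: how many ~layer_mb transformer layers
    fit in VRAM after reserving headroom for KV cache and buffers."""
    free_mb = max(vram_gb - headroom_gb, 0) * 1024
    return min(n_layers, int(free_mb // layer_mb))

print(suggest_ngl(8.0))  # → 32  (8GB card: all layers of a 7B fit)
print(suggest_ngl(4.0))  # → 15  (4GB card: partial offload)
```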
Server Mode: OpenAI-Compatible API
Run llama.cpp as an API server — drop-in replacement for OpenAI
Start the Server
# Start llama.cpp server
$ ./build/bin/llama-server \
    -m model-Q4_K_M.gguf \
    -c 4096 \
    -ngl 99 \
    --host 0.0.0.0 \
    --port 8080 \
    -fa

# Server starts on http://localhost:8080
# OpenAI-compatible endpoints:
#   /v1/chat/completions
#   /v1/completions
#   /v1/embeddings
#   /v1/models
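You can exercise the endpoints with plain curl — the request body follows the OpenAI chat-completions schema, and the model name is arbitrary for a single-model server:

```shell
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "local-model",
      "messages": [{"role": "user", "content": "What is GGUF?"}],
      "temperature": 0.7
    }'
```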
Use with OpenAI Client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "user", "content": "What is GGUF?"}
    ],
    temperature=0.7
)

print(response.choices[0].message.content)
Key insight: llama.cpp’s server mode speaks the OpenAI API protocol. Any code, library, or tool that works with OpenAI’s API works with llama.cpp by changing one line: the base URL. This is the foundation for building local AI applications (Chapter 7).
Performance Tuning
Squeezing maximum tokens/sec from your hardware
The Big Three Parameters
1. GPU Layers (-ngl)
   More layers on GPU = faster
   Start with 99, reduce if OOM
   Each layer: ~100-300MB VRAM (7B model)

2. Context Size (-c)
   Larger context = more RAM
   4K: ~0.5GB overhead
   8K: ~1.0GB overhead
   32K: ~3.0GB overhead
   Use the smallest context that works

3. Batch Size (-b)
   Affects prompt processing speed
   Default 512 is usually good
   Increase for long prompts
   Decrease if running out of memory
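The context-overhead figures come from the KV cache, whose size is predictable: 2 (K and V) × layers × context × KV heads × head dimension × bytes per element. A sketch using typical 7B-class dimensions with grouped-query attention — the 32-layer, 8-KV-head, 128-dim numbers are assumptions, and real models vary:

```python
def kv_cache_bytes(n_layers, ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size: a K and a V tensor per layer, held at full context."""
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

# 7B-class GQA model: 32 layers, 8 KV heads, head_dim 128, FP16 cache
for ctx in (4096, 8192, 32768):
    gb = kv_cache_bytes(32, ctx, 8, 128) / 2**30
    print(f"{ctx:>6} ctx: {gb:.1f} GB")
```

The formula also shows why the KV cache quantization flags below help: dropping the cache from FP16 (2 bytes) to q8_0 roughly halves every number this prints.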
Advanced Tuning
Flash Attention (-fa)
  Reduces memory usage for attention
  Enables larger context windows
  ~10-20% faster on supported hardware

Thread Count (-t)
  CPU threads for computation
  Default: auto-detect
  Rule: physical cores, not logical
  M2 Pro: -t 8 (not 12)

Memory Locking (--mlock)
  Prevents OS from swapping model to disk
  Important for consistent latency
  Requires sufficient RAM

KV Cache Quantization
  --cache-type-k q8_0
  --cache-type-v q8_0
  Reduces context memory by ~50%
Key insight: The biggest performance win is getting the model fully into GPU VRAM (-ngl 99). If that’s not possible, the second biggest win is flash attention (-fa) which reduces memory usage. KV cache quantization is the third lever — it lets you fit larger contexts without more RAM.
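Putting the levers together, a tuned server invocation might look like this — flag availability depends on your llama.cpp version and build options, so treat it as a starting point rather than a recipe:

```shell
./build/bin/llama-server \
    -m model-Q4_K_M.gguf \
    -c 8192 \
    -ngl 99 \
    -fa \
    -t 8 \
    --mlock \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --host 127.0.0.1 --port 8080
```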
Inference Engine Comparison
llama.cpp vs vLLM vs TGI — different tools for different jobs
The Three Engines
llama.cpp
  Language: C/C++
  Target: Consumer hardware
  GPU: CUDA, Metal, Vulkan
  Format: GGUF
  Strength: CPU inference, Apple Silicon, edge devices, single-user
  Best for: Local deployment

vLLM
  Language: Python + C++
  Target: Server/cloud GPU
  GPU: CUDA (NVIDIA only)
  Format: HF SafeTensors
  Strength: PagedAttention, high throughput, concurrent requests, batching
  Best for: Production API servers

TGI (Text Generation Inference)
  Language: Rust + Python
  Target: Server/cloud GPU
  GPU: CUDA (NVIDIA only)
  Format: HF SafeTensors
  Strength: Hugging Face ecosystem, tensor parallelism, Docker
  Best for: HF-integrated deployments
When to Use Each
llama.cpp / Ollama: Local development, single-user, Apple Silicon, edge devices, privacy-first. The default choice for this course.

vLLM: When you need to serve many concurrent users on NVIDIA GPUs. PagedAttention gives 2–4x throughput over naive serving. Production API servers.

TGI: When you’re already in the Hugging Face ecosystem and want Docker-based deployment with tensor parallelism across multiple GPUs.
Key insight: For local AI (the focus of this course), llama.cpp/Ollama is the right choice. vLLM and TGI are server-side tools for when you’re serving hundreds or thousands of concurrent users on dedicated GPU servers. Different tools, different scale.
Ollama vs llama.cpp: When to Use Which
The decision framework for choosing your tool
Use Ollama When
• You want to get started quickly
• Pre-quantized models are available
• You need model management (pull/rm)
• You want a simple REST API
• You're building applications (Ch 7)
• You want Modelfile customization
• You don't need fine-grained control

Ollama is the right choice 90% of the time. Start here.
Use llama.cpp Directly When
• You need to convert/quantize models
• You need specific GPU layer control
• You want KV cache quantization
• You need flash attention tuning
• You're embedding in a C/C++ app
• You need grammar-constrained output
• You want maximum performance tuning
• You're deploying to edge devices

llama.cpp is the power-user tool. Use it when Ollama isn't enough.
Key insight: Think of it as a progression: start with Ollama for simplicity, drop down to llama.cpp when you need more control. They use the same engine and the same GGUF files — switching is seamless. Next up: Chapter 7 puts this into practice by building actual applications.