Ch 6 — llama.cpp & GGUF Deep Dive

The C++ engine under the hood — convert, quantize, serve, and tune
Hands-On
What Is It → Convert → Quantize → CLI Run → Server Mode → Perf Tuning → Engines → When to Use
What Is llama.cpp?
The C++ inference engine that powers almost all local AI
The Origin Story
In March 2023, Georgi Gerganov created llama.cpp — a pure C/C++ implementation of Meta’s LLaMA model inference. The goal: run LLMs on consumer hardware without Python, PyTorch, or CUDA dependencies.

It worked. Within weeks, people were running 7B models on MacBook Airs. The project exploded — it now has 75,000+ GitHub stars and supports virtually every open-weight model architecture.
Why It Matters
llama.cpp powers:
✓ Ollama (Ch 5)  ✓ LM Studio  ✓ GPT4All
✓ Jan  ✓ Kobold.cpp  ✓ Text Generation WebUI

Key innovations:
✓ GGUF format (Ch 3)
✓ K-quant quantization methods
✓ Metal (Apple GPU) acceleration
✓ CUDA, Vulkan, SYCL GPU support
✓ Flash attention implementation
✓ Speculative decoding support

Almost every local AI tool is either built on llama.cpp or inspired by it.
Key insight: llama.cpp is the foundation of the local AI ecosystem. Ollama is a user-friendly wrapper around it. When you need more control than Ollama provides — custom quantization, specific GPU layer allocation, server tuning — you go directly to llama.cpp.
Converting Models to GGUF
Taking a Hugging Face model and making it llama.cpp-compatible
When You Need to Convert
Most popular models already have GGUF versions on Hugging Face (uploaded by the community, especially by users like “bartowski” and “TheBloke”). You only need to convert yourself when:

• You fine-tuned a model and want to run it locally
• A new model just released and no GGUF exists yet
• You want a specific quantization not available
The Conversion Pipeline
# 1. Clone llama.cpp
$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp

# 2. Build
$ cmake -B build
$ cmake --build build --config Release

# 3. Install Python dependencies
$ pip install -r requirements.txt

# 4. Convert HF model to GGUF (FP16)
$ python convert_hf_to_gguf.py \
    /path/to/hf-model/ \
    --outfile model-f16.gguf \
    --outtype f16
What the Converter Does
Input: Hugging Face model directory
├── config.json
├── tokenizer.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
└── ...

Output: Single GGUF file
model-f16.gguf
├── Architecture metadata
├── Tokenizer (embedded)
├── Chat template (embedded)
└── All weights (FP16)

The converter reads the HF format, maps tensor names to llama.cpp's internal naming, embeds the tokenizer, and writes everything to one file.
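The single-file layout is easy to verify by hand: per the published GGUF specification, every file opens with a small fixed header — the magic bytes "GGUF", a format version, a tensor count, and a metadata key-value count, all little-endian. A minimal sketch of a header parser (the example bytes at the bottom are fabricated for illustration, not from a real model):

```python
import struct

def read_gguf_header(data: bytes):
    """Parse the fixed GGUF header that precedes metadata and tensors.

    Layout (little-endian): 4-byte magic "GGUF", uint32 version,
    uint64 tensor count, uint64 metadata key-value count.
    """
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return version, n_tensors, n_kv

# Fabricated example header: version 3, 291 tensors, 24 metadata keys
header = b"GGUF" + struct.pack("<IQQ", 3, 291, 24)
print(read_gguf_header(header))  # → (3, 291, 24)
```

The metadata key-value section that follows this header is where the architecture, tokenizer, and chat template live — which is why one GGUF file is self-contained.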
Key insight: The conversion step produces an FP16 GGUF — still large but now in the right format. The next step (quantization) is where the real size reduction happens. Think of conversion as “translating the language” and quantization as “compressing the file.”
Quantizing Your Own Models
From FP16 GGUF to Q4_K_M — the hands-on process
The Quantize Command
# Quantize FP16 → Q4_K_M
$ ./build/bin/llama-quantize \
    model-f16.gguf \
    model-Q4_K_M.gguf \
    Q4_K_M

# Other common targets:
$ ./build/bin/llama-quantize \
    model-f16.gguf model-Q5_K_M.gguf Q5_K_M
$ ./build/bin/llama-quantize \
    model-f16.gguf model-Q8_0.gguf Q8_0

# Takes 2-10 minutes depending on
# model size and your CPU speed.
Available Quantization Types
Type     Bits  Use Case
Q2_K     2.6   Extreme compression (bad)
Q3_K_S   3.4   Very small, low quality
Q3_K_M   3.9   Small, acceptable
Q4_K_S   4.3   Good balance, smaller
Q4_K_M   4.5   ← Recommended default
Q5_K_S   4.9   Quality, slightly smaller
Q5_K_M   5.1   ← Quality-focused
Q6_K     6.6   High quality
Q8_0     8.5   ← Near-lossless
F16      16.0  Half precision (large)
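The "Bits" column is average bits per weight, which makes file sizes easy to estimate: parameters × bits-per-weight ÷ 8. A quick sketch — the results are approximations that ignore the small metadata overhead of a real GGUF file:

```python
def gguf_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size in GB: params x bits / 8 bits-per-byte.

    The 1e9 params-per-billion and 1e9 bytes-per-GB factors cancel out.
    """
    return n_params_billion * bits_per_weight / 8

# A 7B model at the quantization levels above (approximate):
for name, bpw in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.5), ("Q2_K", 2.6)]:
    print(f"{name:7s} ~{gguf_size_gb(7, bpw):.1f} GB")
```

Running this shows why Q4_K_M is the sweet spot: a 7B model drops from ~14 GB at F16 to under 4 GB, fitting comfortably in consumer VRAM.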
Importance Matrix (imatrix)
Advanced: For Q3 and Q4, you can generate an “importance matrix” from a calibration dataset. This tells the quantizer which weights are most important, preserving them at higher precision. Improves quality at low bit depths.
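In llama.cpp this is a two-step process: measure importance with the llama-imatrix tool over a calibration text file, then feed the result to llama-quantize. A sketch — the file names are placeholders, and exact flags can vary between llama.cpp versions:

```shell
# 1. Measure weight importance over a calibration dataset
./build/bin/llama-imatrix \
    -m model-f16.gguf \
    -f calibration.txt \
    -o imatrix.dat

# 2. Quantize with the importance matrix applied
./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```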
Key insight: Quantization is a one-time process. Convert once, use forever. The output GGUF file can be used with Ollama (ollama create with a Modelfile pointing to it), llama.cpp directly, LM Studio, or any GGUF-compatible tool. One file, runs everywhere.
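For example, wiring a freshly quantized file into Ollama takes a one-line Modelfile — the FROM directive points at the local GGUF (the file path and model name here are placeholders):

```shell
# Minimal Modelfile: FROM is the only required directive
printf 'FROM ./model-Q4_K_M.gguf\n' > Modelfile

# Register the file under a name of your choosing, then run it
ollama create my-model -f Modelfile
ollama run my-model
```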
Running Models with llama-cli
Direct inference without Ollama — maximum control
Interactive Chat
# Interactive chat mode
$ ./build/bin/llama-cli \
    -m model-Q4_K_M.gguf \
    -c 4096 \
    -ngl 99 \
    --chat-template chatml \
    -cnv

# Flags:
#   -m     model file path
#   -c     context size (tokens)
#   -ngl   GPU layers to offload (99 = all)
#   -cnv   conversation mode
#   --chat-template   format for chat
Single Prompt
# One-shot generation
$ ./build/bin/llama-cli \
    -m model-Q4_K_M.gguf \
    -p "Explain quantization in 2 sentences" \
    -n 100 \
    -ngl 99

#   -p   prompt text
#   -n   max tokens to generate
Key Parameters
-c 4096               Context window size
-ngl 99               GPU layers (99 = all)
-t 8                  CPU threads
-b 512                Batch size (prompt processing)
-ub 512               Micro-batch size
--temp 0.7            Temperature
--top-p 0.9           Nucleus sampling
--top-k 40            Top-K sampling
--repeat-penalty 1.1  Repetition penalty
-fa                   Flash attention (faster)
--mlock               Lock model in RAM
Key insight: The -ngl flag is the most important performance parameter. It controls how many transformer layers run on GPU vs CPU. Set it to 99 to offload everything to GPU. If you run out of VRAM, reduce it — llama.cpp will automatically split between GPU and CPU.
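A back-of-the-envelope way to pick -ngl when the model does not fully fit: divide free VRAM (minus headroom for the KV cache and compute buffers) by the per-layer size. The defaults below are rough assumptions for a 7B model (~200 MB/layer, 32 layers), not measurements — treat this as a starting point, then adjust empirically:

```python
def suggest_ngl(vram_gb: float, n_layers: int = 32,
                layer_mb: float = 200.0, headroom_gb: float = 1.0) -> int:
    """Rough -ngl suggestion: how many ~layer_mb transformer layers
    fit in VRAM after reserving headroom for KV cache and buffers."""
    free_mb = max(vram_gb - headroom_gb, 0) * 1024
    return min(n_layers, int(free_mb // layer_mb))

print(suggest_ngl(8.0))  # → 32  (8GB card: all layers of a 7B fit)
print(suggest_ngl(4.0))  # → 15  (4GB card: partial offload)
```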
Server Mode: OpenAI-Compatible API
Run llama.cpp as an API server — drop-in replacement for OpenAI
Start the Server
# Start llama.cpp server
$ ./build/bin/llama-server \
    -m model-Q4_K_M.gguf \
    -c 4096 \
    -ngl 99 \
    --host 0.0.0.0 \
    --port 8080 \
    -fa

# Server starts on http://localhost:8080
# OpenAI-compatible endpoints:
#   /v1/chat/completions
#   /v1/completions
#   /v1/embeddings
#   /v1/models
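You can exercise the endpoints with plain curl — the request body follows the OpenAI chat-completions schema, and the model name is arbitrary for a single-model server:

```shell
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "local-model",
      "messages": [{"role": "user", "content": "What is GGUF?"}],
      "temperature": 0.7
    }'
```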
Use with OpenAI Client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "user", "content": "What is GGUF?"}
    ],
    temperature=0.7
)

print(response.choices[0].message.content)
Key insight: llama.cpp’s server mode speaks the OpenAI API protocol. Any code, library, or tool that works with OpenAI’s API works with llama.cpp by changing one line: the base URL. This is the foundation for building local AI applications (Chapter 7).
Performance Tuning
Squeezing maximum tokens/sec from your hardware
The Big Three Parameters
1. GPU Layers (-ngl)
   More layers on GPU = faster
   Start with 99, reduce if OOM
   Each layer: ~100-300MB VRAM (7B model)

2. Context Size (-c)
   Larger context = more RAM
   4K: ~0.5GB overhead
   8K: ~1.0GB overhead
   32K: ~3.0GB overhead
   Use the smallest context that works

3. Batch Size (-b)
   Affects prompt processing speed
   Default 512 is usually good
   Increase for long prompts
   Decrease if running out of memory
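The context-overhead figures come from the KV cache, whose size is predictable: 2 (K and V) × layers × context × KV heads × head dimension × bytes per element. A sketch using typical 7B-class dimensions with grouped-query attention — the 32-layer, 8-KV-head, 128-dim numbers are assumptions, and real models vary:

```python
def kv_cache_bytes(n_layers, ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size: a K and a V tensor per layer, held at full context."""
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

# 7B-class GQA model: 32 layers, 8 KV heads, head_dim 128, FP16 cache
for ctx in (4096, 8192, 32768):
    gb = kv_cache_bytes(32, ctx, 8, 128) / 2**30
    print(f"{ctx:>6} ctx: {gb:.1f} GB")
```

The formula also shows why the KV cache quantization flags below help: dropping the cache from FP16 (2 bytes) to q8_0 roughly halves every number this prints.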
Advanced Tuning
Flash Attention (-fa)
  Reduces memory usage for attention
  Enables larger context windows
  ~10-20% faster on supported hardware

Thread Count (-t)
  CPU threads for computation
  Default: auto-detect
  Rule: physical cores, not logical
  M2 Pro: -t 8 (not 12)

Memory Locking (--mlock)
  Prevents OS from swapping model to disk
  Important for consistent latency
  Requires sufficient RAM

KV Cache Quantization
  --cache-type-k q8_0
  --cache-type-v q8_0
  Reduces context memory by ~50%
Key insight: The biggest performance win is getting the model fully into GPU VRAM (-ngl 99). If that’s not possible, the second biggest win is flash attention (-fa) which reduces memory usage. KV cache quantization is the third lever — it lets you fit larger contexts without more RAM.
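Putting the levers together, a tuned server invocation might look like this — flag availability depends on your llama.cpp version and build options, so treat it as a starting point rather than a recipe:

```shell
./build/bin/llama-server \
    -m model-Q4_K_M.gguf \
    -c 8192 \
    -ngl 99 \
    -fa \
    -t 8 \
    --mlock \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --host 127.0.0.1 --port 8080
```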
Inference Engine Comparison
llama.cpp vs vLLM vs TGI — different tools for different jobs
The Three Engines
llama.cpp
  Language: C/C++
  Target: Consumer hardware
  GPU: CUDA, Metal, Vulkan
  Format: GGUF
  Strength: CPU inference, Apple Silicon, edge devices, single-user
  Best for: Local deployment

vLLM
  Language: Python + C++
  Target: Server/cloud GPU
  GPU: CUDA (NVIDIA only)
  Format: HF SafeTensors
  Strength: PagedAttention, high throughput, concurrent requests, batching
  Best for: Production API servers

TGI (Text Generation Inference)
  Language: Rust + Python
  Target: Server/cloud GPU
  GPU: CUDA (NVIDIA only)
  Format: HF SafeTensors
  Strength: Hugging Face ecosystem, tensor parallelism, Docker
  Best for: HF-integrated deployments
When to Use Each
llama.cpp / Ollama: Local development, single-user, Apple Silicon, edge devices, privacy-first. The default choice for this course.

vLLM: When you need to serve many concurrent users on NVIDIA GPUs. PagedAttention gives 2–4x throughput over naive serving. Production API servers.

TGI: When you’re already in the Hugging Face ecosystem and want Docker-based deployment with tensor parallelism across multiple GPUs.
Key insight: For local AI (the focus of this course), llama.cpp/Ollama is the right choice. vLLM and TGI are server-side tools for when you’re serving hundreds or thousands of concurrent users on dedicated GPU servers. Different tools, different scale.
Ollama vs llama.cpp: When to Use Which
The decision framework for choosing your tool
Use Ollama When
• You want to get started quickly
• Pre-quantized models are available
• You need model management (pull/rm)
• You want a simple REST API
• You're building applications (Ch 7)
• You want Modelfile customization
• You don't need fine-grained control

Ollama is the right choice 90% of the time. Start here.
Use llama.cpp Directly When
• You need to convert/quantize models
• You need specific GPU layer control
• You want KV cache quantization
• You need flash attention tuning
• You're embedding in a C/C++ app
• You need grammar-constrained output
• You want maximum performance tuning
• You're deploying to edge devices

llama.cpp is the power-user tool. Use it when Ollama isn't enough.
Key insight: Think of it as a progression: start with Ollama for simplicity, drop down to llama.cpp when you need more control. They use the same engine and the same GGUF files — switching is seamless. Next up: Chapter 7 puts this into practice by building actual applications.