Ch 5 — Ollama: Your Local AI Runtime

Install, pull, run — from zero to local AI in 5 minutes
What Is Ollama?
The Docker of local AI — one tool to pull, run, and manage models
Ollama in One Sentence
Ollama is a tool that lets you download and run LLMs locally with a single command. It handles model downloading, GGUF format management, GPU/CPU allocation, and serves an API — all automatically.

Think of it as Docker for AI models: docker pull nginx becomes ollama pull llama3.2. docker run becomes ollama run. Same simplicity, different domain.
Architecture
Ollama
├── CLI (command-line interface)
│     ollama run, pull, list, rm, serve
├── Server (background daemon)
│     Listens on localhost:11434
│     REST API for programmatic access
├── Model Registry (ollama.com/library)
│     Pre-quantized models ready to pull
└── Inference Engine (llama.cpp)
      The actual C++ code that runs models
      Handles GPU offloading, KV cache, etc.

Ollama is a wrapper around llama.cpp that adds model management, an API, and a much simpler user experience.
Key insight: Ollama’s genius is abstraction. You don’t need to know about GGUF formats, quantization levels, GPU layers, or context sizes. Pull a model, run it. Ollama figures out the rest. When you need more control, Chapter 6 covers llama.cpp directly.
Installation & First Model
From zero to running a local LLM in under 5 minutes
Install
macOS:
  brew install ollama   # or download from ollama.com

Linux:
  curl -fsSL https://ollama.com/install.sh | sh

Windows:
  # Download installer from ollama.com

Docker:
  docker run -d -v ollama:/root/.ollama \
    -p 11434:11434 --name ollama \
    ollama/ollama
Your First Model
# Pull and run in one command:
$ ollama run llama3.2

# First time: downloads ~2GB model
# Subsequent: starts instantly

>>> What is the capital of France?
The capital of France is Paris. It is the largest city in France
and serves as the country's political, economic, and cultural
center...

>>> /bye   # Exit the chat
What Just Happened
1. Ollama downloaded the Llama 3.2 3B model (Q4_K_M quantization) from its registry
2. Loaded it into RAM (or VRAM if GPU available)
3. Started an interactive chat session
4. All inference happened locally — no data left your machine
Key insight: That’s it. One command. No API keys, no accounts, no configuration. The model runs entirely on your hardware. Every prompt and response stays on your machine. This is the power of local AI.
Running Models: Commands & Options
Pull, run, and control models from the command line
Essential Commands
# Pull a model (download only)
$ ollama pull qwen2.5:7b

# Run interactively
$ ollama run qwen2.5:7b

# Run with a specific prompt
$ ollama run qwen2.5:7b "Summarize this"

# Pipe input
$ cat document.txt | ollama run qwen2.5:7b \
    "Summarize this document in 3 bullets"

# List downloaded models
$ ollama list
NAME              SIZE    MODIFIED
llama3.2:latest   2.0 GB  2 hours ago
qwen2.5:7b        4.4 GB  5 min ago

# Remove a model
$ ollama rm llama3.2

# Show model details
$ ollama show qwen2.5:7b
Model Naming
Format: name:tag

ollama run llama3.2        → llama3.2:latest (default tag)
ollama run llama3.2:1b     → 1B parameter variant
ollama run qwen2.5:7b      → 7B parameter variant
ollama run mistral-small   → Mistral Small 3.1 (24B)
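The name:tag convention is easy to handle in code too. Here is a tiny hypothetical helper (not part of the ollama library) that splits a reference the way the CLI does, defaulting a missing tag to latest:

```python
def parse_model_ref(ref):
    """Split a model reference into (name, tag); a missing tag
    defaults to "latest", matching how `ollama run llama3.2`
    resolves to llama3.2:latest."""
    name, _, tag = ref.partition(":")
    return name, tag or "latest"

print(parse_model_ref("llama3.2"))    # -> ('llama3.2', 'latest')
print(parse_model_ref("qwen2.5:7b"))  # -> ('qwen2.5', '7b')
```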
Popular Models on Ollama
llama3.2 — Meta’s 3B, great all-rounder
qwen2.5:7b — Alibaba’s 7B, strong reasoning
gemma2:2b — Google’s 2B, tiny and fast
phi4-mini — Microsoft’s 3.8B, math/code
mistral-small — Mistral’s 24B, best in the medium class
codellama:7b — Code-specialized Llama
nomic-embed-text — Embeddings model
Key insight: Ollama’s model registry at ollama.com/library has hundreds of pre-quantized models ready to pull. You don’t need to find GGUF files on Hugging Face or quantize anything yourself. Just ollama pull model-name and go.
The Ollama API
REST endpoints for programmatic access — build apps, not just chat
Start the Server
# Ollama server starts automatically
# on install, or start manually:
$ ollama serve

# Server runs on localhost:11434
# API docs: docs.ollama.com/api
Generate (Completion)
curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen2.5:7b",
    "prompt": "Explain quantization",
    "stream": false
  }'
Chat (Multi-Turn)
curl http://localhost:11434/api/chat \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [
      {"role": "user", "content": "What is GGUF?"}
    ]
  }'
Python Client
import ollama

# Simple generation
response = ollama.generate(
    model='qwen2.5:7b',
    prompt='Explain quantization in 2 sentences'
)
print(response['response'])

# Chat with history
response = ollama.chat(
    model='qwen2.5:7b',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'What is GGUF?'}
    ]
)

# Streaming
for chunk in ollama.chat(
    model='qwen2.5:7b',
    messages=[...],
    stream=True
):
    print(chunk['message']['content'], end='')
Key insight: Ollama’s API is intentionally similar to OpenAI’s. This means you can swap a cloud model for a local one by changing the base URL from api.openai.com to localhost:11434. Many libraries (LangChain, LlamaIndex) support Ollama as a drop-in replacement. We’ll build on this in Chapter 7.
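Ollama exposes this compatibility through an OpenAI-style endpoint under /v1. As a sketch of the idea, here is a stdlib-only helper (the function name is mine, not part of any library) that builds a chat request whose JSON body would work unchanged against api.openai.com, with only the base URL differing:

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434"

def build_chat_request(model, messages, base=OLLAMA_BASE):
    """Build a request for Ollama's OpenAI-compatible chat endpoint.
    The same JSON body shape works against OpenAI's API -- only the
    base URL differs, which is the point of the compatibility."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    req = build_chat_request(
        "qwen2.5:7b",
        [{"role": "user", "content": "What is GGUF?"}],
    )
    # Requires a running `ollama serve` on localhost:11434
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    print(reply["choices"][0]["message"]["content"])
```

The response follows the OpenAI choices/message shape, which is why libraries built for OpenAI can be pointed at Ollama without code changes.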
Custom Modelfiles
Create custom model configurations with system prompts, parameters, and templates
What’s a Modelfile?
A Modelfile is like a Dockerfile for AI models. It defines a custom model configuration: which base model to use, what system prompt to set, and what parameters to tune.

This lets you create specialized models (“my-support-bot”, “my-code-reviewer”) that you can share and reuse.
Example Modelfile
# File: Modelfile.support-bot
FROM qwen2.5:7b

SYSTEM """You are a customer support assistant
for AcmeCorp. Be helpful, concise, and professional.
Never make up information. If you don't know, say so."""

PARAMETER temperature 0.3
PARAMETER num_ctx 4096
PARAMETER top_p 0.9
Create & Run
# Create the custom model
$ ollama create support-bot \
    -f Modelfile.support-bot

# Run it
$ ollama run support-bot
>>> How do I reset my password?
# It now uses the system prompt
# and parameters from the Modelfile
Key Parameters
temperature      0.0–2.0       Creativity
num_ctx          2048–131072   Context window
top_p            0.0–1.0       Nucleus sampling
top_k            1–100         Top-K sampling
repeat_penalty   1.0–2.0       Repetition control
num_gpu          0–999         GPU layers to offload
seed             int           Reproducible output
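The same parameters that a Modelfile sets with PARAMETER lines can also be sent per request in the API's "options" field. As an illustration of that mapping, here is a hypothetical helper (not part of any library) that collects PARAMETER lines into an options dict:

```python
def modelfile_params_to_options(modelfile_text):
    """Hypothetical helper: collect PARAMETER lines from a Modelfile
    into the 'options' dict that the REST API accepts per request.
    A sketch only -- it ignores FROM/SYSTEM/TEMPLATE directives and
    assumes one 'PARAMETER key value' triple per line."""
    options = {}
    for line in modelfile_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "PARAMETER":
            key, raw = parts[1], parts[2]
            for cast in (int, float):
                try:
                    options[key] = cast(raw)
                    break
                except ValueError:
                    continue
            else:
                options[key] = raw  # string-valued parameter
    return options

mf = """FROM qwen2.5:7b
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
PARAMETER top_p 0.9
"""
print(modelfile_params_to_options(mf))
# -> {'temperature': 0.3, 'num_ctx': 4096, 'top_p': 0.9}
```

Baking parameters into a Modelfile gives you a reusable named model; sending them as per-request options lets one base model serve several configurations.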
Key insight: Modelfiles let you version-control your model configurations. Commit them to git, share with your team, deploy consistently. A Modelfile + base model = a reproducible, specialized AI assistant that anyone on your team can run with ollama create.
Model Management
Where models live, how to manage storage, and keeping things tidy
Storage Locations
macOS:    ~/.ollama/models/
Linux:    /usr/share/ollama/.ollama/models/
Windows:  C:\Users\<user>\.ollama\models\

Custom location:
  OLLAMA_MODELS=/path/to/models ollama serve
Management Commands
# List all models with sizes
$ ollama list

# See what's currently loaded in RAM
$ ollama ps

# Remove a model (frees disk space)
$ ollama rm llama3.2

# Copy a model (for custom variants)
$ ollama cp qwen2.5:7b my-qwen

# Show model details (params, template)
$ ollama show qwen2.5:7b --modelfile
Memory Management
Models stay loaded for 5 minutes after the last request (configurable via keep_alive). This means:

• First request: slow (model loads into RAM/VRAM)
• Subsequent requests within 5 min: fast (already loaded)
• After 5 min idle: model unloads, RAM freed

Running multiple models: Ollama can load multiple models simultaneously if you have enough RAM. Each model occupies its full size in memory.
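The unload timer can be overridden per request through the API's keep_alive field (a duration like "10m", 0 to unload immediately, or -1 to keep the model loaded indefinitely). A minimal stdlib sketch, assuming a local server and the qwen2.5:7b model from earlier (the helper function name is mine):

```python
import json
import urllib.request

def build_generate_request(model, prompt, keep_alive="5m",
                           base="http://localhost:11434"):
    """Build an /api/generate request that overrides the unload timer.
    keep_alive examples: "10m" (keep loaded for 10 minutes),
    0 (unload right after responding), -1 (keep loaded forever)."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    }).encode()
    return urllib.request.Request(
        f"{base}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    # Pin the model in memory for consistent latency
    # (requires a running `ollama serve`)
    req = build_generate_request("qwen2.5:7b", "Warm up.", keep_alive=-1)
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["response"])
```

Sending an empty prompt with keep_alive=-1 at server startup is a common warm-up pattern, so the first real user request never pays the load cost.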
Disk Space Tips
Models are large. A typical setup might have 3–5 models totaling 15–25GB. Regularly clean unused models with ollama rm. Use ollama list to see what’s taking space.
Key insight: Ollama’s model management is simple but effective. The 5-minute keep-alive means you get fast responses during active use without permanently consuming RAM. For production servers, set keep_alive to -1 (never unload) for consistent latency.
GPU vs CPU Inference
When GPU matters, when CPU is fine, and how Ollama decides
How Ollama Allocates
Ollama automatically detects your GPU and offloads as many model layers as will fit in VRAM. Remaining layers run on CPU.

Full GPU: All layers in VRAM. Fastest. Happens when model fits entirely in VRAM.

Partial GPU: Some layers in VRAM, rest on CPU. Common with larger models on consumer GPUs.

CPU only: No GPU, or GPU too small. Slower but still works. Apple Silicon (M1/M2/M3) uses unified memory — CPU and GPU share the same RAM, so it’s always “GPU.”
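The allocation logic above can be sketched as a back-of-envelope calculation. This is illustrative only: the layer count (~28 for a 7B model) and the fixed overhead are assumptions, and Ollama's real accounting also reserves VRAM for the KV cache and compute buffers:

```python
def layers_offloaded(model_gb, n_layers, vram_gb, overhead_gb=1.5):
    """Rough sketch of the decision: offload as many whole layers as
    fit in free VRAM, run the rest on CPU. Illustrative only --
    real accounting also covers KV cache and compute buffers."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - overhead_gb, 0)
    return min(n_layers, int(usable_gb / per_layer_gb))

# Qwen 2.5 7B Q4_K_M (~4.4 GB; ~28 layers assumed here):
print(layers_offloaded(4.4, 28, vram_gb=24))  # -> 28: full GPU
print(layers_offloaded(4.4, 28, vram_gb=4))   # -> 15: partial offload
print(layers_offloaded(4.4, 28, vram_gb=0))   # -> 0: CPU only
```

The partial case is why a 12GB card can still run models larger than its VRAM, just at reduced speed: every layer left on the CPU drags down the overall tokens/sec.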
Speed Comparison
Qwen 2.5 7B Q4_K_M — tokens/sec:

RTX 4090 (24GB VRAM):   ~95 tok/s
RTX 4070 (12GB VRAM):   ~65 tok/s
Apple M2 Pro (16GB):    ~45 tok/s
Apple M1 (8GB):         ~25 tok/s
Intel i7 CPU-only:      ~8 tok/s
Raspberry Pi 5:         ~2 tok/s

Apple Silicon is special: unified memory means the "GPU" and "CPU" share the same fast RAM. No PCIe bottleneck for data transfer.
Key insight: For local AI, Apple Silicon Macs are surprisingly good — unified memory means a 16GB M2 Pro can run a 7B model at ~45 tok/s with no discrete GPU. On the NVIDIA side, the RTX 4090 is the gold standard. CPU-only is viable for batch processing but too slow for interactive use.
The Ollama Cheat Sheet
Everything you need on one card
Commands
ollama pull <model>        Download model
ollama run <model>         Chat interactively
ollama list                Show downloaded
ollama ps                  Show running
ollama rm <model>          Delete model
ollama show <model>        Model details
ollama cp <src> <dst>      Copy model
ollama create <name> -f Modelfile
ollama serve               Start server
API Endpoints
POST    /api/generate     Completion
POST    /api/chat         Chat (multi-turn)
POST    /api/embeddings   Embeddings
GET     /api/tags         List models
POST    /api/show         Model info
POST    /api/pull         Pull model
DELETE  /api/delete       Delete model
GET     /api/ps           Running models
Recommended First Models
8GB RAM:
  ollama run llama3.2        # 3B, 2GB
  ollama run gemma2:2b       # 2B, 1.6GB

16GB RAM:
  ollama run qwen2.5:7b      # 7B, 4.4GB
  ollama run phi4-mini       # 3.8B, 2.5GB

24GB+ RAM:
  ollama run mistral-small   # 24B, 14GB
  ollama run qwen2.5:14b     # 14B, 9GB
Key insight: Ollama is your daily driver for local AI. Install it, pull a model, and you have a private, free, fast AI assistant running on your machine. For most users, this is all you need. Chapter 6 covers llama.cpp for when you need more control — custom quantization, server tuning, or building from source.