Ch 5 — Ollama: Your Local AI Runtime

Install, pull, run — from zero to local AI in 5 minutes
What Is Ollama?
The Docker of local AI — one tool to pull, run, and manage models
Ollama in One Sentence
Ollama is a tool that lets you download and run LLMs locally with a single command. It handles model downloading, GGUF format management, GPU/CPU allocation, and serves an API — all automatically.

Think of it as Docker for AI models: docker pull nginx becomes ollama pull llama3.2. docker run becomes ollama run. Same simplicity, different domain.
Architecture
Ollama
├── CLI (command-line interface)
│     ollama run, pull, list, rm, serve
├── Server (background daemon)
│     Listens on localhost:11434
│     REST API for programmatic access
├── Model Registry (ollama.com/library)
│     Pre-quantized models ready to pull
└── Inference Engine (llama.cpp)
      The actual C++ code that runs models
      Handles GPU offloading, KV cache, etc.

Ollama is a wrapper around llama.cpp that adds model management, an API, and a much simpler user experience.
Key insight: Ollama’s genius is abstraction. You don’t need to know about GGUF formats, quantization levels, GPU layers, or context sizes. Pull a model, run it. Ollama figures out the rest. When you need more control, Chapter 6 covers llama.cpp directly.
Installation & First Model
From zero to running a local LLM in under 5 minutes
Install
macOS:
  brew install ollama   # or download from ollama.com

Linux:
  curl -fsSL https://ollama.com/install.sh | sh

Windows:
  # Download installer from ollama.com

Docker:
  docker run -d -v ollama:/root/.ollama \
    -p 11434:11434 --name ollama \
    ollama/ollama
Your First Model
# Pull and run in one command:
$ ollama run llama3.2

# First time: downloads ~2GB model
# Subsequent: starts instantly

>>> What is the capital of France?
The capital of France is Paris. It is the largest city in France
and serves as the country's political, economic, and cultural
center...

>>> /bye   # Exit the chat
What Just Happened
1. Ollama downloaded the Llama 3.2 3B model (Q4_K_M quantization) from its registry
2. Loaded it into RAM (or VRAM if GPU available)
3. Started an interactive chat session
4. All inference happened locally — no data left your machine
Key insight: That’s it. One command. No API keys, no accounts, no configuration. The model runs entirely on your hardware. Every prompt and response stays on your machine. This is the power of local AI.
Running Models: Commands & Options
Pull, run, and control models from the command line
Essential Commands
# Pull a model (download only)
$ ollama pull qwen2.5:7b

# Run interactively
$ ollama run qwen2.5:7b

# Run with a specific prompt
$ ollama run qwen2.5:7b "Summarize this"

# Pipe input
$ cat document.txt | ollama run qwen2.5:7b \
    "Summarize this document in 3 bullets"

# List downloaded models
$ ollama list
NAME              SIZE    MODIFIED
llama3.2:latest   2.0 GB  2 hours ago
qwen2.5:7b        4.4 GB  5 min ago

# Remove a model
$ ollama rm llama3.2

# Show model details
$ ollama show qwen2.5:7b
Model Naming
Format: name:tag

ollama run llama3.2        → llama3.2:latest (default tag)
ollama run llama3.2:1b     → 1B parameter variant
ollama run qwen2.5:7b      → 7B parameter variant
ollama run mistral-small   → Mistral Small 3.1 (24B)
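The name:tag convention is easy to handle in code too. Here is a tiny hypothetical helper (not part of the ollama library) that splits a reference the way the CLI does, defaulting a missing tag to latest:

```python
def parse_model_ref(ref):
    """Split a model reference into (name, tag); a missing tag
    defaults to "latest", matching how `ollama run llama3.2`
    resolves to llama3.2:latest."""
    name, _, tag = ref.partition(":")
    return name, tag or "latest"

print(parse_model_ref("llama3.2"))    # -> ('llama3.2', 'latest')
print(parse_model_ref("qwen2.5:7b"))  # -> ('qwen2.5', '7b')
```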
Popular Models on Ollama
llama3.2 — Meta’s 3B, great all-rounder
qwen2.5:7b — Alibaba’s 7B, strong reasoning
gemma2:2b — Google’s 2B, tiny and fast
phi4-mini — Microsoft’s 3.8B, math/code
mistral-small — Mistral’s 24B, best in the medium class
codellama:7b — Code-specialized Llama
nomic-embed-text — Embeddings model
Key insight: Ollama’s model registry at ollama.com/library has hundreds of pre-quantized models ready to pull. You don’t need to find GGUF files on Hugging Face or quantize anything yourself. Just ollama pull model-name and go.
The Ollama API
REST endpoints for programmatic access — build apps, not just chat
Start the Server
# Ollama server starts automatically
# on install, or start manually:
$ ollama serve

# Server runs on localhost:11434
# API docs: docs.ollama.com/api
Generate (Completion)
curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen2.5:7b",
    "prompt": "Explain quantization",
    "stream": false
  }'
Chat (Multi-Turn)
curl http://localhost:11434/api/chat \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [
      {"role": "user", "content": "What is GGUF?"}
    ]
  }'
Python Client
import ollama

# Simple generation
response = ollama.generate(
    model='qwen2.5:7b',
    prompt='Explain quantization in 2 sentences'
)
print(response['response'])

# Chat with history
response = ollama.chat(
    model='qwen2.5:7b',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'What is GGUF?'}
    ]
)

# Streaming
for chunk in ollama.chat(
    model='qwen2.5:7b',
    messages=[...],
    stream=True
):
    print(chunk['message']['content'], end='')
Key insight: Ollama’s API is intentionally similar to OpenAI’s. This means you can swap a cloud model for a local one by changing the base URL from api.openai.com to localhost:11434. Many libraries (LangChain, LlamaIndex) support Ollama as a drop-in replacement. We’ll build on this in Chapter 7.
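Ollama exposes this compatibility through an OpenAI-style endpoint under /v1. As a sketch of the idea, here is a stdlib-only helper (the function name is mine, not part of any library) that builds a chat request whose JSON body would work unchanged against api.openai.com, with only the base URL differing:

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434"

def build_chat_request(model, messages, base=OLLAMA_BASE):
    """Build a request for Ollama's OpenAI-compatible chat endpoint.
    The same JSON body shape works against OpenAI's API -- only the
    base URL differs, which is the point of the compatibility."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    req = build_chat_request(
        "qwen2.5:7b",
        [{"role": "user", "content": "What is GGUF?"}],
    )
    # Requires a running `ollama serve` on localhost:11434
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    print(reply["choices"][0]["message"]["content"])
```

The response follows the OpenAI choices/message shape, which is why libraries built for OpenAI can be pointed at Ollama without code changes.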
Custom Modelfiles
Create custom model configurations with system prompts, parameters, and templates
What’s a Modelfile?
A Modelfile is like a Dockerfile for AI models. It defines a custom model configuration: which base model to use, what system prompt to set, and what parameters to tune.

This lets you create specialized models (“my-support-bot”, “my-code-reviewer”) that you can share and reuse.
Example Modelfile
# File: Modelfile.support-bot
FROM qwen2.5:7b

SYSTEM """You are a customer support assistant
for AcmeCorp. Be helpful, concise, and professional.
Never make up information. If you don't know, say so."""

PARAMETER temperature 0.3
PARAMETER num_ctx 4096
PARAMETER top_p 0.9
Create & Run
# Create the custom model
$ ollama create support-bot \
    -f Modelfile.support-bot

# Run it
$ ollama run support-bot
>>> How do I reset my password?
# It now uses the system prompt
# and parameters from the Modelfile
Key Parameters
temperature      0.0–2.0       Creativity
num_ctx          2048–131072   Context window
top_p            0.0–1.0       Nucleus sampling
top_k            1–100         Top-K sampling
repeat_penalty   1.0–2.0       Repetition control
num_gpu          0–999         GPU layers to offload
seed             int           Reproducible output
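The same parameters that a Modelfile sets with PARAMETER lines can also be sent per request in the API's "options" field. As an illustration of that mapping, here is a hypothetical helper (not part of any library) that collects PARAMETER lines into an options dict:

```python
def modelfile_params_to_options(modelfile_text):
    """Hypothetical helper: collect PARAMETER lines from a Modelfile
    into the 'options' dict that the REST API accepts per request.
    A sketch only -- it ignores FROM/SYSTEM/TEMPLATE directives and
    assumes one 'PARAMETER key value' triple per line."""
    options = {}
    for line in modelfile_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "PARAMETER":
            key, raw = parts[1], parts[2]
            for cast in (int, float):
                try:
                    options[key] = cast(raw)
                    break
                except ValueError:
                    continue
            else:
                options[key] = raw  # string-valued parameter
    return options

mf = """FROM qwen2.5:7b
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
PARAMETER top_p 0.9
"""
print(modelfile_params_to_options(mf))
# -> {'temperature': 0.3, 'num_ctx': 4096, 'top_p': 0.9}
```

Baking parameters into a Modelfile gives you a reusable named model; sending them as per-request options lets one base model serve several configurations.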
Key insight: Modelfiles let you version-control your model configurations. Commit them to git, share with your team, deploy consistently. A Modelfile + base model = a reproducible, specialized AI assistant that anyone on your team can run with ollama create.
Model Management
Where models live, how to manage storage, and keeping things tidy
Storage Locations
macOS:    ~/.ollama/models/
Linux:    /usr/share/ollama/.ollama/models/
Windows:  C:\Users\<user>\.ollama\models\

Custom location:
  OLLAMA_MODELS=/path/to/models ollama serve
Management Commands
# List all models with sizes
$ ollama list

# See what's currently loaded in RAM
$ ollama ps

# Remove a model (frees disk space)
$ ollama rm llama3.2

# Copy a model (for custom variants)
$ ollama cp qwen2.5:7b my-qwen

# Show model details (params, template)
$ ollama show qwen2.5:7b --modelfile
Memory Management
Models stay loaded for 5 minutes after the last request (configurable via keep_alive). This means:

• First request: slow (model loads into RAM/VRAM)
• Subsequent requests within 5 min: fast (already loaded)
• After 5 min idle: model unloads, RAM freed

Running multiple models: Ollama can load multiple models simultaneously if you have enough RAM. Each model occupies its full size in memory.
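The unload timer can be overridden per request through the API's keep_alive field (a duration like "10m", 0 to unload immediately, or -1 to keep the model loaded indefinitely). A minimal stdlib sketch, assuming a local server and the qwen2.5:7b model from earlier (the helper function name is mine):

```python
import json
import urllib.request

def build_generate_request(model, prompt, keep_alive="5m",
                           base="http://localhost:11434"):
    """Build an /api/generate request that overrides the unload timer.
    keep_alive examples: "10m" (keep loaded for 10 minutes),
    0 (unload right after responding), -1 (keep loaded forever)."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    }).encode()
    return urllib.request.Request(
        f"{base}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    # Pin the model in memory for consistent latency
    # (requires a running `ollama serve`)
    req = build_generate_request("qwen2.5:7b", "Warm up.", keep_alive=-1)
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["response"])
```

Sending an empty prompt with keep_alive=-1 at server startup is a common warm-up pattern, so the first real user request never pays the load cost.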
Disk Space Tips
Models are large. A typical setup might have 3–5 models totaling 15–25GB. Regularly clean unused models with ollama rm. Use ollama list to see what’s taking space.
Key insight: Ollama’s model management is simple but effective. The 5-minute keep-alive means you get fast responses during active use without permanently consuming RAM. For production servers, set keep_alive to -1 (never unload) for consistent latency.
GPU vs CPU Inference
When GPU matters, when CPU is fine, and how Ollama decides
How Ollama Allocates
Ollama automatically detects your GPU and offloads as many model layers as will fit in VRAM. Remaining layers run on CPU.

Full GPU: All layers in VRAM. Fastest. Happens when model fits entirely in VRAM.

Partial GPU: Some layers in VRAM, rest on CPU. Common with larger models on consumer GPUs.

CPU only: No GPU, or GPU too small. Slower but still works. Apple Silicon (M1/M2/M3) uses unified memory — CPU and GPU share the same RAM, so it’s always “GPU.”
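The allocation logic above can be sketched as a back-of-envelope calculation. This is illustrative only: the layer count (~28 for a 7B model) and the fixed overhead are assumptions, and Ollama's real accounting also reserves VRAM for the KV cache and compute buffers:

```python
def layers_offloaded(model_gb, n_layers, vram_gb, overhead_gb=1.5):
    """Rough sketch of the decision: offload as many whole layers as
    fit in free VRAM, run the rest on CPU. Illustrative only --
    real accounting also covers KV cache and compute buffers."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - overhead_gb, 0)
    return min(n_layers, int(usable_gb / per_layer_gb))

# Qwen 2.5 7B Q4_K_M (~4.4 GB; ~28 layers assumed here):
print(layers_offloaded(4.4, 28, vram_gb=24))  # -> 28: full GPU
print(layers_offloaded(4.4, 28, vram_gb=4))   # -> 15: partial offload
print(layers_offloaded(4.4, 28, vram_gb=0))   # -> 0: CPU only
```

The partial case is why a 12GB card can still run models larger than its VRAM, just at reduced speed: every layer left on the CPU drags down the overall tokens/sec.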
Speed Comparison
Qwen 2.5 7B Q4_K_M — tokens/sec:

RTX 4090 (24GB VRAM):   ~95 tok/s
RTX 4070 (12GB VRAM):   ~65 tok/s
Apple M2 Pro (16GB):    ~45 tok/s
Apple M1 (8GB):         ~25 tok/s
Intel i7 CPU-only:      ~8 tok/s
Raspberry Pi 5:         ~2 tok/s

Apple Silicon is special: unified memory means the "GPU" and "CPU" share the same fast RAM. No PCIe bottleneck for data transfer.
Key insight: For local AI, Apple Silicon Macs are surprisingly good — unified memory means a 16GB M2 Pro can run a 7B model at ~45 tok/s with no discrete GPU. On the NVIDIA side, the RTX 4090 is the gold standard. CPU-only is viable for batch processing but too slow for interactive use.
The Ollama Cheat Sheet
Everything you need on one card
Commands
ollama pull <model>        Download model
ollama run <model>         Chat interactively
ollama list                Show downloaded
ollama ps                  Show running
ollama rm <model>          Delete model
ollama show <model>        Model details
ollama cp <src> <dst>      Copy model
ollama create <name> -f Modelfile
ollama serve               Start server
API Endpoints
POST    /api/generate     Completion
POST    /api/chat         Chat (multi-turn)
POST    /api/embeddings   Embeddings
GET     /api/tags         List models
POST    /api/show         Model info
POST    /api/pull         Pull model
DELETE  /api/delete       Delete model
GET     /api/ps           Running models
Recommended First Models
8GB RAM:
  ollama run llama3.2        # 3B, 2GB
  ollama run gemma2:2b       # 2B, 1.6GB

16GB RAM:
  ollama run qwen2.5:7b      # 7B, 4.4GB
  ollama run phi4-mini       # 3.8B, 2.5GB

24GB+ RAM:
  ollama run mistral-small   # 24B, 14GB
  ollama run qwen2.5:14b     # 14B, 9GB
Key insight: Ollama is your daily driver for local AI. Install it, pull a model, and you have a private, free, fast AI assistant running on your machine. For most users, this is all you need. Chapter 6 covers llama.cpp for when you need more control — custom quantization, server tuning, or building from source.