Ch 6 — Running LLMs Locally with Ollama

One command, any model, zero configuration — local AI made simple
The local inference pipeline: Pull → Cache → Load → Infer → serve via API.
Getting Started with Ollama
Install, run your first model, zero to AI in 2 minutes
Installation
Mac: brew install ollama or download from ollama.com. Linux: curl -fsSL https://ollama.com/install.sh | sh. Windows: download the installer from ollama.com. Ollama runs as a background service — ollama serve starts it manually if it's not already running.
Your First Model
ollama run llama3.2 — this downloads the model on first use and drops you into an interactive chat session. Smaller and alternative models work the same way: ollama run gemma3:1b, ollama run qwen3:8b. Every model in the Ollama library can be pulled this way.
# Install (macOS)
brew install ollama

# Run your first model
ollama run llama3.2

# Non-interactive: pass a prompt
ollama run llama3.2 "Explain RAG in one sentence"

# List installed models
ollama list

# Remove a model
ollama rm llama3.2

# Show model info
ollama show llama3.2
First run is slow — Ollama downloads the model artifact before first use. Subsequent runs are instant because the model is cached in ~/.ollama/models. On Apple Silicon, Metal GPU acceleration is automatic — no configuration needed.
The Ollama REST API
Programmatic access on port 11434
Starting the Server
ollama serve starts the REST API server on http://localhost:11434. On macOS, it starts automatically on login. On Linux, enable with systemctl enable ollama. The API includes OpenAI-compatible endpoints for common workflows — changing base_url='http://localhost:11434/v1' works for many clients, but verify endpoint parity for your app.
Core Endpoints
POST /api/generate — single-turn generation. POST /api/chat — multi-turn chat with messages array. POST /api/embed — generate embeddings. GET /api/tags — list installed models. POST /api/pull — download a model. All responses stream by default.
import requests

# Native chat API (the OpenAI-compatible endpoints live under /v1)
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": [
            {"role": "user", "content": "What is RAG?"}
        ],
        "stream": False
    }
)
print(response.json()["message"]["content"])

# Or use the OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)
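The /api/embed endpoint can be exercised the same way. A minimal sketch, assuming an embedding model such as nomic-embed-text has already been pulled; the cosine helper is illustrative, not part of Ollama's API:

```python
import math

import requests  # third-party: pip install requests


def embed(texts, model="nomic-embed-text", host="http://localhost:11434"):
    """POST /api/embed returns one embedding vector per input string."""
    resp = requests.post(f"{host}/api/embed",
                         json={"model": model, "input": texts})
    resp.raise_for_status()
    return resp.json()["embeddings"]


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Usage (requires `ollama pull nomic-embed-text` and a running server):
# v1, v2 = embed(["What is RAG?", "Retrieval-augmented generation"])
# print(cosine(v1, v2))
```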
Streaming responses: By default, Ollama streams tokens as they're generated. Set "stream": false in the request body to get the full response at once. For production applications, streaming gives a better user experience — you can display tokens as they arrive.
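The streamed lines are newline-delimited JSON objects, one partial message per line. A minimal sketch of consuming them with requests (the helper names here are illustrative, not part of Ollama's API):

```python
import json

import requests  # third-party: pip install requests


def extract_content(line: bytes) -> str:
    """Each streamed line is one standalone JSON object with a partial message."""
    chunk = json.loads(line)
    return chunk.get("message", {}).get("content", "")


def stream_chat(prompt: str, model: str = "llama3.2",
                host: str = "http://localhost:11434"):
    """Yield content tokens from Ollama's streaming /api/chat endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    with requests.post(f"{host}/api/chat", json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                token = extract_content(line)
                if token:
                    yield token


# Usage (requires a running Ollama server):
# for token in stream_chat("Tell me a story"):
#     print(token, end="", flush=True)
```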
Model Management
Pulling, listing, and organizing local models
The Ollama Model Library
Ollama maintains a curated library at ollama.com/library — pre-quantized GGUF versions of popular models. Each model has tagged variants: llama3.2:3b, llama3.2:3b-instruct-q4_K_M, llama3.2:latest. The :latest tag defaults to the recommended quantization for most hardware.
Storage Location
Models are stored in ~/.ollama/models/ (macOS/Linux) or C:\Users\{user}\.ollama\models\ (Windows). Each model is a GGUF blob file with a manifest. Disk usage can grow quickly for larger model variants, so storage planning matters.
# Pull without running
ollama pull llama3.1:70b

# See what's stored locally
ollama list
# NAME              ID           SIZE    MODIFIED
# llama3.2:latest   a80c4f17...  2.0 GB  2 hours ago
# llama3.1:70b      ..           43 GB   1 day ago

# Check running models
ollama ps

# Pull a specific quantization
ollama pull llama3.1:8b-instruct-q4_K_M

# Delete to free disk space
ollama rm llama3.1:70b
Model IDs from HuggingFace: Ollama also supports running GGUF files directly: ollama run hf.co/bartowski/Llama-3.1-8B-Instruct-GGUF:Q4_K_M. This lets you run many compatible GGUF models from HuggingFace without requiring an Ollama library listing.
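The same inventory is available programmatically: GET /api/tags returns each installed model with its size in bytes. A minimal sketch (human_size is an illustrative helper, not part of the API):

```python
import requests  # third-party: pip install requests


def installed_models(host="http://localhost:11434"):
    """GET /api/tags lists the locally cached models and their sizes."""
    resp = requests.get(f"{host}/api/tags")
    resp.raise_for_status()
    return resp.json()["models"]


def human_size(n_bytes: int) -> str:
    """Render a byte count roughly the way `ollama list` does."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n_bytes < 1024 or unit == "TB":
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1024


# Usage (requires a running server):
# for m in installed_models():
#     print(m["name"], human_size(m["size"]))
```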
Modelfiles — Customizing Models
System prompts, parameters, and model composition
What a Modelfile Is
A Modelfile is a text file (like a Dockerfile for AI models) that customizes an Ollama model. You can: set a custom system prompt, change generation parameters (temperature, top_p, context length), and layer your customization on top of any base model.
Building a Custom Model
Create a Modelfile → run ollama create my-model -f Modelfile → run ollama run my-model. Your custom model appears in ollama list and behaves exactly like a built-in model — same CLI, same API.
# Modelfile example: custom coding assistant
FROM llama3.2

# Set the system prompt
SYSTEM """
You are an expert Python developer. You write clean, well-documented
code and explain your reasoning clearly. Always include type hints
and follow PEP 8.
"""

# Generation parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

# Build and run
# ollama create python-expert -f Modelfile
# ollama run python-expert
Modelfile FROM: You can base a Modelfile on any installed Ollama model (FROM llama3.2), a GGUF file path (FROM /path/to/model.gguf), or a HuggingFace GGUF (FROM hf.co/bartowski/...). The system prompt and parameters layer on top.
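The create step can also be scripted. A sketch that shells out to the ollama CLI, assuming it is on PATH; the Modelfile contents and model name below are illustrative:

```python
import pathlib
import subprocess
import tempfile

# Illustrative Modelfile contents (assumption: llama3.2 is installed).
MODELFILE = '''FROM llama3.2
SYSTEM """You are an expert Python developer. Follow PEP 8."""
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
'''


def create_model(name: str, modelfile: str = MODELFILE) -> None:
    """Write the Modelfile to a temp dir and build it with the ollama CLI."""
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "Modelfile"
        path.write_text(modelfile)
        subprocess.run(["ollama", "create", name, "-f", str(path)], check=True)


# Usage (requires the ollama CLI on PATH):
# create_model("python-expert")
```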
Hardware Acceleration & Performance
GPU support, inference speeds, and tuning
Automatic GPU Detection
Ollama automatically detects and uses Apple Silicon (Metal), NVIDIA (CUDA), and AMD (ROCm) GPUs where the host supports them. It fills GPU VRAM first, then overflows the remainder to CPU RAM.
Typical Performance
Performance depends on model size, quantization, hardware, and prompt shape. Benchmark on your own workload before setting production expectations.
VRAM Overflow to RAM
If a model does not fully fit in VRAM, Ollama offloads the remaining layers to system RAM and runs at reduced throughput. Benchmark this under realistic context and concurrency settings.
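The CPU/GPU split of a loaded model can be inspected with ollama ps or, programmatically, via GET /api/ps. A sketch assuming the response exposes size and size_vram fields (check your API version):

```python
import requests  # third-party: pip install requests


def vram_fraction(total: int, in_vram: int) -> float:
    """Fraction of the model resident in GPU memory (1.0 = fully on GPU)."""
    return in_vram / total if total else 0.0


def gpu_residency(host="http://localhost:11434"):
    """GET /api/ps reports loaded models with total and VRAM-resident sizes."""
    resp = requests.get(f"{host}/api/ps")
    resp.raise_for_status()
    return [(m["name"], vram_fraction(m["size"], m.get("size_vram", 0)))
            for m in resp.json().get("models", [])]


# Usage (requires a running server with a model loaded):
# for name, frac in gpu_residency():
#     print(f"{name}: {frac:.0%} on GPU")
```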
Concurrency: Ollama can serve parallel requests and keep multiple models loaded, with behavior controlled by memory limits and settings such as OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS. For sustained high-concurrency production serving, use vLLM instead — it's purpose-built for throughput.
Advanced Configuration
Environment variables, context windows, and tuning
Key Environment Variables
OLLAMA_HOST=0.0.0.0:11434 — expose the API on the network (default is localhost only). OLLAMA_MODELS=/path/to/models — change the model storage directory. OLLAMA_NUM_PARALLEL=4 — allow N parallel requests. OLLAMA_MAX_LOADED_MODELS=3 — keep multiple models warm in memory.
Context Length
By default, Ollama loads models with a conservative context window (a few thousand tokens) rather than the model's full native length, to save memory. Override per request by passing "num_ctx": 32768 inside the options object of the API call, or set it in a Modelfile: PARAMETER num_ctx 32768. Longer context uses more VRAM.
# Expose on network (for team use)
export OLLAMA_HOST=0.0.0.0:11434
ollama serve

# Custom model storage
export OLLAMA_MODELS=/data/ollama/models

# Allow multiple parallel requests
export OLLAMA_NUM_PARALLEL=4

# Keep multiple models warm
export OLLAMA_MAX_LOADED_MODELS=3

# Per-request context extension
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "options": {"num_ctx": 32768},
  "messages": [...]
}'
Production limitation: Ollama is optimized for single-user, single-model local use. It lacks continuous batching, PagedAttention, and multi-GPU tensor parallelism. For sustained high-throughput or multi-tenant serving, vLLM is usually a better fit. Ollama is strongest for development and personal or team-local workflows.
The Ollama Ecosystem
Integrations and what to build next
Popular Integrations
Open WebUI: self-hosted ChatGPT-like interface, connects to Ollama out of the box. LangChain: from langchain_ollama import ChatOllama — compatible with common LangChain chat, chain, and agent patterns. LlamaIndex: from llama_index.llms.ollama import Ollama. Continue.dev: VS Code/JetBrains plugin for AI coding with local models.
Python Direct Usage
The ollama Python package: pip install ollama. import ollama; response = ollama.chat(model='llama3.2', messages=[...]). The async client: from ollama import AsyncClient. Both wrap the REST API with pythonic interfaces and streaming support.
# Python library
import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.message.content)

# Streaming
for chunk in ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
):
    print(chunk.message.content, end="", flush=True)
Next steps: Install Ollama → run ollama run llama3.2 → install Open WebUI for a full ChatGPT-like interface → integrate with LangChain for your first local RAG pipeline. Chapter 8 covers Open WebUI in detail, and Chapter 13 covers LangChain and LlamaIndex integration.