Ch 6 — Running LLMs Locally with Ollama

One command, any model, zero configuration — local AI made simple
The local inference pipeline: Pull → Cache → Load → Infer → serve via API.
Getting Started with Ollama
Install, run your first model, zero to AI in 2 minutes
Installation
Mac: brew install ollama or download from ollama.com. Linux: curl -fsSL https://ollama.com/install.sh | sh. Windows: download the installer from ollama.com. Ollama runs as a background service — ollama serve starts it manually if it's not already running.
Your First Model
ollama run llama3.2 — this downloads the model on first use and drops you into an interactive chat session. Smaller and alternative models work the same way: ollama run gemma3:1b, ollama run qwen3:8b. Every model in the Ollama library can be pulled this way.
# Install (macOS)
brew install ollama

# Run your first model
ollama run llama3.2

# Non-interactive: pass a prompt
ollama run llama3.2 "Explain RAG in one sentence"

# List installed models
ollama list

# Remove a model
ollama rm llama3.2

# Show model info
ollama show llama3.2
First run is slow — Ollama downloads the model artifact before first use. Subsequent runs are instant because the model is cached in ~/.ollama/models. On Apple Silicon, Metal GPU acceleration is automatic — no configuration needed.
The Ollama REST API
Programmatic access on port 11434
Starting the Server
ollama serve starts the REST API server on http://localhost:11434. On macOS, it starts automatically on login. On Linux, enable with systemctl enable ollama. The API includes OpenAI-compatible endpoints for common workflows — changing base_url='http://localhost:11434/v1' works for many clients, but verify endpoint parity for your app.
Core Endpoints
POST /api/generate — single-turn generation. POST /api/chat — multi-turn chat with messages array. POST /api/embed — generate embeddings. GET /api/tags — list installed models. POST /api/pull — download a model. All responses stream by default.
import requests

# Native chat API (the OpenAI-compatible endpoints live under /v1)
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": [
            {"role": "user", "content": "What is RAG?"}
        ],
        "stream": False
    }
)
print(response.json()["message"]["content"])

# Or use the OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)
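The /api/embed endpoint can be exercised the same way. A minimal sketch, assuming an embedding model such as nomic-embed-text has already been pulled; the cosine helper is illustrative, not part of Ollama's API:

```python
import math

import requests  # third-party: pip install requests


def embed(texts, model="nomic-embed-text", host="http://localhost:11434"):
    """POST /api/embed returns one embedding vector per input string."""
    resp = requests.post(f"{host}/api/embed",
                         json={"model": model, "input": texts})
    resp.raise_for_status()
    return resp.json()["embeddings"]


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Usage (requires `ollama pull nomic-embed-text` and a running server):
# v1, v2 = embed(["What is RAG?", "Retrieval-augmented generation"])
# print(cosine(v1, v2))
```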
Streaming responses: By default, Ollama streams tokens as they're generated. Set "stream": false in the request body to get the full response at once. For production applications, streaming gives a better user experience — you can display tokens as they arrive.
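The streamed lines are newline-delimited JSON objects, one partial message per line. A minimal sketch of consuming them with requests (the helper names here are illustrative, not part of Ollama's API):

```python
import json

import requests  # third-party: pip install requests


def extract_content(line: bytes) -> str:
    """Each streamed line is one standalone JSON object with a partial message."""
    chunk = json.loads(line)
    return chunk.get("message", {}).get("content", "")


def stream_chat(prompt: str, model: str = "llama3.2",
                host: str = "http://localhost:11434"):
    """Yield content tokens from Ollama's streaming /api/chat endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    with requests.post(f"{host}/api/chat", json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                token = extract_content(line)
                if token:
                    yield token


# Usage (requires a running Ollama server):
# for token in stream_chat("Tell me a story"):
#     print(token, end="", flush=True)
```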
Model Management
Pulling, listing, and organizing local models
The Ollama Model Library
Ollama maintains a curated library at ollama.com/library — pre-quantized GGUF versions of popular models. Each model has tagged variants: llama3.2:3b, llama3.2:3b-instruct-q4_K_M, llama3.2:latest. The :latest tag defaults to the recommended quantization for most hardware.
Storage Location
Models are stored in ~/.ollama/models/ (macOS/Linux) or C:\Users\{user}\.ollama\models\ (Windows). Each model is a GGUF blob file with a manifest. Disk usage can grow quickly for larger model variants, so storage planning matters.
# Pull without running
ollama pull llama3.1:70b

# See what's stored locally
ollama list
# NAME              ID           SIZE    MODIFIED
# llama3.2:latest   a80c4f17...  2.0 GB  2 hours ago
# llama3.1:70b      ..           43 GB   1 day ago

# Check running models
ollama ps

# Pull a specific quantization
ollama pull llama3.1:8b-instruct-q4_K_M

# Delete to free disk space
ollama rm llama3.1:70b
Model IDs from HuggingFace: Ollama also supports running GGUF files directly: ollama run hf.co/bartowski/Llama-3.1-8B-Instruct-GGUF:Q4_K_M. This lets you run many compatible GGUF models from HuggingFace without requiring an Ollama library listing.
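The same inventory is available programmatically: GET /api/tags returns each installed model with its size in bytes. A minimal sketch (human_size is an illustrative helper, not part of the API):

```python
import requests  # third-party: pip install requests


def installed_models(host="http://localhost:11434"):
    """GET /api/tags lists the locally cached models and their sizes."""
    resp = requests.get(f"{host}/api/tags")
    resp.raise_for_status()
    return resp.json()["models"]


def human_size(n_bytes: int) -> str:
    """Render a byte count roughly the way `ollama list` does."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n_bytes < 1024 or unit == "TB":
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1024


# Usage (requires a running server):
# for m in installed_models():
#     print(m["name"], human_size(m["size"]))
```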
Modelfiles — Customizing Models
System prompts, parameters, and model composition
What a Modelfile Is
A Modelfile is a text file (like a Dockerfile for AI models) that customizes an Ollama model. You can: set a custom system prompt, change generation parameters (temperature, top_p, context length), and layer your customization on top of any base model.
Building a Custom Model
Create a Modelfile → run ollama create my-model -f Modelfile → run ollama run my-model. Your custom model appears in ollama list and behaves exactly like a built-in model — same CLI, same API.
# Modelfile example: custom coding assistant
FROM llama3.2

# Set the system prompt
SYSTEM """
You are an expert Python developer. You write clean, well-documented
code and explain your reasoning clearly. Always include type hints
and follow PEP 8.
"""

# Generation parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

# Build and run
# ollama create python-expert -f Modelfile
# ollama run python-expert
Modelfile FROM: You can base a Modelfile on any installed Ollama model (FROM llama3.2), a GGUF file path (FROM /path/to/model.gguf), or a HuggingFace GGUF (FROM hf.co/bartowski/...). The system prompt and parameters layer on top.
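The create step can also be scripted. A sketch that shells out to the ollama CLI, assuming it is on PATH; the Modelfile contents and model name below are illustrative:

```python
import pathlib
import subprocess
import tempfile

# Illustrative Modelfile contents (assumption: llama3.2 is installed).
MODELFILE = '''FROM llama3.2
SYSTEM """You are an expert Python developer. Follow PEP 8."""
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
'''


def create_model(name: str, modelfile: str = MODELFILE) -> None:
    """Write the Modelfile to a temp dir and build it with the ollama CLI."""
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "Modelfile"
        path.write_text(modelfile)
        subprocess.run(["ollama", "create", name, "-f", str(path)], check=True)


# Usage (requires the ollama CLI on PATH):
# create_model("python-expert")
```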
Hardware Acceleration & Performance
GPU support, inference speeds, and tuning
Automatic GPU Detection
Ollama automatically detects and uses Apple Silicon (Metal), NVIDIA (CUDA), and AMD (ROCm) GPUs where the host supports them. It fills GPU VRAM first, then overflows the remainder to CPU RAM.
Typical Performance
Performance depends on model size, quantization, hardware, and prompt shape. Benchmark on your own workload before setting production expectations.
VRAM Overflow to RAM
If a model does not fully fit in VRAM, Ollama offloads the remaining layers to system RAM and runs at reduced throughput. Benchmark this under realistic context and concurrency settings.
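The CPU/GPU split of a loaded model can be inspected with ollama ps or, programmatically, via GET /api/ps. A sketch assuming the response exposes size and size_vram fields (check your API version):

```python
import requests  # third-party: pip install requests


def vram_fraction(total: int, in_vram: int) -> float:
    """Fraction of the model resident in GPU memory (1.0 = fully on GPU)."""
    return in_vram / total if total else 0.0


def gpu_residency(host="http://localhost:11434"):
    """GET /api/ps reports loaded models with total and VRAM-resident sizes."""
    resp = requests.get(f"{host}/api/ps")
    resp.raise_for_status()
    return [(m["name"], vram_fraction(m["size"], m.get("size_vram", 0)))
            for m in resp.json().get("models", [])]


# Usage (requires a running server with a model loaded):
# for name, frac in gpu_residency():
#     print(f"{name}: {frac:.0%} on GPU")
```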
Concurrency: Ollama can serve parallel requests and keep multiple models loaded, with behavior controlled by memory limits and settings such as OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS. For sustained high-concurrency production serving, use vLLM instead — it's purpose-built for throughput.
Advanced Configuration
Environment variables, context windows, and tuning
Key Environment Variables
OLLAMA_HOST=0.0.0.0:11434 — expose the API on the network (default is localhost only). OLLAMA_MODELS=/path/to/models — change the model storage directory. OLLAMA_NUM_PARALLEL=4 — allow N parallel requests. OLLAMA_MAX_LOADED_MODELS=3 — keep multiple models warm in memory.
Context Length
By default, Ollama loads models with a conservative context window (a few thousand tokens) rather than the model's full native length, to save memory. Override per request by passing "num_ctx": 32768 inside the options object of the API call, or set it in a Modelfile: PARAMETER num_ctx 32768. Longer context uses more VRAM.
# Expose on network (for team use)
export OLLAMA_HOST=0.0.0.0:11434
ollama serve

# Custom model storage
export OLLAMA_MODELS=/data/ollama/models

# Allow multiple parallel requests
export OLLAMA_NUM_PARALLEL=4

# Keep multiple models warm
export OLLAMA_MAX_LOADED_MODELS=3

# Per-request context extension
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "options": {"num_ctx": 32768},
  "messages": [...]
}'
Production limitation: Ollama is optimized for single-user, single-model local use. It lacks continuous batching, PagedAttention, and multi-GPU tensor parallelism. For sustained high-throughput or multi-tenant serving, vLLM is usually a better fit. Ollama is strongest for development and personal or team-local workflows.
The Ollama Ecosystem
Integrations and what to build next
Popular Integrations
Open WebUI: self-hosted ChatGPT-like interface, connects to Ollama out of the box. LangChain: from langchain_ollama import ChatOllama — compatible with common LangChain chat, chain, and agent patterns. LlamaIndex: from llama_index.llms.ollama import Ollama. Continue.dev: VS Code/JetBrains plugin for AI coding with local models.
Python Direct Usage
The ollama Python package: pip install ollama. import ollama; response = ollama.chat(model='llama3.2', messages=[...]). The async client: from ollama import AsyncClient. Both wrap the REST API with pythonic interfaces and streaming support.
# Python library
import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.message.content)

# Streaming
for chunk in ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
):
    print(chunk.message.content, end="", flush=True)
Next steps: Install Ollama → run ollama run llama3.2 → install Open WebUI for a full ChatGPT-like interface → integrate with LangChain for your first local RAG pipeline. Chapter 8 covers Open WebUI in detail, and Chapter 13 covers LangChain and LlamaIndex integration.