Install Ollama with brew install ollama on macOS, or download it from ollama.com. On Linux: curl -fsSL https://ollama.com/install.sh | sh. On Windows, download the installer from ollama.com. Ollama runs as a background service; ollama serve starts it manually if it is not already running.

ollama run llama3.2 downloads the model and drops you into an interactive chat session. Use ollama run gemma3:2b for a smaller model, or ollama run qwen3:8b as another supported option. Every model in the Ollama library can be pulled this way and is stored under ~/.ollama/models. On Apple Silicon, Metal GPU acceleration is automatic; no configuration is needed.

ollama serve starts the REST API server on http://localhost:11434. On macOS it starts automatically on login; on Linux, enable it with systemctl enable ollama. The API includes OpenAI-compatible endpoints for common workflows: setting base_url='http://localhost:11434/v1' works for many clients, but verify endpoint parity for your app.

The main endpoints:

- POST /api/generate: single-turn generation.
- POST /api/chat: multi-turn chat with a messages array.
- POST /api/embed: generate embeddings.
- GET /api/tags: list installed models.
- POST /api/pull: download a model.

All responses stream by default; set "stream": false to get the full response at once. For production applications, streaming gives a better user experience, since you can display tokens as they arrive.

Tags select a specific size and quantization: llama3.2:3b, llama3.2:3b-instruct-q4_K_M, llama3.2:latest. The :latest tag defaults to the recommended quantization for most hardware.

Models live in ~/.ollama/models/ (macOS/Linux) or C:\Users\{user}\.ollama\models\ (Windows). Each model is a GGUF blob file with a manifest. Disk usage can grow quickly for larger model variants, so storage planning matters.

You can also run GGUF models straight from HuggingFace: ollama run hf.co/bartowski/Llama-3.1-8B-Instruct-GGUF:Q4_K_M. This lets you run many compatible GGUF models without requiring an Ollama library listing.

To customize a model, write a Modelfile, run ollama create my-model -f Modelfile, then run ollama run my-model.
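As a sketch of that workflow, a minimal Modelfile might look like the following; the base model, parameter values, and system prompt here are illustrative choices, not required ones:

```
FROM llama3.2
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
SYSTEM You are a concise technical assistant.
```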
Your custom model appears in ollama list and behaves exactly like a built-in model: same CLI, same API. The FROM instruction accepts an Ollama library model (FROM llama3.2), a GGUF file path (FROM /path/to/model.gguf), or a HuggingFace GGUF (FROM hf.co/bartowski/...). The system prompt and parameters layer on top.

Concurrency is controlled with OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS. For sustained high-concurrency production serving, use vLLM instead; it is purpose-built for throughput.

Useful environment variables:

- OLLAMA_HOST=0.0.0.0:11434: expose the API on the network (the default binds to localhost only).
- OLLAMA_MODELS=/path/to/models: change the model storage directory.
- OLLAMA_NUM_PARALLEL=4: allow N parallel requests.
- OLLAMA_MAX_LOADED_MODELS=3: keep multiple models warm in memory.

To raise the context window, pass "num_ctx": 32768 in the API call body, or set it in a Modelfile: PARAMETER num_ctx 32768. Longer context uses more VRAM.

Framework integrations: LangChain via from langchain_ollama import ChatOllama, compatible with common LangChain chat, chain, and agent patterns. LlamaIndex via from llama_index.llms.ollama import Ollama. Continue.dev is a VS Code/JetBrains plugin for AI coding with local models.

The ollama Python package (pip install ollama) wraps the REST API with a pythonic interface and streaming support: import ollama; response = ollama.chat(model='llama3.2', messages=[...]). An async client is available as from ollama import AsyncClient.

Next steps: ollama run llama3.2, install Open WebUI for a full ChatGPT-like interface, then integrate with LangChain for your first local RAG pipeline. Chapter 8 covers Open WebUI in detail, and Chapter 13 covers LangChain and LlamaIndex integration.
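To make the native /api/chat call concrete, here is a minimal Python sketch using only the standard library. It builds the request body with the "stream": false flag and the per-request num_ctx option discussed above, but does not send anything, so no Ollama server needs to be running; the prompt and model name are illustrative:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default endpoint

def build_chat_request(prompt, model="llama3.2", num_ctx=None):
    """Build an urllib Request for POST /api/chat; nothing is sent yet."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return the full response at once instead of streaming
    }
    if num_ctx is not None:
        body["options"] = {"num_ctx": num_ctx}  # per-request context window
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Why is the sky blue?", num_ctx=32768)
print(req.full_url)  # http://localhost:11434/api/chat
# To actually send it (requires a running Ollama server):
#   with urllib.request.urlopen(req) as resp:
#       reply = json.load(resp)["message"]["content"]
```

Because data is set, urllib issues a POST automatically; the same body shape works with the requests library or any HTTP client.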