Ch 7 — Building Local AI Applications

From Ollama API to production apps — Python, JavaScript, RAG, and structured output
Architecture Patterns
Three ways to integrate local models into your applications
Pattern 1: Direct API
Your App → Ollama API → Model

Simplest pattern. Your app calls Ollama's REST API on localhost:11434. Good for: single-user apps, scripts, CLI tools, prototypes.
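Pattern 1 needs no SDK at all. As a sketch using only the standard library, and assuming Ollama's default `/api/chat` endpoint on port 11434, a request can be built like this (the helper name `build_chat_request` is ours, not part of any library):

```python
import json
import urllib.request

def build_chat_request(model, prompt, host="http://localhost:11434"):
    """Build a urllib Request for Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one complete response instead of chunks
    }
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("qwen2.5:7b", "Say hello in five words.")
# With an Ollama server running, send it like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["message"]["content"])
```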
Pattern 2: Framework Integration
Your App → LangChain → Ollama → Model

Use LangChain, LlamaIndex, or similar frameworks. They handle prompt templates, chains, memory, and tool calling. Good for: RAG, agents, complex workflows.
Pattern 3: Embedded
Your App ←→ llama.cpp (C library)

Link llama.cpp directly into your app. No separate server process. Maximum performance, minimum overhead. Good for: desktop apps, games, embedded systems.
Key insight: Start with Pattern 1 (direct API). It’s the simplest and works for most use cases. Graduate to Pattern 2 when you need RAG or complex chains. Pattern 3 is for performance-critical applications where you can’t afford the overhead of an HTTP API.
The One-Line Swap: Cloud → Local
Change the base URL and your OpenAI code works with local models
Before: Cloud (OpenAI)
```python
from openai import OpenAI

client = OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Summarize this document"
    }]
)
print(response.choices[0].message.content)
```
After: Local (Ollama)
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # any string works
)

response = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{
        "role": "user",
        "content": "Summarize this document"
    }]
)
print(response.choices[0].message.content)
```
Key insight: Two lines changed: base_url and model. Everything else is identical. This means any tool, library, or application that supports the OpenAI API can work with local models. LangChain, Vercel AI SDK, AutoGen — all support this pattern.
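One way to manage the swap in a real app is a small settings helper driven by an environment variable. This is a sketch, not a library API: `client_settings` and `USE_LOCAL_LLM` are hypothetical names, and the returned `model` value goes to the `create` call, not the client constructor.

```python
import os

def client_settings():
    """Return connection settings for either a local Ollama server
    or the OpenAI cloud, selected by the USE_LOCAL_LLM env var."""
    if os.environ.get("USE_LOCAL_LLM") == "1":
        return {
            "base_url": "http://localhost:11434/v1",
            "api_key": "ollama",  # Ollama ignores the key's value
            "model": "qwen2.5:7b",
        }
    return {
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
        "model": "gpt-4o-mini",
    }
```

Your deployment then flips between cloud and local with a single environment variable, with no code changes.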
Python: Complete Examples
Practical Python patterns for local AI applications
Document Summarizer
```python
import ollama

def summarize(text, bullets=3):
    response = ollama.chat(
        model='qwen2.5:7b',
        messages=[
            {'role': 'system',
             'content': f'Summarize in {bullets} bullets.'},
            {'role': 'user', 'content': text}
        ],
        options={'temperature': 0.3}
    )
    return response['message']['content']

# Usage
doc = open('report.txt').read()
print(summarize(doc))
```
Streaming Chat Bot
```python
import ollama

history = []
while True:
    user_input = input("You: ")
    if user_input == "quit":
        break
    history.append({'role': 'user', 'content': user_input})
    print("AI: ", end="")
    full = ""
    for chunk in ollama.chat(
        model='qwen2.5:7b',
        messages=history,
        stream=True
    ):
        text = chunk['message']['content']
        print(text, end="", flush=True)
        full += text
    print()
    history.append({'role': 'assistant', 'content': full})
```
Key insight: The Ollama Python client (pip install ollama) is the fastest way to build local AI apps. Streaming support means your app feels responsive even with slower models. The chat history pattern gives you multi-turn conversations with context.
JavaScript: Node.js & Web
Building local AI into web applications and Electron apps
Node.js with Ollama
```javascript
import { Ollama } from 'ollama';

const ollama = new Ollama();

// Simple generation
const response = await ollama.chat({
  model: 'qwen2.5:7b',
  messages: [{ role: 'user', content: 'Explain GGUF in one sentence' }]
});
console.log(response.message.content);

// Streaming
const stream = await ollama.chat({
  model: 'qwen2.5:7b',
  messages: [{ role: 'user', content: 'Tell me about quantization' }],
  stream: true
});
for await (const chunk of stream) {
  process.stdout.write(chunk.message.content);
}
```
Express.js API Server
```javascript
import express from 'express';
import { Ollama } from 'ollama';

const app = express();
const ollama = new Ollama();
app.use(express.json());

app.post('/api/chat', async (req, res) => {
  const { message } = req.body;
  const response = await ollama.chat({
    model: 'qwen2.5:7b',
    messages: [
      { role: 'system', content: 'Be concise and helpful.' },
      { role: 'user', content: message }
    ]
  });
  res.json({ reply: response.message.content });
});

app.listen(3000);
```
Key insight: The Ollama JavaScript client (npm install ollama) works in Node.js. For browser-based apps, your frontend calls your backend, which calls Ollama. The browser never talks to Ollama directly (CORS + security). This is the standard pattern for web apps with local AI.
Local RAG: Ollama + ChromaDB
Retrieval-Augmented Generation with everything running locally
The Local RAG Stack
Your Documents
  ↓ chunk + embed
ChromaDB (local vector store)
  ↓ similarity search
Retrieved Context
  ↓ inject into prompt
Ollama (local LLM)
  ↓ generate answer
Response

Everything runs on your machine. No data leaves your network. Total cost: $0/month.
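The "chunk + embed" step deserves a concrete sketch. A minimal chunker splits by character count with overlap, so context isn't lost at chunk boundaries; real pipelines often split on sentences or tokens instead. The helper name `chunk_text` is ours:

```python
def chunk_text(text, size=500, overlap=50):
    """Split text into overlapping character chunks.
    Each chunk shares `overlap` characters with the previous one,
    so a sentence cut at a boundary still appears whole somewhere."""
    chunks = []
    step = size - overlap
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks
```

The resulting list feeds straight into `collection.add(documents=chunks, ids=...)` in the implementation below on this page.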
Implementation
```python
import chromadb, ollama

# 1. Create vector store
client = chromadb.Client()
collection = client.create_collection("docs")

# 2. Add documents (auto-embeds)
collection.add(
    documents=["Doc 1 text...", "Doc 2 text..."],
    ids=["doc1", "doc2"]
)

# 3. Query → retrieve → generate
query = "What is our refund policy?"
results = collection.query(
    query_texts=[query],
    n_results=3
)
context = "\n".join(results["documents"][0])

response = ollama.chat(
    model='qwen2.5:7b',
    messages=[{
        'role': 'user',
        'content': f"Context:\n{context}\n\nQuestion: {query}"
    }]
)
```
Key insight: Local RAG is the killer app for local AI. Company knowledge bases, personal documents, code repositories — all searchable with AI, all private. ChromaDB runs in-process (no server needed). Combined with Ollama, you have a complete private AI assistant.
Structured Output: JSON Mode
Getting reliable JSON from local models — not just free text
Ollama JSON Mode
```python
import ollama, json

response = ollama.chat(
    model='qwen2.5:7b',
    messages=[{
        'role': 'user',
        'content': '''Extract from this email:
- sender_name
- subject
- urgency (low/medium/high)
- action_required (true/false)

Email: "Hi team, the production server is down.
Need immediate fix. - Sarah"

Return JSON only.'''
    }],
    format='json'
)

data = json.loads(response['message']['content'])
# {"sender_name": "Sarah",
#  "subject": "production server down",
#  "urgency": "high",
#  "action_required": true}
```
Grammar-Constrained Output
For even more reliable structured output, llama.cpp supports GBNF grammars — formal grammar rules that constrain the model’s output to valid JSON, specific schemas, or any defined format. Tokens that would violate the grammar are masked out during sampling, so the model cannot produce invalid output.

Ollama supports this via the format parameter with a JSON schema, ensuring the output always matches your expected structure.
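As a sketch of that schema-based approach, here is a JSON Schema matching the email-extraction example above (field names and enum values taken from that prompt; recent Ollama versions accept such a schema object as the `format` argument):

```python
# JSON Schema mirroring the email-extraction fields above.
# Passing a schema (instead of the bare string 'json') constrains
# the model's output to this exact structure.
email_schema = {
    "type": "object",
    "properties": {
        "sender_name": {"type": "string"},
        "subject": {"type": "string"},
        "urgency": {"enum": ["low", "medium", "high"]},
        "action_required": {"type": "boolean"},
    },
    "required": ["sender_name", "subject", "urgency", "action_required"],
}

# Usage (requires a running Ollama server):
# response = ollama.chat(model='qwen2.5:7b',
#                        messages=[...],
#                        format=email_schema)
```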
Key insight: Structured output is what turns a chatbot into a data pipeline. Extract entities from emails, classify tickets, parse invoices — all locally, all returning clean JSON. Combined with RAG, you can build complete document processing systems that run entirely on your hardware.
Project: Local Document Q&A
A complete working system — drop in documents, ask questions
Architecture
local-qa/
├── ingest.py          # Load & chunk docs
├── query.py           # Ask questions
├── documents/         # Drop files here
└── requirements.txt

Stack:
- Ollama (qwen2.5:7b) — generation
- Ollama (nomic-embed-text) — embeddings
- ChromaDB — vector storage
- LangChain — orchestration

Flow:
1. ingest.py: reads PDFs/TXTs from documents/, chunks them, embeds with nomic-embed-text, stores in ChromaDB
2. query.py: takes a question, finds relevant chunks, sends to qwen2.5 with context, returns answer
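The loading half of ingest.py can be sketched without any framework. This assumed helper (`load_and_chunk` is our name) handles only .txt files; PDF support would need an extra library such as pypdf and is omitted here:

```python
from pathlib import Path

def load_and_chunk(folder, size=800, overlap=100):
    """Read every .txt file in `folder` and return (chunk_id, chunk)
    pairs, ready to pass to a ChromaDB collection's add() call."""
    pairs = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        step = size - overlap
        start, i = 0, 0
        while start < len(text):
            # e.g. "report-0", "report-1", ... for report.txt
            pairs.append((f"{path.stem}-{i}", text[start:start + size]))
            start += step
            i += 1
    return pairs
```

The embed-and-store step then iterates over these pairs, which keeps ingest.py a straight pipeline: load → chunk → embed → store.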
Key Code (query.py)
```python
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_chroma import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text")
db = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)
llm = ChatOllama(model="qwen2.5:7b")

# Retrieve + Generate
docs = db.similarity_search(query, k=4)
context = "\n\n".join(d.page_content for d in docs)
answer = llm.invoke(
    f"Context:\n{context}\n\nQ: {query}"
)
# answer is a message object; answer.content holds the text
```
Key insight: This project is a template you can adapt for any domain: HR policy Q&A, codebase search, research paper assistant, customer support knowledge base. The pattern is always the same: ingest → embed → store → retrieve → generate. All local.
Performance Expectations
What to expect from local models on consumer hardware
Tokens/sec by Hardware
Qwen 2.5 7B Q4_K_M:
- RTX 4090: ~95 tok/s (instant feel)
- RTX 4070: ~65 tok/s (fast)
- M2 Pro 16GB: ~45 tok/s (comfortable)
- M1 8GB: ~25 tok/s (usable)
- i7 CPU-only: ~8 tok/s (slow but works)

Llama 3.2 3B Q4_K_M:
- RTX 4090: ~180 tok/s
- M2 Pro: ~80 tok/s
- M1 8GB: ~50 tok/s
- i7 CPU-only: ~15 tok/s
User Experience Thresholds
>60 tok/s: Feels instant. Text appears as fast as you can read it. Ideal for interactive apps.

30–60 tok/s: Comfortable. Slight delay visible but not annoying. Good for chat interfaces.

10–30 tok/s: Usable. Noticeable generation delay. Fine for batch processing, background tasks.

<10 tok/s: Slow. Only viable for non-interactive use (overnight batch processing, scheduled tasks).
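The thresholds above reduce to a few lines of arithmetic. These helpers (`ux_tier` and `seconds_for` are illustrative names, not part of any library) classify a measured speed and estimate wall-clock latency:

```python
def ux_tier(tok_per_s):
    """Map a measured generation speed (tokens/sec) to the UX tiers above."""
    if tok_per_s > 60:
        return "instant"
    if tok_per_s >= 30:
        return "comfortable"
    if tok_per_s >= 10:
        return "usable"
    return "batch-only"

def seconds_for(tokens, tok_per_s):
    """Wall-clock seconds to generate `tokens` at a given speed."""
    return tokens / tok_per_s
```

For example, a 300-token answer at 30 tok/s takes 10 seconds of generation time, which is why chat interfaces at that speed rely on streaming to feel responsive.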
Key insight: For interactive applications, target 30+ tok/s. This means: 3B model on any modern hardware, 7B model on Apple Silicon or discrete GPU, 14B+ model on RTX 4070 or better. Match your model size to your hardware for the best user experience. Next: Chapter 8 takes this to phones and browsers.