Ch 7 — Building Local AI Applications

From Ollama API to production apps — Python, JavaScript, RAG, and structured output
Architecture Patterns
Three ways to integrate local models into your applications
Pattern 1: Direct API
Your App → Ollama API → Model

Simplest pattern. Your app calls Ollama's REST API on localhost:11434. Good for: single-user apps, scripts, CLI tools, prototypes.
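Pattern 1 needs no SDK at all. As a sketch using only the standard library, and assuming Ollama's default `/api/chat` endpoint on port 11434, a request can be built like this (the helper name `build_chat_request` is ours, not part of any library):

```python
import json
import urllib.request

def build_chat_request(model, prompt, host="http://localhost:11434"):
    """Build a urllib Request for Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one complete response instead of chunks
    }
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("qwen2.5:7b", "Say hello in five words.")
# With an Ollama server running, send it like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["message"]["content"])
```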
Pattern 2: Framework Integration
Your App → LangChain → Ollama → Model

Use LangChain, LlamaIndex, or similar frameworks. They handle prompt templates, chains, memory, and tool calling. Good for: RAG, agents, complex workflows.
Pattern 3: Embedded
Your App ←→ llama.cpp (C library)

Link llama.cpp directly into your app. No separate server process. Maximum performance, minimum overhead. Good for: desktop apps, games, embedded systems.
Key insight: Start with Pattern 1 (direct API). It’s the simplest and works for most use cases. Graduate to Pattern 2 when you need RAG or complex chains. Pattern 3 is for performance-critical applications where you can’t afford the overhead of an HTTP API.
The One-Line Swap: Cloud → Local
Change the base URL and your OpenAI code works with local models
Before: Cloud (OpenAI)
```python
from openai import OpenAI

client = OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Summarize this document"
    }]
)
print(response.choices[0].message.content)
```
After: Local (Ollama)
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # any string works
)

response = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{
        "role": "user",
        "content": "Summarize this document"
    }]
)
print(response.choices[0].message.content)
```
Key insight: Two lines changed: base_url and model. Everything else is identical. This means any tool, library, or application that supports the OpenAI API can work with local models. LangChain, Vercel AI SDK, AutoGen — all support this pattern.
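One way to manage the swap in a real app is a small settings helper driven by an environment variable. This is a sketch, not a library API: `client_settings` and `USE_LOCAL_LLM` are hypothetical names, and the returned `model` value goes to the `create` call, not the client constructor.

```python
import os

def client_settings():
    """Return connection settings for either a local Ollama server
    or the OpenAI cloud, selected by the USE_LOCAL_LLM env var."""
    if os.environ.get("USE_LOCAL_LLM") == "1":
        return {
            "base_url": "http://localhost:11434/v1",
            "api_key": "ollama",  # Ollama ignores the key's value
            "model": "qwen2.5:7b",
        }
    return {
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
        "model": "gpt-4o-mini",
    }
```

Your deployment then flips between cloud and local with a single environment variable, with no code changes.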
Python: Complete Examples
Practical Python patterns for local AI applications
Document Summarizer
```python
import ollama

def summarize(text, bullets=3):
    response = ollama.chat(
        model='qwen2.5:7b',
        messages=[
            {'role': 'system',
             'content': f'Summarize in {bullets} bullets.'},
            {'role': 'user', 'content': text}
        ],
        options={'temperature': 0.3}
    )
    return response['message']['content']

# Usage
doc = open('report.txt').read()
print(summarize(doc))
```
Streaming Chat Bot
```python
import ollama

history = []
while True:
    user_input = input("You: ")
    if user_input == "quit":
        break
    history.append({'role': 'user', 'content': user_input})
    print("AI: ", end="")
    full = ""
    for chunk in ollama.chat(
        model='qwen2.5:7b',
        messages=history,
        stream=True
    ):
        text = chunk['message']['content']
        print(text, end="", flush=True)
        full += text
    print()
    history.append({'role': 'assistant', 'content': full})
```
Key insight: The Ollama Python client (pip install ollama) is the fastest way to build local AI apps. Streaming support means your app feels responsive even with slower models. The chat history pattern gives you multi-turn conversations with context.
JavaScript: Node.js & Web
Building local AI into web applications and Electron apps
Node.js with Ollama
```javascript
import { Ollama } from 'ollama';

const ollama = new Ollama();

// Simple generation
const response = await ollama.chat({
  model: 'qwen2.5:7b',
  messages: [{ role: 'user', content: 'Explain GGUF in one sentence' }]
});
console.log(response.message.content);

// Streaming
const stream = await ollama.chat({
  model: 'qwen2.5:7b',
  messages: [{ role: 'user', content: 'Tell me about quantization' }],
  stream: true
});
for await (const chunk of stream) {
  process.stdout.write(chunk.message.content);
}
```
Express.js API Server
```javascript
import express from 'express';
import { Ollama } from 'ollama';

const app = express();
const ollama = new Ollama();
app.use(express.json());

app.post('/api/chat', async (req, res) => {
  const { message } = req.body;
  const response = await ollama.chat({
    model: 'qwen2.5:7b',
    messages: [
      { role: 'system', content: 'Be concise and helpful.' },
      { role: 'user', content: message }
    ]
  });
  res.json({ reply: response.message.content });
});

app.listen(3000);
```
Key insight: The Ollama JavaScript client (npm install ollama) works in Node.js. For browser-based apps, your frontend calls your backend, which calls Ollama. The browser never talks to Ollama directly (CORS + security). This is the standard pattern for web apps with local AI.
Local RAG: Ollama + ChromaDB
Retrieval-Augmented Generation with everything running locally
The Local RAG Stack
Your Documents
  ↓ chunk + embed
ChromaDB (local vector store)
  ↓ similarity search
Retrieved Context
  ↓ inject into prompt
Ollama (local LLM)
  ↓ generate answer
Response

Everything runs on your machine. No data leaves your network. Total cost: $0/month.
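The "chunk + embed" step deserves a concrete sketch. A minimal chunker splits by character count with overlap, so context isn't lost at chunk boundaries; real pipelines often split on sentences or tokens instead. The helper name `chunk_text` is ours:

```python
def chunk_text(text, size=500, overlap=50):
    """Split text into overlapping character chunks.
    Each chunk shares `overlap` characters with the previous one,
    so a sentence cut at a boundary still appears whole somewhere."""
    chunks = []
    step = size - overlap
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks
```

The resulting list feeds straight into `collection.add(documents=chunks, ids=...)` in the implementation below on this page.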
Implementation
```python
import chromadb, ollama

# 1. Create vector store
client = chromadb.Client()
collection = client.create_collection("docs")

# 2. Add documents (auto-embeds)
collection.add(
    documents=["Doc 1 text...", "Doc 2 text..."],
    ids=["doc1", "doc2"]
)

# 3. Query → retrieve → generate
query = "What is our refund policy?"
results = collection.query(
    query_texts=[query],
    n_results=3
)
context = "\n".join(results["documents"][0])

response = ollama.chat(
    model='qwen2.5:7b',
    messages=[{
        'role': 'user',
        'content': f"Context:\n{context}\n\nQuestion: {query}"
    }]
)
```
Key insight: Local RAG is the killer app for local AI. Company knowledge bases, personal documents, code repositories — all searchable with AI, all private. ChromaDB runs in-process (no server needed). Combined with Ollama, you have a complete private AI assistant.
Structured Output: JSON Mode
Getting reliable JSON from local models — not just free text
Ollama JSON Mode
```python
import ollama, json

response = ollama.chat(
    model='qwen2.5:7b',
    messages=[{
        'role': 'user',
        'content': '''Extract from this email:
- sender_name
- subject
- urgency (low/medium/high)
- action_required (true/false)

Email: "Hi team, the production server is down.
Need immediate fix. - Sarah"

Return JSON only.'''
    }],
    format='json'
)

data = json.loads(response['message']['content'])
# {"sender_name": "Sarah",
#  "subject": "production server down",
#  "urgency": "high",
#  "action_required": true}
```
Grammar-Constrained Output
For even more reliable structured output, llama.cpp supports GBNF grammars — formal grammar rules that constrain the model’s output to valid JSON, specific schemas, or any defined format. Tokens that would violate the grammar are masked out during sampling, so the model cannot produce invalid output.

Ollama supports this via the format parameter with a JSON schema, ensuring the output always matches your expected structure.
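As a sketch of that schema-based approach, here is a JSON Schema matching the email-extraction example above (field names and enum values taken from that prompt; recent Ollama versions accept such a schema object as the `format` argument):

```python
# JSON Schema mirroring the email-extraction fields above.
# Passing a schema (instead of the bare string 'json') constrains
# the model's output to this exact structure.
email_schema = {
    "type": "object",
    "properties": {
        "sender_name": {"type": "string"},
        "subject": {"type": "string"},
        "urgency": {"enum": ["low", "medium", "high"]},
        "action_required": {"type": "boolean"},
    },
    "required": ["sender_name", "subject", "urgency", "action_required"],
}

# Usage (requires a running Ollama server):
# response = ollama.chat(model='qwen2.5:7b',
#                        messages=[...],
#                        format=email_schema)
```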
Key insight: Structured output is what turns a chatbot into a data pipeline. Extract entities from emails, classify tickets, parse invoices — all locally, all returning clean JSON. Combined with RAG, you can build complete document processing systems that run entirely on your hardware.
Project: Local Document Q&A
A complete working system — drop in documents, ask questions
Architecture
local-qa/
├── ingest.py          # Load & chunk docs
├── query.py           # Ask questions
├── documents/         # Drop files here
└── requirements.txt

Stack:
- Ollama (qwen2.5:7b) — generation
- Ollama (nomic-embed-text) — embeddings
- ChromaDB — vector storage
- LangChain — orchestration

Flow:
1. ingest.py: reads PDFs/TXTs from documents/, chunks them, embeds with nomic-embed-text, stores in ChromaDB
2. query.py: takes a question, finds relevant chunks, sends to qwen2.5 with context, returns answer
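The loading half of ingest.py can be sketched without any framework. This assumed helper (`load_and_chunk` is our name) handles only .txt files; PDF support would need an extra library such as pypdf and is omitted here:

```python
from pathlib import Path

def load_and_chunk(folder, size=800, overlap=100):
    """Read every .txt file in `folder` and return (chunk_id, chunk)
    pairs, ready to pass to a ChromaDB collection's add() call."""
    pairs = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        step = size - overlap
        start, i = 0, 0
        while start < len(text):
            # e.g. "report-0", "report-1", ... for report.txt
            pairs.append((f"{path.stem}-{i}", text[start:start + size]))
            start += step
            i += 1
    return pairs
```

The embed-and-store step then iterates over these pairs, which keeps ingest.py a straight pipeline: load → chunk → embed → store.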
Key Code (query.py)
```python
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_chroma import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text")
db = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)
llm = ChatOllama(model="qwen2.5:7b")

# Retrieve + Generate
docs = db.similarity_search(query, k=4)
context = "\n\n".join(d.page_content for d in docs)
answer = llm.invoke(
    f"Context:\n{context}\n\nQ: {query}"
)
# answer is a message object; answer.content holds the text
```
Key insight: This project is a template you can adapt for any domain: HR policy Q&A, codebase search, research paper assistant, customer support knowledge base. The pattern is always the same: ingest → embed → store → retrieve → generate. All local.
Performance Expectations
What to expect from local models on consumer hardware
Tokens/sec by Hardware
Qwen 2.5 7B Q4_K_M:
- RTX 4090: ~95 tok/s (instant feel)
- RTX 4070: ~65 tok/s (fast)
- M2 Pro 16GB: ~45 tok/s (comfortable)
- M1 8GB: ~25 tok/s (usable)
- i7 CPU-only: ~8 tok/s (slow but works)

Llama 3.2 3B Q4_K_M:
- RTX 4090: ~180 tok/s
- M2 Pro: ~80 tok/s
- M1 8GB: ~50 tok/s
- i7 CPU-only: ~15 tok/s
User Experience Thresholds
>60 tok/s: Feels instant. Text appears as fast as you can read it. Ideal for interactive apps.

30–60 tok/s: Comfortable. Slight delay visible but not annoying. Good for chat interfaces.

10–30 tok/s: Usable. Noticeable generation delay. Fine for batch processing, background tasks.

<10 tok/s: Slow. Only viable for non-interactive use (overnight batch processing, scheduled tasks).
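The thresholds above reduce to a few lines of arithmetic. These helpers (`ux_tier` and `seconds_for` are illustrative names, not part of any library) classify a measured speed and estimate wall-clock latency:

```python
def ux_tier(tok_per_s):
    """Map a measured generation speed (tokens/sec) to the UX tiers above."""
    if tok_per_s > 60:
        return "instant"
    if tok_per_s >= 30:
        return "comfortable"
    if tok_per_s >= 10:
        return "usable"
    return "batch-only"

def seconds_for(tokens, tok_per_s):
    """Wall-clock seconds to generate `tokens` at a given speed."""
    return tokens / tok_per_s
```

For example, a 300-token answer at 30 tok/s takes 10 seconds of generation time, which is why chat interfaces at that speed rely on streaming to feel responsive.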
Key insight: For interactive applications, target 30+ tok/s. This means: 3B model on any modern hardware, 7B model on Apple Silicon or discrete GPU, 14B+ model on RTX 4070 or better. Match your model size to your hardware for the best user experience. Next: Chapter 8 takes this to phones and browsers.