Ch 15 — Fine-Tuning vs. Foundation: Build, Buy, or Customize

The franchise decision of AI — when to use off-the-shelf, when to customize, and when to build from scratch
High Level: API → Prompt → Fine-Tune → Distill → Self-Host → Evaluate
The Spectrum of Customization
Five levels from off-the-shelf to fully custom — and why most organizations over-invest
The Five Levels
AI customization is not a binary choice. It’s a spectrum with five distinct levels, each with different costs, timelines, and capabilities:

Level 1: API + Prompting — Use a foundation model via API with well-crafted prompts. Zero setup cost. Sufficient for 60–70% of enterprise use cases.

Level 2: RAG (Retrieval-Augmented Generation) — Ground the model in your proprietary data without changing the model itself. $0.01–$0.05 per query.

Level 3: Fine-tuning (LoRA/PEFT) — Adjust the model’s behavior on your specific tasks. $500–$2,000 for a 70B model.
Levels (Continued)
Level 4: Continued pre-training — Teach the model an entirely new domain (medical, legal, financial). $10,000+. Requires substantial domain-specific data.

Level 5: Training from scratch — Build a foundation model from the ground up. $10M–$100M+. Only justified for organizations with unique data at massive scale and deep ML expertise.
Key insight: The most common and most expensive mistake in enterprise AI is jumping to Levels 3–5 before exhausting Levels 1–2. The binary “RAG vs. fine-tuning” debate is outdated. Modern enterprises use a layered approach: start with prompting, add RAG for proprietary data, layer prompt caching for cost reduction, and only then consider fine-tuning for the specific tasks where the lower levels fall short.
Level 1: Foundation Model APIs
The default starting point for every enterprise
How It Works
You send requests to a model hosted by a provider (OpenAI, Anthropic, Google) and pay per token. No infrastructure to manage, no ML expertise required. The model is shared across all customers, but your data is processed and returned — not used for training (under enterprise agreements). You customize behavior entirely through system prompts, few-shot examples, and output format constraints (Chapter 16).
Current Pricing
GPT-4o — ~$4–5 per million tokens.
Claude Sonnet — ~$3 per million input tokens.
Gemini Flash — ~$0.075 per million tokens (the cost leader).

For context: 1 million tokens is roughly 750,000 words — about 10 full-length novels. At Gemini Flash pricing, processing 10 novels costs less than a cup of coffee.
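The arithmetic behind these figures can be sketched in a few lines. The per-million rates below are the chapter's approximate numbers, not live pricing; always check the providers' price pages before budgeting.

```python
# Rough per-model cost math for the figures quoted above.
# Rates are $ per million tokens, taken from this chapter's estimates.
PRICE_PER_M = {
    "gpt-4o": 4.50,         # midpoint of the ~$4-5 range
    "claude-sonnet": 3.00,  # input tokens
    "gemini-flash": 0.075,  # the cost leader
}

def cost_usd(tokens: int, model: str) -> float:
    """Cost of processing `tokens` tokens at the model's per-million rate."""
    return tokens / 1_000_000 * PRICE_PER_M[model]

# 1M tokens ~= 750,000 words ~= 10 novels of ~75k words each.
print(f"10 novels on Gemini Flash: ${cost_usd(1_000_000, 'gemini-flash'):.3f}")
# Well under the price of a cup of coffee.
```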
When APIs Are Sufficient
General-purpose tasks — Summarization, translation, drafting, Q&A, analysis.
Variable workloads — You pay only for what you use. No idle infrastructure.
Rapid experimentation — Switch between models in minutes, not months.
Multi-model strategies — Route different tasks to different models based on complexity and cost.
Key insight: At low-to-medium volume (<10M tokens/day), APIs are almost always the right choice. A team processing 2M tokens/day pays roughly $620/month via API. The same workload on self-hosted infrastructure costs $41,000/month. APIs win decisively until you hit enterprise scale — and even then, the operational simplicity often justifies the premium.
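A back-of-envelope version of that comparison, with the blended API rate as an assumption (roughly $10 per million tokens, chosen to reproduce the ~$620/month figure above) and the self-hosted number taken from the chapter:

```python
# Back-of-envelope API-vs-self-host comparison for a 2M tokens/day team.
API_RATE_PER_M = 10.0         # assumed blended $/M tokens (illustrative)
SELF_HOSTED_FIXED = 41_000.0  # $/month, fixed regardless of volume

def api_monthly_cost(tokens_per_day: float, days: int = 31) -> float:
    """Variable API spend for a month of steady daily volume."""
    return tokens_per_day * days / 1_000_000 * API_RATE_PER_M

daily = 2_000_000
print(f"API:         ${api_monthly_cost(daily):,.0f}/month")
print(f"Self-hosted: ${SELF_HOSTED_FIXED:,.0f}/month (fixed)")
```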
Level 3: Fine-Tuning — Teaching the Model Your Business
When and how to customize a foundation model for your specific needs
What Fine-Tuning Actually Does
Fine-tuning takes a pre-trained foundation model and adjusts its weights using your specific data. The model retains its broad knowledge but develops specialized expertise in your domain, terminology, and output style. Modern fine-tuning uses LoRA (Low-Rank Adaptation) — a technique that modifies only a small fraction of the model’s parameters (typically 0.1–1%), making it dramatically cheaper and faster than full fine-tuning.
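The 0.1–1% figure falls out of LoRA's construction: instead of updating a full weight matrix, it learns a low-rank update factored into two thin matrices. A minimal sketch of that parameter count, using illustrative layer dimensions:

```python
# Why LoRA trains only ~0.1-1% of the weights: for a frozen
# d_out x d_in matrix W, LoRA learns a rank-r update B @ A,
# where A is r x d_in and B is d_out x r.
def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    full = d_in * d_out              # parameters in the frozen matrix
    adapter = rank * (d_in + d_out)  # parameters in A and B combined
    return adapter / full

# A 4096x4096 projection (typical of ~7B models) with rank 8:
frac = lora_fraction(4096, 4096, rank=8)
print(f"trainable fraction: {frac:.2%}")  # ~0.39%, inside the 0.1-1% range
```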
Cost Architecture
LoRA fine-tuning (7B model) — $50–$500 one-time.
LoRA fine-tuning (70B model) — $500–$2,000 one-time.
Full fine-tuning — $5,000–$50,000.
Enterprise-scale fine-tuning — $12,000–$180,000.

These costs are amortized across millions of inference calls. The per-request cost drops because fine-tuned models need 50–90% shorter prompts — the knowledge is baked into the model, not repeated in every prompt.
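The amortization math can be made concrete. Prompt sizes and the input rate below are assumptions for illustration; the 80% reduction sits inside the 50–90% range above.

```python
# How shorter prompts pay back a fine-tuning run. All numbers
# illustrative: a Sonnet-class input rate and an 80% prompt cut.
RATE_PER_M = 3.0  # assumed $/M input tokens

def per_request(prompt_tokens: int) -> float:
    return prompt_tokens / 1_000_000 * RATE_PER_M

base = per_request(4_000)   # long prompt carrying instructions and examples
tuned = per_request(800)    # 80% shorter: knowledge baked into the weights
saved_per_req = base - tuned

# Requests needed to recoup a $2,000 LoRA run on a 70B model:
breakeven = 2_000 / saved_per_req
print(f"break-even after ~{breakeven:,.0f} requests")
```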
What Fine-Tuning Excels At
Output consistency — 98–99.5% format adherence vs. 85–95% with prompting alone.
Domain terminology — Medical, legal, or technical vocabulary used correctly and naturally.
Behavioral patterns — Specific tone, decision-making style, or response structure.
Cost reduction at scale — Processing 10,000 documents daily: ~$50K/year via API vs. ~$5K with fine-tuned on-premises model. A 90% cost reduction.
Key insight: Fine-tuning is not about making the model “smarter.” It’s about making it more consistent, more efficient, and more aligned with your specific needs. A fine-tuned small model (7B parameters) outperforms zero-shot GPT-4 on approximately 80% of classification tasks — not because it’s more intelligent, but because it’s more focused.
Knowledge Distillation: The Best of Both Worlds
Using a large model to train a small, fast, cheap one
How Distillation Works
Knowledge distillation uses a large, expensive model (the “teacher”) to generate training data for a small, cheap model (the “student”). You run your use case through GPT-4 or Claude to produce thousands of high-quality examples, then fine-tune a small open-source model (like Phi-3.5 or LLaMA 8B) on those examples. The student model learns to mimic the teacher’s behavior on your specific task at a fraction of the cost.
The Economics
A distilled 8B model typically achieves 90–95% of the quality of the 70B teacher at a 25× reduction in inference cost. For high-volume production tasks where the quality bar is “good enough, consistently,” this is transformative. A task that costs $10 per 1M tokens via GPT-4 Turbo costs $0.40 with a distilled model on modest hardware.
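Checking those numbers directly, using the chapter's own rates and an illustrative daily volume:

```python
# Distillation economics: teacher vs. student inference rates
# ($/M tokens, the figures quoted above).
TEACHER_PER_M = 10.00  # GPT-4 Turbo-class API
STUDENT_PER_M = 0.40   # distilled 8B model on modest hardware

reduction = TEACHER_PER_M / STUDENT_PER_M
print(f"inference cost reduction: {reduction:.0f}x")  # 25x

# Monthly spend at 5M tokens/day (illustrative volume):
tokens_month = 5_000_000 * 30
print(f"teacher: ${tokens_month / 1e6 * TEACHER_PER_M:,.0f}/month")
print(f"student: ${tokens_month / 1e6 * STUDENT_PER_M:,.0f}/month")
```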
When to Use Distillation
High-volume, well-defined tasks — Classification, extraction, formatting, routing. Tasks where the output space is bounded and predictable.

Latency-sensitive applications — Small models run in 50–80ms vs. 400ms+ for cloud APIs. For real-time applications (chatbots, search, fraud detection), this latency difference is significant.

Data-sensitive environments — The distilled model runs on your infrastructure. No data leaves your network.
Key insight: Distillation is the enterprise sweet spot for organizations that have validated a use case with a large model and now need to scale it economically. It combines the intelligence of frontier models with the cost and control of small models. Think of it as using a master chef to write a recipe book that any competent cook can follow.
The Risks of Fine-Tuning
What can go wrong and why many fine-tuning projects fail
Distribution Collapse
Fine-tuned models excel within their training distribution but experience complete accuracy collapse outside it. Air Canada’s chatbot invented refund policies that didn’t exist. Amazon’s Rufus product assistant matched only 32% of products accurately when queries deviated from training patterns. The model becomes a specialist that is confidently wrong when asked anything outside its narrow expertise.
Data Quality Requirements
Fine-tuning amplifies whatever is in your training data — including errors, biases, and inconsistencies. You need at minimum 1,000 high-quality examples, and ideally 5,000–10,000 for production-grade results. Curating this data is often the most expensive and time-consuming part of the process, yet it’s frequently underestimated in project planning.
Ongoing Maintenance
Model drift — As your business changes, the fine-tuned model becomes stale. You need a retraining pipeline, not a one-time project.
Evaluation infrastructure — You need automated tests to detect regression when you retrain.
Version management — Multiple model versions in production, with rollback capability.
Talent requirements — Fine-tuning requires ML engineering expertise that commands $200K–$400K salaries.
Critical for leaders: The upfront cost of fine-tuning ($500–$50,000) is misleadingly small. The total cost of ownership includes data curation, evaluation infrastructure, retraining pipelines, ML engineering talent, and ongoing monitoring. Many organizations discover that the “cheap” fine-tuning project costs 5–10× the training cost in surrounding infrastructure and talent.
Self-Hosting: Open-Source Models on Your Infrastructure
When data sovereignty and control outweigh convenience
The Open-Source Option
Models like Meta’s LLaMA, Mistral, and DeepSeek can be downloaded and run on your own infrastructure. No data leaves your network. No per-token fees. No dependency on a third-party provider. You have full control over the model, its behavior, and its availability. The trade-off: you own the infrastructure, the operations, and the expertise required to run it.
Infrastructure Costs
Entry-level (7B model) — A single A10G GPU (~$1,500/month cloud). Handles moderate traffic.
Production (70B model) — Multiple A100 GPUs (~$33,000/month cloud). Enterprise-scale throughput.
High-performance cluster — 8× A100 instance (~$32,770/month). Required for large models or high concurrency.

These costs are fixed regardless of usage, making self-hosting economical only at high volume (10M+ tokens/day).
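The break-even volume follows from equating fixed hardware cost with variable API spend. A sketch, assuming a GPT-4o-class API rate of ~$5/M tokens and the chapter's quoted cloud hardware costs:

```python
# Break-even sketch for the fixed-vs-variable trade-off above.
API_RATE_PER_M = 5.0  # assumed API rate, $/M tokens

def breakeven_tokens_per_day(fixed_monthly: float, days: int = 30) -> float:
    """Daily volume at which API spend equals the fixed hardware cost."""
    return fixed_monthly / API_RATE_PER_M * 1_000_000 / days

for name, monthly in [("A10G (7B)", 1_500), ("A100 cluster (70B)", 33_000)]:
    daily = breakeven_tokens_per_day(monthly)
    print(f"{name}: break-even ~{daily / 1e6:.0f}M tokens/day")
# The entry-level case lands at ~10M tokens/day, matching the
# break-even point cited in this chapter.
```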
When Self-Hosting Makes Sense
Regulated industries — Healthcare, defense, financial services where data cannot leave your infrastructure under any circumstances.
High-volume production — At 50M tokens/day, self-hosted costs $2.20/M tokens vs. $10.00 for GPT-4 Turbo API.
Latency requirements — On-premises models eliminate network round-trips (50–80ms vs. 400ms+).
Customization needs — Full control over fine-tuning, quantization, and serving configuration.
Key insight: Self-hosting is not “free.” You’re trading API costs for infrastructure costs, operations costs, and talent costs. The break-even point is approximately 10M tokens/day. Below that, APIs are cheaper. Above that, self-hosting saves $3,000–$6,000/month — but only if you have the team to manage it. Most organizations underestimate the operational burden.
The Decision Framework
A systematic approach to choosing the right level of customization
The Strategic Workflow
Step 1: Start with API + prompting. Test your use case with a foundation model and well-crafted prompts. If quality meets your bar, stop here. This is the right answer for most use cases.

Step 2: Add RAG for proprietary data. If the model needs access to your documents, knowledge base, or real-time data, add retrieval (Chapter 17). This keeps the model current without retraining.

Step 3: Layer prompt caching. For static system prompts and policies, caching reduces costs by ~90% after the first call.
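The caching effect in Step 3 is easy to model. The ~10% cached-read rate below is an assumption modeled on typical provider caching tiers; actual discounts vary by provider.

```python
# Why caching a static system prompt slashes input cost: cached
# tokens are re-read at a deep discount (assumed ~10% of base rate).
BASE_RATE = 3.0                   # assumed $/M input tokens
CACHED_RATE = BASE_RATE * 0.10    # assumed cached-read rate

def request_cost(static_tokens: int, dynamic_tokens: int, cached: bool) -> float:
    static_rate = CACHED_RATE if cached else BASE_RATE
    return (static_tokens * static_rate + dynamic_tokens * BASE_RATE) / 1_000_000

# A 10k-token policy/system prompt plus a 500-token user query:
cold = request_cost(10_000, 500, cached=False)
warm = request_cost(10_000, 500, cached=True)
print(f"savings on calls after the first: {1 - warm / cold:.0%}")
# The static portion itself is 90% cheaper; overall savings depend
# on how much of each request the cached prefix covers.
```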
Workflow (Continued)
Step 4: Fine-tune only if Steps 1–3 fall short. Fine-tune when you need >95% output consistency, domain-specific behavior at scale, or significant cost reduction on high-volume tasks.

Step 5: Distill for production scale. Once a fine-tuned large model validates the use case, distill to a smaller model for cost-effective deployment.

Step 6: Self-host only with clear justification. Data sovereignty requirements, extreme volume (>10M tokens/day), or latency constraints that APIs cannot meet.
Key insight: Each step should be justified by the failure of the previous step. If prompting works, don’t fine-tune. If RAG works, don’t pre-train. If cloud APIs meet your needs, don’t self-host. The most successful enterprise AI teams are those that resist the urge to over-engineer and choose the simplest approach that meets their requirements.
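The six-step workflow can be expressed as an escalation check, where each level is added only when the situation demands it. The field names below are illustrative, not a prescribed schema:

```python
# The escalation logic above as a sketch: simplest approach first,
# each later level justified by a specific failure or requirement.
from dataclasses import dataclass

@dataclass
class UseCase:
    prompting_meets_quality_bar: bool
    needs_proprietary_data: bool
    needs_high_consistency: bool      # >95% format adherence required
    validated_and_high_volume: bool   # proven use case, ready to scale
    requires_data_sovereignty: bool
    tokens_per_day: int

def recommend(uc: UseCase) -> list[str]:
    plan = ["API + prompting"]                 # Step 1: always start here
    if uc.needs_proprietary_data:
        plan.append("RAG")                     # Step 2
    plan.append("prompt caching")              # Step 3: near-free savings
    if uc.prompting_meets_quality_bar and not uc.needs_high_consistency:
        return plan                            # stop: simplest approach wins
    plan.append("fine-tune (LoRA)")            # Step 4
    if uc.validated_and_high_volume:
        plan.append("distill to small model")  # Step 5
    if uc.requires_data_sovereignty or uc.tokens_per_day > 10_000_000:
        plan.append("self-host")               # Step 6
    return plan

print(recommend(UseCase(True, True, False, False, False, 500_000)))
# -> ['API + prompting', 'RAG', 'prompt caching']
```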
The Executive Decision Matrix
Matching your situation to the right approach
Quick Decision Guide
Volume <100K requests/month? → API + prompting. Don’t over-invest.

Need proprietary data access? → Add RAG. Keep the model general, make the data specific.

Output consistency <95%? → Consider fine-tuning. The 85% → 99% consistency jump justifies the investment for production systems.

Volume >100K requests/month on a validated task? → Fine-tune + distill. The cost savings compound at scale.
Guide (Continued)
Data cannot leave your network? → Self-host open-source models. Accept the operational overhead.

Unique domain with no public training data? → Continued pre-training. Rare but necessary for specialized fields.

Tasks change frequently? → Stay with APIs. Fine-tuned models are rigid; prompts are flexible. Don’t fine-tune a moving target.
The bottom line: The right level of AI customization is the minimum level that meets your requirements. Foundation model APIs are getting better and cheaper every quarter. What required fine-tuning last year may be achievable with prompting today. Before committing to customization, always ask: “Will next quarter’s foundation model make this unnecessary?” If the answer is “probably,” wait.