Ch 5 — The GPU & Infrastructure Layer

The real estate analogy — renting, buying, or building
The Real Estate Analogy
Renting an apartment vs buying a house vs building from scratch
Three Options
Renting (API) — Pay per use, no maintenance, move anytime. Like renting an apartment: you pay monthly, the landlord handles plumbing and repairs.
Buying (Self-hosting) — Purchase GPUs, run your own models. Like buying a house: big upfront cost, you handle maintenance, but no monthly rent.
Building (Custom infra) — Design custom hardware and software stacks. Like building a house from the ground up: maximum control, maximum complexity.
The Decision Framework
Most teams should rent (use APIs). APIs win for 87% of use cases because they eliminate infrastructure complexity, scale instantly, and require zero GPU expertise. Self-hosting only makes sense at extreme volume (>10B tokens/month) or with strict regulatory requirements (HIPAA, data sovereignty). Building custom infrastructure is reserved for hyperscalers and AI labs.
Key insight: The most common mistake is self-hosting too early. Teams underestimate the hidden costs (operations, monitoring, security, model updates) by 3–5x. Start with APIs, measure your actual volume, and only self-host when the math clearly justifies it.
NVIDIA’s GPU Stack
H100, H200, B200, and the Vera Rubin roadmap
The GPU Lineup (2024–2026)
// NVIDIA datacenter GPU stack
H100 (2023)         80GB HBM3        Workhorse
H200 (2024)         141GB HBM3e      Memory upgrade
B200 (2025)         192GB HBM3e      Blackwell arch
GB200 (2025)        384GB combined   Superchip
Vera Rubin (2026+)                   Next gen
Why Memory Matters
GPU memory (HBM) determines how large a model can fit on a single chip. A 70B parameter model needs ~140GB in FP16, requiring 2 H100s but only 1 H200. Fewer GPUs = lower cost, simpler infrastructure, and better utilization. The memory race is as important as the compute race.
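The memory arithmetic above can be sketched as a small helper. This is an illustrative estimate of weight memory only (real deployments also need room for KV cache and activations, which the `overhead` parameter hints at); the function name and defaults are assumptions, not a real library API.

```python
import math

def gpus_needed(params_b: float, bytes_per_param: int, hbm_gb: int,
                overhead: float = 1.0) -> int:
    """Minimum GPUs to hold model weights: params * bytes, divided by HBM per GPU.

    overhead > 1.0 can approximate KV cache / activation headroom; the default
    of 1.0 counts weights alone, matching the back-of-envelope in the text.
    """
    weights_gb = params_b * bytes_per_param  # 1B params at 2 bytes (FP16) = 2 GB
    return math.ceil(weights_gb * overhead / hbm_gb)

# 70B parameters in FP16 (2 bytes/param) ≈ 140 GB of weights
print(gpus_needed(70, 2, 80))   # H100 (80 GB)  → 2
print(gpus_needed(70, 2, 141))  # H200 (141 GB) → 1
```

This is why the H200's memory bump matters: halving the GPU count simplifies the whole serving stack, not just the bill.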
Performance Leaps
Each generation delivers 2–3x inference throughput over the previous. B200 processes roughly 2.5x more tokens per second than H100 for the same model. This is a key driver of the cost collapse from Chapter 1 — newer hardware makes each token cheaper to serve, which providers pass on (partially) as lower API prices.
Key insight: You don’t need to understand GPU architecture to use AI effectively. But understanding that each generation makes inference 2–3x cheaper explains why API prices keep falling and why locking into long-term GPU contracts can be risky.
Manufacturing Economics
What GPUs actually cost to make vs what they sell for
Cost vs Price
// Manufacturing cost vs selling price
H100                Manufacturing: $3,320    Selling price: $28,000          Gross margin: 88%
H200                Manufacturing: $4,250    Selling price: $38,000          Gross margin: 89%
B200                Manufacturing: $6,400    Selling price: $40,000–50,000   Gross margin: 84%
GB200 (superchip)   Manufacturing: $13,500   Selling price: $65,000          Gross margin: 79%
The Margin Story
NVIDIA operates at 79–89% gross margins on datacenter GPUs. A B200 costs ~$6,400 to manufacture and sells for $40,000–$50,000. The biggest cost driver is HBM memory, which accounts for 35–47% of manufacturing cost ($2,900 for B200’s HBM3e). Advanced packaging (CoWoS-L) adds another $1,100.
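The margin figures above follow directly from the cost and price columns. A quick sanity check (using the chapter's own numbers; the standard gross-margin formula, nothing vendor-specific):

```python
def gross_margin(price: float, cost: float) -> float:
    """Gross margin as a fraction of selling price: (price - cost) / price."""
    return (price - cost) / price

# Figures from the table above
print(f"H100:  {gross_margin(28_000, 3_320):.0%}")   # 88%
print(f"B200 (at $40K): {gross_margin(40_000, 6_400):.0%}")  # 84%
print(f"GB200: {gross_margin(65_000, 13_500):.0%}")  # 79%
```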
Key insight: NVIDIA’s margins explain why GPU cloud pricing has room to fall. Specialized cloud providers (Northflank, RunPod) can offer B200s at $5–9/hour vs hyperscalers at $14–19/hour by accepting lower margins and optimizing utilization.
Cloud GPU Pricing
Hyperscalers vs specialized clouds — the 3x gap
B200 Hourly Rates (March 2026)
// B200 cloud GPU pricing by provider
Northflank   $5.87/hr    (specialized)
RunPod       $8.64/hr    (specialized)
AWS          $14.24/hr   (hyperscaler)
GCP          $18.53/hr   (hyperscaler)

// Monthly cost (24/7 usage)
Northflank   $4,226/month
GCP          $13,342/month
// 3.2x difference for identical hardware
Hidden Cloud Costs
The hourly GPU rate is just the start. Egress fees (data transfer out) can add 20–30% for high-throughput inference; storage premiums apply to model weights and data; virtualization overhead reduces effective GPU performance by 5–15%; and multi-GPU setups incur networking charges. Together these hidden costs typically add 20–40% on top of the base GPU rate.
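A rough monthly estimate can fold those hidden costs into a single overhead factor. This is a sketch under stated assumptions: a 720-hour month (matching the table above) and a 30% overhead midpoint from the 20–40% range; the function name and defaults are illustrative.

```python
def monthly_cost(hourly_rate: float, hidden_overhead: float = 0.30,
                 hours: float = 720) -> float:
    """Base GPU rent plus an assumed hidden-cost factor (egress, storage,
    virtualization loss, networking). 720 hours ≈ one month of 24/7 usage."""
    return hourly_rate * hours * (1 + hidden_overhead)

print(f"Northflank B200: ${monthly_cost(5.87):,.0f}/month")   # ≈ $5,494 all-in
print(f"GCP B200:        ${monthly_cost(18.53):,.0f}/month")  # ≈ $17,344 all-in
```

Note the all-in numbers stay in the same ~3.2x ratio as the base rates: overhead multiplies both sides, so the specialized-vs-hyperscaler gap survives it.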
Key insight: GPU utilization is the critical metric. A GPU at 10% load costs 10x per token vs one at 100% load. Most self-hosted deployments run at 15–30% utilization, meaning you’re paying for 3–7x more GPU than you actually use.
API vs Self-Hosting: The Break-Even
When does owning your own GPUs make sense?
The Comparison
API (Renting)
GPT-5 API: ~$168/month for moderate usage. Zero infrastructure. Instant scaling. Automatic model updates. No GPU expertise needed. Pay only for what you use.
Self-Hosted (Buying)
4x A100 cluster: ~$5,760/month cloud rental. Plus 0.25–1.0 FTE for operations ($37K–$150K/year). Model serving software. Monitoring. Security. Updates. Total: $8,800–18,260/month.
The Crossover Point
APIs win until you hit roughly 10 billion tokens/month (~330M tokens/day). Below that volume, the operational overhead of self-hosting exceeds the API cost savings. Above that volume, the per-token economics of dedicated GPUs start to win — but only if you can maintain >60% GPU utilization and have the engineering team to support it.
Key insight: Self-hosting hidden costs are 3–5x the raw GPU price. A $5,760/month GPU cluster actually costs $8,800–18,260/month when you add operations, monitoring, security, and model management. Most teams underestimate this dramatically.
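The crossover can be made concrete with two assumed inputs: a hypothetical blended API rate of $1 per 1M tokens (actual blended rates vary widely by model and input/output mix) and a flat self-hosting cost at the midpoint of the $8,800–18,260/month range above. Under those assumptions the break-even lands in the same ballpark as the ~10B figure; shifting the assumed API rate shifts the exact point.

```python
def api_monthly_cost(tokens_m: float, price_per_m_tokens: float = 1.0) -> float:
    """API bill at an assumed blended rate (hypothetical $1 per 1M tokens)."""
    return tokens_m * price_per_m_tokens

SELF_HOST_MONTHLY = 12_000  # midpoint of the $8,800–18,260/month range above

for tokens_b in (1, 5, 10, 20):
    api = api_monthly_cost(tokens_b * 1_000)  # billions → millions of tokens
    winner = "API" if api < SELF_HOST_MONTHLY else "self-host"
    print(f"{tokens_b}B tokens/month: API ${api:,.0f} "
          f"vs self-host ${SELF_HOST_MONTHLY:,} → {winner}")
```

The fixed-cost side is the trap: it is paid in full whether traffic shows up or not, which is exactly why bursty workloads favor APIs.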
When Self-Hosting Wins
The four scenarios where owning GPUs makes sense
Scenario 1: Regulated Data
HIPAA, SOC 2, GDPR, data sovereignty. When regulations prohibit sending data to third-party APIs, self-hosting is the only option. Healthcare, finance, and government often require on-premises or private-cloud inference. The cost premium is a compliance cost, not a technology choice.
Scenario 2: Ultra-High Volume
>10B tokens/month with consistent, predictable demand. At this scale, dedicated GPUs at high utilization beat per-token API pricing. But the demand must be consistent — bursty workloads still favor APIs because idle GPUs cost the same as busy ones.
Scenario 3: Latency Requirements
Sub-100ms inference. API calls include network latency, queue time, and shared infrastructure overhead. Self-hosted models on dedicated GPUs can achieve 2–5x lower latency for latency-critical applications like real-time trading, gaming, or interactive voice.
Scenario 4: Custom Models
Fine-tuned or proprietary models that can’t run on API providers. If you’ve trained a custom model on proprietary data, you need your own inference infrastructure. This is increasingly common for domain-specific applications in legal, medical, and scientific fields.
Key insight: If none of these four scenarios apply to you, use APIs. The engineering complexity of self-hosting is substantial, and the cost savings only materialize at extreme scale with dedicated operations staff.
GPU Utilization: The Hidden Metric
Why most self-hosted deployments waste 70–85% of their GPU capacity
The Utilization Problem
A GPU costs the same whether it’s processing tokens or sitting idle. Most self-hosted deployments run at 15–30% average utilization because demand is bursty — peak hours see high load, but nights and weekends are quiet. At 20% utilization, your effective cost per token is 5x higher than the theoretical maximum.
// Utilization impact on cost per token
100% utilization: $0.10/1K tokens
 50% utilization: $0.20/1K tokens (2x)
 20% utilization: $0.50/1K tokens (5x)
 10% utilization: $1.00/1K tokens (10x)
// At 10% utilization, APIs are almost always cheaper
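The table's arithmetic is just fixed cost divided by utilization, since an idle GPU costs the same as a busy one. A minimal sketch (the $0.10/1K-token base is the illustrative figure from the table, not a real price):

```python
def effective_cost(base_cost_per_1k: float, utilization: float) -> float:
    """Effective cost per 1K tokens: fixed GPU cost spread over actual usage."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return base_cost_per_1k / utilization

for u in (1.0, 0.5, 0.2, 0.1):
    print(f"{u:>4.0%} utilization: ${effective_cost(0.10, u):.2f}/1K tokens")
```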
Improving Utilization
Techniques to push utilization higher: Request batching (group multiple requests into single GPU passes). Multi-model serving (run different models on the same GPU based on demand). Spot/preemptible instances (use cheap GPU time for non-urgent batch work). Auto-scaling (spin GPUs up/down with demand, though this has cold-start penalties).
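The first technique, request batching, is conceptually simple: group pending prompts so each GPU forward pass serves many requests instead of one. A toy sketch of the grouping step (real serving stacks such as vLLM do this continuously and dynamically; the function here is a hypothetical illustration):

```python
def make_batches(pending: list, max_batch: int = 8) -> list:
    """Group pending prompts into GPU-sized batches, one forward pass each."""
    return [pending[i:i + max_batch] for i in range(0, len(pending), max_batch)]

pending = [f"req-{i}" for i in range(10)]
batches = make_batches(pending, max_batch=4)
print(len(batches))  # 3 batched passes instead of 10 single-prompt passes
```

The trade-off is a small added latency (requests wait to be grouped) in exchange for much higher utilization per pass.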
Key insight: Before self-hosting, honestly assess your expected utilization. If you can’t sustain >60% utilization, APIs will be cheaper. The break-even math only works when GPUs are busy most of the time.
The Infrastructure Decision Tree
A simple framework for choosing your path
The Decision Tree
// Infrastructure decision framework
Q1: Regulated data (HIPAA/SOC 2)?
    Yes → Self-host (mandatory)
    No  → Q2
Q2: >10B tokens/month consistently?
    Yes → Evaluate self-hosting
    No  → Q3
Q3: Need sub-100ms latency?
    Yes → Evaluate self-hosting
    No  → Q4
Q4: Custom/fine-tuned models?
    Yes → Self-host or specialized cloud
    No  → Use APIs
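The framework is mechanical enough to encode directly. A sketch, not policy advice; the function name and return strings are illustrative:

```python
def infrastructure_choice(regulated: bool, tokens_b_per_month: float,
                          needs_sub_100ms: bool, custom_model: bool) -> str:
    """Walk the four questions of the decision tree in order."""
    if regulated:                   # Q1: compliance overrides everything
        return "self-host (mandatory)"
    if tokens_b_per_month > 10:     # Q2: extreme, consistent volume
        return "evaluate self-hosting"
    if needs_sub_100ms:             # Q3: latency-critical workloads
        return "evaluate self-hosting"
    if custom_model:                # Q4: proprietary / fine-tuned models
        return "self-host or specialized cloud"
    return "use APIs"               # the default for most teams

print(infrastructure_choice(False, 2, False, False))  # → use APIs
```

Note the ordering matters: regulated data short-circuits the cost questions entirely, which mirrors the point above that compliance is a mandate, not a trade-off.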
What’s Next
Chapter 6 covers the optimization playbook — the water conservation analogy. Whether you use APIs or self-host, these techniques (caching, routing, compression, distillation, batching) can cut your bill by 60–70%. They work at every scale and every infrastructure choice.
Chapter Summary
APIs win for 87% of use cases. NVIDIA GPUs carry 79–89% margins, with B200 costing $6,400 to make and selling for $40K+. Cloud GPU pricing varies 3x between specialized and hyperscaler providers. Self-hosting hidden costs are 3–5x raw GPU price. The break-even is ~10B tokens/month at >60% utilization. Four scenarios justify self-hosting: regulated data, ultra-high volume, latency requirements, and custom models.