Ch 5 — The GPU & Infrastructure Layer

The real estate analogy — renting, buying, or building
The Real Estate Analogy
Renting an apartment vs buying a house vs building from scratch
Three Options
Renting (API) — Pay per use, no maintenance, move anytime. Like renting an apartment: you pay monthly, the landlord handles plumbing and repairs.
Buying (Self-hosting) — Purchase GPUs, run your own models. Like buying a house: big upfront cost, you handle maintenance, but no monthly rent.
Building (Custom infra) — Design custom hardware and software stacks. Like building a house from the ground up: maximum control, maximum complexity.
The Decision Framework
Most teams should rent (use APIs). APIs win for 87% of use cases because they eliminate infrastructure complexity, scale instantly, and require zero GPU expertise. Self-hosting only makes sense at extreme volume (>10B tokens/month) or with strict regulatory requirements (HIPAA, data sovereignty). Building custom infrastructure is reserved for hyperscalers and AI labs.
Key insight: The most common mistake is self-hosting too early. Teams underestimate the hidden costs (operations, monitoring, security, model updates) by 3–5x. Start with APIs, measure your actual volume, and only self-host when the math clearly justifies it.
NVIDIA’s GPU Stack
H100, H200, B200, and the Vera Rubin roadmap
The GPU Lineup (2024–2026)
// NVIDIA datacenter GPU stack
H100 (2023)         80GB HBM3        Workhorse
H200 (2024)         141GB HBM3e      Memory upgrade
B200 (2025)         192GB HBM3e      Blackwell arch
GB200 (2025)        384GB combined   Superchip
Vera Rubin (2026+)                   Next gen
Why Memory Matters
GPU memory (HBM) determines how large a model can fit on a single chip. A 70B parameter model needs ~140GB in FP16, requiring 2 H100s but only 1 H200. Fewer GPUs = lower cost, simpler infrastructure, and better utilization. The memory race is as important as the compute race.
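The memory arithmetic above can be sketched as a small helper. This is an illustrative estimate of weight memory only (real deployments also need room for KV cache and activations, which the `overhead` parameter hints at); the function name and defaults are assumptions, not a real library API.

```python
import math

def gpus_needed(params_b: float, bytes_per_param: int, hbm_gb: int,
                overhead: float = 1.0) -> int:
    """Minimum GPUs to hold model weights: params * bytes, divided by HBM per GPU.

    overhead > 1.0 can approximate KV cache / activation headroom; the default
    of 1.0 counts weights alone, matching the back-of-envelope in the text.
    """
    weights_gb = params_b * bytes_per_param  # 1B params at 2 bytes (FP16) = 2 GB
    return math.ceil(weights_gb * overhead / hbm_gb)

# 70B parameters in FP16 (2 bytes/param) ≈ 140 GB of weights
print(gpus_needed(70, 2, 80))   # H100 (80 GB)  → 2
print(gpus_needed(70, 2, 141))  # H200 (141 GB) → 1
```

This is why the H200's memory bump matters: halving the GPU count simplifies the whole serving stack, not just the bill.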
Performance Leaps
Each generation delivers 2–3x inference throughput over the previous. B200 processes roughly 2.5x more tokens per second than H100 for the same model. This is a key driver of the cost collapse from Chapter 1 — newer hardware makes each token cheaper to serve, which providers pass on (partially) as lower API prices.
Key insight: You don’t need to understand GPU architecture to use AI effectively. But understanding that each generation makes inference 2–3x cheaper explains why API prices keep falling and why locking into long-term GPU contracts can be risky.
Manufacturing Economics
What GPUs actually cost to make vs what they sell for
Cost vs Price
// Manufacturing cost vs selling price
H100                Manufacturing: $3,320    Selling price: $28,000          Gross margin: 88%
H200                Manufacturing: $4,250    Selling price: $38,000          Gross margin: 89%
B200                Manufacturing: $6,400    Selling price: $40,000–50,000   Gross margin: 84%
GB200 (superchip)   Manufacturing: $13,500   Selling price: $65,000          Gross margin: 79%
The Margin Story
NVIDIA operates at 79–89% gross margins on datacenter GPUs. A B200 costs ~$6,400 to manufacture and sells for $40,000–$50,000. The biggest cost driver is HBM memory, which accounts for 35–47% of manufacturing cost ($2,900 for B200’s HBM3e). Advanced packaging (CoWoS-L) adds another $1,100.
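The margin figures above follow directly from the cost and price columns. A quick sanity check (using the chapter's own numbers; the standard gross-margin formula, nothing vendor-specific):

```python
def gross_margin(price: float, cost: float) -> float:
    """Gross margin as a fraction of selling price: (price - cost) / price."""
    return (price - cost) / price

# Figures from the table above
print(f"H100:  {gross_margin(28_000, 3_320):.0%}")   # 88%
print(f"B200 (at $40K): {gross_margin(40_000, 6_400):.0%}")  # 84%
print(f"GB200: {gross_margin(65_000, 13_500):.0%}")  # 79%
```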
Key insight: NVIDIA’s margins explain why GPU cloud pricing has room to fall. Specialized cloud providers (Northflank, RunPod) can offer B200s at $5–9/hour vs hyperscalers at $14–19/hour by accepting lower margins and optimizing utilization.
Cloud GPU Pricing
Hyperscalers vs specialized clouds — the 3x gap
B200 Hourly Rates (March 2026)
// B200 cloud GPU pricing by provider
Northflank   $5.87/hr    (specialized)
RunPod       $8.64/hr    (specialized)
AWS          $14.24/hr   (hyperscaler)
GCP          $18.53/hr   (hyperscaler)

// Monthly cost (24/7 usage)
Northflank   $4,226/month
GCP          $13,342/month
// 3.2x difference for identical hardware
Hidden Cloud Costs
The hourly GPU rate is just the start. Egress fees (data transfer out) can add 20–30% for high-throughput inference; storage premiums apply to model weights and data; virtualization overhead reduces effective GPU performance by 5–15%; and multi-GPU setups incur networking charges. Together these hidden costs typically add 20–40% on top of the base GPU rate.
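A rough monthly estimate can fold those hidden costs into a single overhead factor. This is a sketch under stated assumptions: a 720-hour month (matching the table above) and a 30% overhead midpoint from the 20–40% range; the function name and defaults are illustrative.

```python
def monthly_cost(hourly_rate: float, hidden_overhead: float = 0.30,
                 hours: float = 720) -> float:
    """Base GPU rent plus an assumed hidden-cost factor (egress, storage,
    virtualization loss, networking). 720 hours ≈ one month of 24/7 usage."""
    return hourly_rate * hours * (1 + hidden_overhead)

print(f"Northflank B200: ${monthly_cost(5.87):,.0f}/month")   # ≈ $5,494 all-in
print(f"GCP B200:        ${monthly_cost(18.53):,.0f}/month")  # ≈ $17,344 all-in
```

Note the all-in numbers stay in the same ~3.2x ratio as the base rates: overhead multiplies both sides, so the specialized-vs-hyperscaler gap survives it.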
Key insight: GPU utilization is the critical metric. A GPU at 10% load costs 10x per token vs one at 100% load. Most self-hosted deployments run at 15–30% utilization, meaning you’re paying for 3–7x more GPU than you actually use.
API vs Self-Hosting: The Break-Even
When does owning your own GPUs make sense?
The Comparison
API (Renting)
GPT-5 API: ~$168/month for moderate usage. Zero infrastructure. Instant scaling. Automatic model updates. No GPU expertise needed. Pay only for what you use.
Self-Hosted (Buying)
4x A100 cluster: ~$5,760/month cloud rental. Plus 0.25–1.0 FTE for operations ($37K–$150K/year). Model serving software. Monitoring. Security. Updates. Total: $8,800–18,260/month.
The Crossover Point
APIs win until you hit roughly 10 billion tokens/month (~330M tokens/day). Below that volume, the operational overhead of self-hosting exceeds the API cost savings. Above that volume, the per-token economics of dedicated GPUs start to win — but only if you can maintain >60% GPU utilization and have the engineering team to support it.
Key insight: Self-hosting hidden costs are 3–5x the raw GPU price. A $5,760/month GPU cluster actually costs $8,800–18,260/month when you add operations, monitoring, security, and model management. Most teams underestimate this dramatically.
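The crossover can be made concrete with two assumed inputs: a hypothetical blended API rate of $1 per 1M tokens (actual blended rates vary widely by model and input/output mix) and a flat self-hosting cost at the midpoint of the $8,800–18,260/month range above. Under those assumptions the break-even lands in the same ballpark as the ~10B figure; shifting the assumed API rate shifts the exact point.

```python
def api_monthly_cost(tokens_m: float, price_per_m_tokens: float = 1.0) -> float:
    """API bill at an assumed blended rate (hypothetical $1 per 1M tokens)."""
    return tokens_m * price_per_m_tokens

SELF_HOST_MONTHLY = 12_000  # midpoint of the $8,800–18,260/month range above

for tokens_b in (1, 5, 10, 20):
    api = api_monthly_cost(tokens_b * 1_000)  # billions → millions of tokens
    winner = "API" if api < SELF_HOST_MONTHLY else "self-host"
    print(f"{tokens_b}B tokens/month: API ${api:,.0f} "
          f"vs self-host ${SELF_HOST_MONTHLY:,} → {winner}")
```

The fixed-cost side is the trap: it is paid in full whether traffic shows up or not, which is exactly why bursty workloads favor APIs.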
When Self-Hosting Wins
The four scenarios where owning GPUs makes sense
Scenario 1: Regulated Data
HIPAA, SOC 2, GDPR, data sovereignty. When regulations prohibit sending data to third-party APIs, self-hosting is the only option. Healthcare, finance, and government often require on-premises or private-cloud inference. The cost premium is a compliance cost, not a technology choice.
Scenario 2: Ultra-High Volume
>10B tokens/month with consistent, predictable demand. At this scale, dedicated GPUs at high utilization beat per-token API pricing. But the demand must be consistent — bursty workloads still favor APIs because idle GPUs cost the same as busy ones.
Scenario 3: Latency Requirements
Sub-100ms inference. API calls include network latency, queue time, and shared infrastructure overhead. Self-hosted models on dedicated GPUs can achieve 2–5x lower latency for latency-critical applications like real-time trading, gaming, or interactive voice.
Scenario 4: Custom Models
Fine-tuned or proprietary models that can’t run on API providers. If you’ve trained a custom model on proprietary data, you need your own inference infrastructure. This is increasingly common for domain-specific applications in legal, medical, and scientific fields.
Key insight: If none of these four scenarios apply to you, use APIs. The engineering complexity of self-hosting is substantial, and the cost savings only materialize at extreme scale with dedicated operations staff.
GPU Utilization: The Hidden Metric
Why most self-hosted deployments waste 70–85% of their GPU capacity
The Utilization Problem
A GPU costs the same whether it’s processing tokens or sitting idle. Most self-hosted deployments run at 15–30% average utilization because demand is bursty — peak hours see high load, but nights and weekends are quiet. At 20% utilization, your effective cost per token is 5x higher than the theoretical maximum.
// Utilization impact on cost per token
100% utilization: $0.10/1K tokens
 50% utilization: $0.20/1K tokens (2x)
 20% utilization: $0.50/1K tokens (5x)
 10% utilization: $1.00/1K tokens (10x)
// At 10% utilization, APIs are almost always cheaper
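The table's arithmetic is just fixed cost divided by utilization, since an idle GPU costs the same as a busy one. A minimal sketch (the $0.10/1K-token base is the illustrative figure from the table, not a real price):

```python
def effective_cost(base_cost_per_1k: float, utilization: float) -> float:
    """Effective cost per 1K tokens: fixed GPU cost spread over actual usage."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return base_cost_per_1k / utilization

for u in (1.0, 0.5, 0.2, 0.1):
    print(f"{u:>4.0%} utilization: ${effective_cost(0.10, u):.2f}/1K tokens")
```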
Improving Utilization
Techniques to push utilization higher: Request batching (group multiple requests into single GPU passes). Multi-model serving (run different models on the same GPU based on demand). Spot/preemptible instances (use cheap GPU time for non-urgent batch work). Auto-scaling (spin GPUs up/down with demand, though this has cold-start penalties).
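The first technique, request batching, is conceptually simple: group pending prompts so each GPU forward pass serves many requests instead of one. A toy sketch of the grouping step (real serving stacks such as vLLM do this continuously and dynamically; the function here is a hypothetical illustration):

```python
def make_batches(pending: list, max_batch: int = 8) -> list:
    """Group pending prompts into GPU-sized batches, one forward pass each."""
    return [pending[i:i + max_batch] for i in range(0, len(pending), max_batch)]

pending = [f"req-{i}" for i in range(10)]
batches = make_batches(pending, max_batch=4)
print(len(batches))  # 3 batched passes instead of 10 single-prompt passes
```

The trade-off is a small added latency (requests wait to be grouped) in exchange for much higher utilization per pass.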
Key insight: Before self-hosting, honestly assess your expected utilization. If you can’t sustain >60% utilization, APIs will be cheaper. The break-even math only works when GPUs are busy most of the time.
The Infrastructure Decision Tree
A simple framework for choosing your path
The Decision Tree
// Infrastructure decision framework
Q1: Regulated data (HIPAA/SOC 2)?
    Yes → Self-host (mandatory)
    No  → Q2
Q2: >10B tokens/month consistently?
    Yes → Evaluate self-hosting
    No  → Q3
Q3: Need sub-100ms latency?
    Yes → Evaluate self-hosting
    No  → Q4
Q4: Custom/fine-tuned models?
    Yes → Self-host or specialized cloud
    No  → Use APIs
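The framework is mechanical enough to encode directly. A sketch, not policy advice; the function name and return strings are illustrative:

```python
def infrastructure_choice(regulated: bool, tokens_b_per_month: float,
                          needs_sub_100ms: bool, custom_model: bool) -> str:
    """Walk the four questions of the decision tree in order."""
    if regulated:                   # Q1: compliance overrides everything
        return "self-host (mandatory)"
    if tokens_b_per_month > 10:     # Q2: extreme, consistent volume
        return "evaluate self-hosting"
    if needs_sub_100ms:             # Q3: latency-critical workloads
        return "evaluate self-hosting"
    if custom_model:                # Q4: proprietary / fine-tuned models
        return "self-host or specialized cloud"
    return "use APIs"               # the default for most teams

print(infrastructure_choice(False, 2, False, False))  # → use APIs
```

Note the ordering matters: regulated data short-circuits the cost questions entirely, which mirrors the point above that compliance is a mandate, not a trade-off.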
What’s Next
Chapter 6 covers the optimization playbook — the water conservation analogy. Whether you use APIs or self-host, these techniques (caching, routing, compression, distillation, batching) can cut your bill by 60–70%. They work at every scale and every infrastructure choice.
Chapter Summary
APIs win for 87% of use cases. NVIDIA GPUs carry 79–89% margins, with B200 costing $6,400 to make and selling for $40K+. Cloud GPU pricing varies 3x between specialized and hyperscaler providers. Self-hosting hidden costs are 3–5x raw GPU price. The break-even is ~10B tokens/month at >60% utilization. Four scenarios justify self-hosting: regulated data, ultra-high volume, latency requirements, and custom models.