Ch 12 — The GPU Revolution: The Engine Room of AI

How a chip designed for video games became the most strategically important technology in the world
CPU vs. GPU: Why It Matters
The difference between doing one thing fast and doing millions of things at once
The CPU Approach
A CPU (Central Processing Unit) is designed to handle complex, sequential tasks extremely well. It has a small number of powerful cores (typically 8–64) that execute instructions one after another at very high speed. It’s a brilliant generalist — great at running your operating system, executing business logic, and handling diverse workloads. But when you need to perform the same mathematical operation on millions of data points simultaneously, it becomes a bottleneck.
The GPU Approach
A GPU (Graphics Processing Unit) takes the opposite approach. Instead of a few powerful cores, it has thousands of smaller cores (16,000+ on modern chips) that work in parallel. Originally designed to render millions of pixels on a screen simultaneously for video games, GPUs turned out to be perfectly suited for neural network training — which is fundamentally millions of simple math operations performed in parallel.
The Mental Model
Imagine you need to grade 10,000 multiple-choice exams. A CPU is like one brilliant professor who grades each exam thoroughly, one at a time. A GPU is like 10,000 teaching assistants who each grade one exam simultaneously. For complex, sequential reasoning, the professor is better. For massive parallel tasks, the army of assistants finishes in a fraction of the time.
Key insight: Neural network training is inherently parallel — millions of weights being updated simultaneously, millions of data points being processed in batches. This is why GPUs accelerated AI training by 10–100× compared to CPUs, and why the deep learning revolution of 2012 coincided with researchers discovering they could repurpose gaming GPUs for AI.
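The exam-grading analogy can be put in numbers. This is a minimal sketch with illustrative, made-up per-exam times (2 minutes for the professor, 10 for an assistant) and an idealized assumption of perfect parallelism with no coordination overhead:

```python
import math

def total_time(exams: int, workers: int, minutes_per_exam: float) -> float:
    """Wall-clock minutes to grade all exams when work is split evenly
    across workers (perfect parallelism assumed, with no overhead)."""
    return math.ceil(exams / workers) * minutes_per_exam

EXAMS = 10_000

# One professor: fast per exam, but strictly sequential (CPU-like).
cpu_like = total_time(EXAMS, workers=1, minutes_per_exam=2)
# 10,000 assistants: slower per exam, but fully parallel (GPU-like).
gpu_like = total_time(EXAMS, workers=10_000, minutes_per_exam=10)

print(f"Sequential: {cpu_like:,.0f} minutes")   # 20,000 minutes
print(f"Parallel:   {gpu_like:,.0f} minutes")   # 10 minutes
```

Even though each assistant is five times slower per exam, the parallel approach finishes 2,000× faster, which is the same dynamic behind the 10–100× GPU training speedups (real workloads fall short of the ideal because of data movement and coordination costs).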
NVIDIA’s Dominance
How a gaming company became the most valuable in the world
The Numbers
NVIDIA controls 80–90% of the AI GPU market, with some estimates reaching 95% for training accelerators. In fiscal year 2026, NVIDIA’s data center revenue hit $112 billion — 87% of its total $130 billion revenue. The company’s market capitalization has surpassed $3 trillion, making it one of the most valuable companies in history. All of this from a company that started making chips for video games.
The Product Line
H100 — The workhorse of AI training in 2023–2024. Manufacturing cost: ~$3,320. Selling price: ~$28,000. That’s an 88% gross margin — reflecting the extreme demand and limited alternatives.

B200 (Blackwell) — The current generation. 5× inference performance over H100. Manufacturing cost: ~$6,400. Selling price: ~$40,000. Sold out through mid-2026 with a 3.6 million unit backlog. Generated $11 billion in its first quarter alone.
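The margin figures follow directly from the cost and price estimates quoted above (which are industry estimates, not official NVIDIA disclosures):

```python
def gross_margin(price: float, cost: float) -> float:
    """Gross margin as a fraction of selling price."""
    return (price - cost) / price

# Estimated figures from the text.
h100 = gross_margin(price=28_000, cost=3_320)
b200 = gross_margin(price=40_000, cost=6_400)

print(f"H100 gross margin: {h100:.0%}")  # 88%
print(f"B200 gross margin: {b200:.0%}")  # 84%
```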
Why the Monopoly Persists
NVIDIA’s dominance isn’t just about hardware. It’s about the ecosystem:

CUDA — NVIDIA’s proprietary software platform, used by 98% of AI developers. Every major AI framework (PyTorch, TensorFlow) is optimized for CUDA. Switching away means rewriting code.
Libraries — cuDNN, TensorRT, and other optimized libraries that make NVIDIA GPUs faster in practice, not just in specs.
Network effects — Researchers publish code that runs on NVIDIA GPUs. New researchers use NVIDIA to reproduce results. The cycle reinforces itself.
Key insight: NVIDIA’s moat is software, not hardware. AMD and others can build competitive chips, but the CUDA ecosystem creates switching costs that keep developers locked in. For executives, this means AI infrastructure decisions have long-term vendor implications. Choosing a GPU platform is choosing a software ecosystem.
The Economics of AI Training
What it actually costs to build a frontier model
Training Cost by Model Size
The cost of training an AI model scales dramatically with size:

7 billion parameters — $50,000–$500,000
70 billion parameters — $1.2M–$6M (down 45% from 2024 due to newer, more efficient GPUs)
175+ billion parameters — $25M–$120M

GPT-4’s training reportedly cost around $100 million. Meta’s Llama 3 cost approximately $25 million. DeepSeek’s efficient approach brought costs down to ~$5.6 million for a competitive model.
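Where do these numbers come from? A widely used back-of-envelope rule estimates training compute as roughly 6 × parameters × training tokens (in FLOPs). The throughput, utilization, and price defaults below are assumptions chosen to be plausible for rented H100-class hardware, not figures from the text:

```python
def training_cost_usd(params: float, tokens: float,
                      flops_per_sec: float = 1e15,   # assumed peak per-GPU throughput
                      utilization: float = 0.35,     # assumed real-world utilization
                      usd_per_gpu_hour: float = 3.0) -> float:
    """Back-of-envelope training cost using the common ~6*N*D FLOPs rule."""
    total_flops = 6 * params * tokens
    gpu_seconds = total_flops / (flops_per_sec * utilization)
    return gpu_seconds / 3600 * usd_per_gpu_hour

# A 70B-parameter model trained on ~2 trillion tokens:
cost = training_cost_usd(70e9, 2e12)
print(f"${cost:,.0f}")  # $2,000,000 under these assumptions
```

Under these assumptions a 70B model costs about $2M to train, which lands inside the $1.2M–$6M range above; the estimate is very sensitive to utilization and GPU pricing.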
Training vs. Inference
Training is a one-time cost (though retraining happens regularly). Inference — actually running the model to make predictions — is the ongoing cost that scales with usage. For large language models, inference can cost more than training over the model’s lifetime because it runs continuously for millions of users. This is why inference efficiency (how fast and cheaply a model can respond) is becoming more important than training efficiency.
Key insight: Training costs are falling rapidly — 45% reduction in just one year for 70B models. But the cost of running AI at scale (inference) is the number that matters for enterprise P&L. When evaluating AI solutions, ask about the per-query or per-transaction cost, not just the upfront training investment.
Cloud AI Infrastructure
AWS, Azure, GCP — and the specialized alternatives
The Hyperscalers
Most organizations access GPU compute through cloud providers rather than buying hardware:

AWS — H100 instances at ~$6.88/hr on-demand, reducible to ~$2.97/hr with 3-year reserved instances. Strong SageMaker integration for ML workflows.
Azure — Lowest on-demand rates among hyperscalers. Deep enterprise integration with Microsoft 365 and OpenAI partnership.
Google Cloud — Premium pricing but offers committed-use discounts and proprietary TPU chips as an alternative to GPUs.
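The gap between on-demand and reserved pricing compounds quickly at cluster scale. A quick sketch using the AWS rates quoted above (cluster size is an illustrative assumption):

```python
HOURS_PER_YEAR = 24 * 365

def annual_cost(rate_per_hour: float, gpus: int = 8) -> float:
    """Annual cost of running a GPU cluster continuously at a given hourly rate."""
    return rate_per_hour * gpus * HOURS_PER_YEAR

on_demand = annual_cost(6.88)  # AWS on-demand H100 rate from the text
reserved = annual_cost(2.97)   # 3-year reserved rate from the text

print(f"On-demand: ${on_demand:,.0f}/yr")
print(f"Reserved:  ${reserved:,.0f}/yr")
print(f"Savings:   {1 - reserved / on_demand:.0%}")  # 57%
```

The tradeoff is commitment: reserved pricing locks in spend for three years, which only pays off if the workload is sustained.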
Specialized GPU Clouds
A growing category of providers (CoreWeave, Lambda, CUDO Compute) focuses exclusively on AI workloads, offering 30–70% lower costs than hyperscalers. They achieve this by eliminating the overhead of general-purpose cloud services and specializing in GPU scheduling and optimization. The tradeoff: fewer enterprise features, less geographic coverage, and smaller ecosystems.
Cost Optimization
Spot/preemptible instances — Use idle GPU capacity at 60–80% discount, with the risk of interruption.
Mixed-precision training — Use lower numerical precision where full precision isn’t needed, cutting compute by 30–50%.
Efficient data pipelines — Ensure GPUs aren’t sitting idle waiting for data.

Combined, these strategies can reduce AI compute costs by 30–60%.
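A naive way to model stacked savings is to compound the discounts multiplicatively. This is a simplification: spot capacity and mixed precision do not apply to every workload, which is why the realistic combined figure above (30–60%) sits below what the math alone suggests:

```python
def combined_cost_factor(*discounts: float) -> float:
    """Remaining cost fraction when independent discounts compound
    multiplicatively (an idealized assumption)."""
    factor = 1.0
    for d in discounts:
        factor *= (1 - d)
    return factor

# Conservative ends of the ranges above: 60% spot discount, 30% from
# mixed precision. Full compounding is assumed, which overstates savings.
remaining = combined_cost_factor(0.60, 0.30)
print(f"Remaining cost: {remaining:.0%} of baseline")  # 28%
```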
Key insight: The cloud vs. on-premises decision for AI is not the same as for traditional IT. AI workloads are “bursty” — you need massive compute for training (weeks), then much less for inference (ongoing). Cloud provides the flexibility to scale up for training and scale down after. On-premises makes sense only at very high, sustained utilization rates.
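The cloud-versus-on-premises break-even can be sketched as a utilization threshold. The amortized on-premises figure below is a hypothetical placeholder (hardware, power, and operations per GPU per year); the cloud rate is the reserved rate quoted earlier:

```python
def breakeven_utilization(onprem_annual_cost: float,
                          cloud_rate_per_hour: float) -> float:
    """Utilization at which a year of cloud GPU-hours costs the same as
    owning the hardware (all figures per GPU)."""
    cloud_cost_at_full_use = cloud_rate_per_hour * 24 * 365
    return onprem_annual_cost / cloud_cost_at_full_use

# Hypothetical: $18k/yr amortized per on-prem GPU vs ~$2.97/hr reserved cloud.
u = breakeven_utilization(18_000, 2.97)
print(f"Break-even utilization: {u:.0%}")  # 69%
```

Under these assumptions, on-premises only wins above roughly 70% sustained utilization, consistent with the rule of thumb that appears in the decision framework at the end of this chapter.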
Beyond NVIDIA: The Competitive Landscape
AMD, custom silicon, and the race to break the monopoly
AMD
AMD holds approximately 5–8% of the AI accelerator market with its MI300X chip. Competitive on raw performance specs and significantly cheaper than NVIDIA equivalents. The challenge: CUDA compatibility. AMD’s ROCm software ecosystem is improving but still lags in library support and developer adoption. For organizations willing to invest in porting, AMD offers a cost-effective alternative.
Custom Silicon
Google TPUs (Tensor Processing Units) — Purpose-built for AI, available only through Google Cloud. Competitive for specific workloads, especially Transformer training.
Amazon Trainium/Inferentia — AWS’s custom chips for training and inference, offering significant cost savings for workloads within the AWS ecosystem.
Apple Silicon — M-series chips with integrated neural engines, enabling on-device AI for consumer applications.
The Startup Wave
Groq — Designed for inference speed, achieving record-breaking tokens-per-second for LLM serving.
Cerebras — Builds wafer-scale chips (the size of a dinner plate) for training, eliminating communication bottlenecks between chips.
SambaNova — Reconfigurable dataflow architecture optimized for enterprise AI workloads.

These startups target specific niches where NVIDIA’s general-purpose approach leaves room for optimization.
Key insight: The AI chip market is a $160 billion opportunity in 2025. While NVIDIA dominates today, the competitive landscape is intensifying. For enterprise leaders, the strategic question is: how much vendor lock-in are you willing to accept for NVIDIA’s ecosystem advantages? Diversifying across platforms reduces risk but increases complexity.
The Energy Question
AI’s growing appetite for power
The Scale of the Problem
Training GPT-4 consumed an estimated 50 GWh of electricity — roughly the annual consumption of 4,600 US homes. A single NVIDIA H100 GPU draws 700 watts under full load. A large AI training cluster with 10,000 GPUs consumes 7 megawatts continuously — enough to power a small town. And that’s before cooling, networking, and storage.
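The 7-megawatt figure follows directly from the per-GPU draw, and adding a power usage effectiveness (PUE) multiplier (the 1.3 below is an assumed value for cooling and facility overhead, not from the text) shows how quickly the total grows:

```python
WATTS_PER_GPU = 700   # H100 under full load (from the text)
GPUS = 10_000
PUE = 1.3             # assumed power usage effectiveness (cooling, facility)

it_power_mw = WATTS_PER_GPU * GPUS / 1e6
total_power_mw = it_power_mw * PUE
annual_gwh = total_power_mw * 24 * 365 / 1000

print(f"IT load:      {it_power_mw:.1f} MW")     # 7.0 MW, as in the text
print(f"With cooling: {total_power_mw:.1f} MW")  # 9.1 MW
print(f"Annual use:   {annual_gwh:.1f} GWh")
```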
Data Center Expansion
AI is driving an unprecedented wave of data center construction. Hyperscalers are investing tens of billions in new facilities, often in locations chosen for access to cheap, renewable energy. Microsoft, Google, and Amazon have all signed long-term power purchase agreements for nuclear and renewable energy specifically to power AI workloads. Some are even investing in next-generation nuclear reactors.
Efficiency Improvements
The industry is responding on multiple fronts:
Hardware efficiency — Each GPU generation delivers 2–3× more performance per watt. Blackwell is significantly more energy-efficient than Hopper.
Model efficiency — Techniques like quantization, distillation, and sparse architectures reduce compute requirements by 50–90% with minimal accuracy loss.
Inference optimization — Serving a model efficiently requires far less energy than training it. Optimized inference engines can reduce per-query energy by 10×.
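Quantization's impact is easiest to see in memory terms. A rough sketch for a hypothetical 70B-parameter model (weights only, ignoring activations and runtime overhead):

```python
def model_size_gb(params: float, bits_per_weight: int) -> float:
    """Approximate weight-storage footprint, ignoring activations and overhead."""
    return params * bits_per_weight / 8 / 1e9

# A hypothetical 70B-parameter model at decreasing precision:
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {model_size_gb(70e9, bits):.0f} GB")
# 16-bit: 140 GB, 8-bit: 70 GB, 4-bit: 35 GB
```

Halving the bits halves the memory, which in turn means fewer GPUs per served model and less energy per query.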
Key insight: Energy consumption is becoming a board-level concern for AI. It affects operating costs, sustainability commitments, and regulatory compliance. When evaluating AI infrastructure, energy efficiency should be a first-order consideration, not an afterthought. The most efficient approach is often the most cost-effective one.
Training vs. Inference: Two Different Problems
Building the model vs. running the model — different hardware, different economics
Training
Goal: Process the entire training dataset, adjust billions of weights, minimize error. Done once (or periodically for retraining).
Hardware needs: Maximum compute throughput. Large GPU clusters (hundreds to thousands of GPUs) connected by high-speed networking.
Duration: Days to months for large models.
Cost profile: Large upfront investment, then done. GPT-4: ~$100M. A fine-tuned enterprise model: $50K–$500K.
Inference
Goal: Respond to individual requests as fast and cheaply as possible. Runs continuously, 24/7.
Hardware needs: Low latency, high throughput, cost efficiency. Smaller GPUs or specialized inference chips often suffice.
Duration: Milliseconds to seconds per request, but millions of requests per day.
Cost profile: Ongoing, scales with usage. Often exceeds training cost over the model’s lifetime.
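The "inference eventually exceeds training" claim can be made concrete with a break-even calculation. The inputs below are hypothetical (a fine-tuned enterprise model, not figures from the text):

```python
def days_until_inference_exceeds_training(training_cost: float,
                                          queries_per_day: float,
                                          cost_per_query: float) -> float:
    """Days of serving before cumulative inference spend passes training spend."""
    return training_cost / (queries_per_day * cost_per_query)

# Hypothetical: $200k to fine-tune, 1M queries/day at $0.002 per query.
days = days_until_inference_exceeds_training(200_000, 1_000_000, 0.002)
print(f"Inference overtakes training after {days:.0f} days")  # 100 days
```

Under these assumptions, inference spend passes the entire training investment in just over three months, and every day after that widens the gap.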
Key insight: Most enterprise AI spending will be on inference, not training. You’ll likely use a pre-trained foundation model (or fine-tune one), then run it continuously for your users. The infrastructure decision should be optimized for inference economics — cost per query, latency per response, throughput per dollar. This is where specialized inference chips (Groq, AWS Inferentia) offer compelling alternatives to NVIDIA.
The Infrastructure Decision Framework
What every executive needs to know about AI compute
The Strategic Questions
1. Build or consume? — Are you training custom models (need GPU infrastructure) or using pre-built AI services (API calls, no GPUs needed)?

2. Cloud or on-premises? — Cloud for bursty workloads and experimentation. On-premises only if sustained utilization exceeds roughly 70% and you have the operational expertise.

3. Which ecosystem? — NVIDIA/CUDA for maximum flexibility and ecosystem support. Hyperscaler custom silicon (TPU, Trainium) for cost savings within that cloud. AMD for budget-conscious workloads willing to accept ecosystem tradeoffs.
The Cost Hierarchy
From most to least expensive per AI capability:

1. Train from scratch — $50K–$100M+. Only justified for frontier models or highly proprietary data.
2. Fine-tune a foundation model — $1K–$50K. The sweet spot for most enterprises.
3. Use a pre-trained model via API — Pay per query. No infrastructure needed. Fastest time to value.
4. Use an open-source model — Free model, pay only for hosting. Maximum control, moderate complexity.
The bottom line: Compute is the third pillar of AI (alongside data and algorithms). NVIDIA dominates today with 80–90% market share and $112B in data center revenue. But the landscape is diversifying. For most enterprises, the right answer is cloud-based, API-first, with GPU infrastructure only for workloads that justify it. Don’t buy GPUs until you’ve exhausted what APIs can do.