
Key Insights — AI Infrastructure

A high-level summary of the core concepts across all 14 chapters.
Foundation: Compute & Memory (Chapters 1-4)
1. AI workloads require massive parallel matrix multiplication, which CPUs are not designed for.
  • Sequential vs Parallel: CPUs are optimized for complex, sequential tasks (low latency). GPUs are optimized for simple, parallel tasks (high throughput).
  • The Matrix Math: Neural networks are essentially giant matrix multiplications. A GPU can do thousands of these simultaneously.
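To see why this parallelizes so well, here is a toy matrix multiply in plain Python (names are illustrative, not from the book): every output cell is an independent dot product, so nothing stops a GPU from computing thousands of them at once.

```python
def matmul(A, B):
    # C = A @ B: cell (i, j) depends only on row i of A and column j
    # of B, never on another cell of C, so all n*m dot products can
    # run simultaneously on separate GPU threads.
    n, m = len(A), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(m)] for i in range(n)]

C = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])  # [[19, 22], [43, 50]]
```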
2. Modern GPUs are highly specialized factories built specifically for AI math.
  • Tensor Cores: Specialized silicon, introduced by NVIDIA with the Volta generation, that performs an entire small matrix multiply-accumulate operation per clock cycle.
  • Streaming Multiprocessors (SMs): The core building blocks of a GPU that manage thousands of concurrent threads to hide memory latency.
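Logically, a Tensor Core executes one fused D = A x B + C on a small tile. The sketch below mimics that op in plain Python on an assumed 4x4 tile size; real hardware does this in fused silicon, not a loop.

```python
def mma_4x4(A, B, C):
    """One logical Tensor Core op: D = A @ B + C on 4x4 tiles.
    The hardware performs this fused, roughly one tile per clock;
    here it is unrolled in software purely for illustration."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) + C[i][j]
             for j in range(4)] for i in range(4)]
```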
3. NVIDIA dominates, but alternatives are rising to challenge its near-monopoly.
  • NVIDIA's Moat: CUDA software ecosystem is the primary reason NVIDIA maintains its lead, not just hardware.
  • Custom Silicon: Google (TPU), AWS (Trainium/Inferentia), and AMD (MI300) offer competitive hardware, primarily for internal or specific cloud workloads.
4. AI is often constrained by how fast data can move, not how fast it can be computed.
  • HBM (High Bandwidth Memory): Stacked memory chips placed directly next to the GPU die to provide massive bandwidth (e.g., 3+ TB/s on H100).
  • The Bottleneck: Compute speed (TFLOPS) has grown much faster than memory bandwidth, making memory access the primary bottleneck in modern AI.
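A quick roofline-style estimate shows why bandwidth, not TFLOPS, is usually the limit. The peak figures below are assumed H100 SXM numbers (dense BF16 compute, HBM3 bandwidth), used only for the arithmetic:

```python
# Back-of-envelope roofline: is a kernel compute- or memory-bound?
peak_flops = 989e12   # assumed H100 dense BF16 peak, FLOP/s
peak_bw    = 3.35e12  # assumed H100 SXM HBM3 bandwidth, bytes/s

# "Machine balance": FLOPs the chip can do per byte it can move.
machine_balance = peak_flops / peak_bw        # ~295 FLOPs/byte

# A matrix-vector multiply (typical of LLM decoding) does ~2 FLOPs
# per 2-byte weight read: arithmetic intensity ~1 FLOP/byte.
gemv_intensity = 1.0
memory_bound = gemv_intensity < machine_balance  # bandwidth-limited
```

Any kernel whose arithmetic intensity falls below the machine balance leaves compute units idle while they wait on HBM.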
The Bottom Line: AI compute is fundamentally about parallel matrix math and memory bandwidth. GPUs, equipped with Tensor Cores and HBM, are the engines driving the AI revolution.
Networking: Interconnects & Topologies (Chapters 5-6)
5. When one GPU isn't enough, the speed at which GPUs talk to each other becomes the bottleneck.
  • NVLink: NVIDIA's proprietary interconnect that allows GPUs within the same server to share memory at massive speeds (900 GB/s on H100).
  • InfiniBand vs Ethernet: InfiniBand offers ultra-low latency and lossless networking for connecting servers together, though RoCE (RDMA over Converged Ethernet) is catching up.
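The interconnect's impact can be estimated with the standard ring all-reduce bandwidth model, which ignores latency; the link speeds below are assumptions (NVLink per the text, ~50 GB/s standing in for 400 GbE):

```python
def allreduce_seconds(size_bytes, n_gpus, link_gb_per_s):
    """Bandwidth-only estimate for ring all-reduce: each GPU
    transfers 2*(N-1)/N * S bytes over its link (latency ignored)."""
    bytes_moved = 2 * (n_gpus - 1) / n_gpus * size_bytes
    return bytes_moved / (link_gb_per_s * 1e9)

# Syncing 1 GB of gradients across 8 GPUs:
t_nvlink = allreduce_seconds(1e9, 8, 900)  # NVLink, ~900 GB/s
t_eth    = allreduce_seconds(1e9, 8, 50)   # ~400 GbE class link
```

The same gradient sync is roughly 18x slower over the Ethernet-class link, which is why scale-up bandwidth inside the server matters so much.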
6. How you wire thousands of GPUs together dictates how efficiently they can train a model.
  • Fat-Tree / Clos: Non-blocking network designs that ensure any GPU can talk to any other GPU at full bandwidth.
  • Rail Optimization: Ensuring that GPU 1 on Server A talks directly to GPU 1 on Server B to minimize network hops.
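A toy sketch of the rail idea, with a hypothetical `leaf_switch` mapping: because the leaf switch is chosen by GPU index rather than by server, same-index GPUs on any two servers are always one switch apart.

```python
def leaf_switch(server, gpu_index):
    """Rail-optimized wiring: the leaf switch is picked by GPU index
    (the "rail"), so GPU i on every server shares a leaf switch with
    GPU i on every other server."""
    return gpu_index

def switch_hops(a_server, a_gpu, b_server, b_gpu):
    # Same rail: traffic crosses one leaf switch. Different rails:
    # it must also climb to a spine switch and back down.
    same_rail = leaf_switch(a_server, a_gpu) == leaf_switch(b_server, b_gpu)
    return 1 if same_rail else 3

assert switch_hops(0, 1, 5, 1) == 1  # same rail, different servers
assert switch_hops(0, 1, 5, 2) == 3  # cross-rail: leaf -> spine -> leaf
```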
The Bottom Line: A supercomputer is only as fast as its network. High-speed interconnects like NVLink and InfiniBand are what turn 10,000 individual GPUs into a single massive AI brain.
Training: Distributed Training & Clusters (Chapters 7-8)
7. Large models don't fit on a single GPU, so training must be split across hundreds or thousands of chips.
  • Data Parallelism: Copy the model to every GPU, split the data. (Standard for smaller models).
  • Tensor Parallelism: Split individual matrix multiplications across multiple GPUs. (Requires NVLink).
  • Pipeline Parallelism: Put different layers of the model on different GPUs.
  • 3D Parallelism: Combining all three methods to train massive models like GPT-4.
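Tensor parallelism is the least intuitive of the three, so here is a minimal column-parallel sketch in pure Python: each simulated "device" holds a slice of the weight matrix's columns, computes its shard independently, and the shards are stitched back together (the all-gather step).

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def column_parallel(A, B, n_devices=2):
    """Column-parallel tensor parallelism: device d holds columns
    [d*per, (d+1)*per) of B and computes its slice of A @ B."""
    cols = list(zip(*B))
    per = len(cols) // n_devices
    shards = []
    for d in range(n_devices):
        B_d = [list(r) for r in zip(*cols[d * per:(d + 1) * per])]
        shards.append(matmul(A, B_d))   # runs on device d
    # All-gather: concatenate the column shards back together.
    return [sum((shard[i] for shard in shards), []) for i in range(len(A))]

A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
assert column_parallel(A, B) == matmul(A, B)
```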
8. Building an AI supercomputer is an extreme engineering challenge involving power, storage, and fault tolerance.
  • Fault Tolerance: When training on 10,000 GPUs for months, hardware failures are guaranteed. Checkpointing and fast recovery are essential.
  • Storage Bottlenecks: GPUs consume data so fast that traditional storage arrays can't keep up, necessitating specialized parallel file systems.
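A minimal checkpoint/resume loop, using JSON for state purely for illustration (real systems checkpoint tensors to parallel file systems): a crash loses at most `every` steps of work, because a restarted job picks up from the last saved step instead of step zero.

```python
import json
import os
import tempfile

def train(total_steps, ckpt_path, every=100):
    """Sketch of checkpoint/resume: persist step and state
    periodically so a restart resumes instead of starting over."""
    step, state = 0, 0.0
    if os.path.exists(ckpt_path):            # recover after a failure
        with open(ckpt_path) as f:
            saved = json.load(f)
        step, state = saved["step"], saved["state"]
    while step < total_steps:
        step += 1
        state += 0.1                         # stand-in for a real update
        if step % every == 0 or step == total_steps:
            with open(ckpt_path, "w") as f:  # save a checkpoint
                json.dump({"step": step, "state": state}, f)
    return step, state

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
train(250, path)                  # first run, then a "crash"
step, state = train(500, path)    # new process resumes from step 250
```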
The Bottom Line: Training frontier models requires orchestrating "3D Parallelism" across massive, highly tuned clusters where hardware failures are a daily occurrence.
Serving: Inference & Data Pipelines (Chapters 9-10)
9. Serving models to users efficiently is a completely different engineering challenge than training them.
  • KV Cache: Storing previous token calculations in memory so the model doesn't have to recompute the entire prompt for every new word.
  • Continuous Batching: Dynamically swapping requests in and out of the GPU to maximize utilization, rather than waiting for all requests in a batch to finish.
  • vLLM & TensorRT-LLM: Specialized software engines designed to maximize inference throughput.
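The KV-cache idea cuts per-token work from re-encoding the whole prompt to encoding only the newest token. A toy version, with a dot product standing in for attention (the class and names are illustrative, not from any real engine):

```python
def attend(query, keys):
    # Toy "attention": dot-product scores against every cached key.
    return [sum(q * k for q, k in zip(query, key)) for key in keys]

class KVCache:
    """Sketch of a KV cache: keys for already-processed tokens stay
    in memory, so each decode step computes only its own entry."""
    def __init__(self):
        self.keys = []
        self.compute_calls = 0

    def step(self, token_vec):
        self.compute_calls += 1      # one projection per NEW token only
        self.keys.append(token_vec)
        return attend(token_vec, self.keys)

cache = KVCache()
for tok in [[1, 0], [0, 1], [1, 1]]:
    scores = cache.step(tok)
# Without the cache, step t would re-encode all t tokens: 1+2+3 = 6
# projections for this 3-token sequence instead of 3.
assert cache.compute_calls == 3
```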
10. Data preparation is the hidden infrastructure cost of AI.
  • Data Ingestion: Moving petabytes of unstructured data (text, images, video) into scalable object storage.
  • Vector Databases: Specialized databases (like Pinecone or Milvus) designed to store and quickly search high-dimensional embeddings for RAG applications.
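Under the hood this is nearest-neighbor search over embeddings. The brute-force version below is what products like Pinecone or Milvus accelerate with approximate indexes (HNSW, IVF); the documents and vectors are made up:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query, store, k=1):
    """Exhaustive similarity search: rank every stored embedding
    against the query. Vector DBs replace this O(n) scan with
    approximate indexes that stay fast at billions of vectors."""
    ranked = sorted(store, key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

store = [("doc_a", [1.0, 0.0]), ("doc_b", [0.0, 1.0]), ("doc_c", [0.7, 0.7])]
assert nearest([0.9, 0.1], store) == ["doc_a"]
```

In a RAG pipeline, the returned document IDs are then fetched and injected into the model's prompt as context.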
The Bottom Line: Inference economics dictate product viability. Techniques like KV caching and continuous batching are required to serve LLMs at scale without going bankrupt.
Ops: Power, Cloud & Orchestration (Chapters 11-14)
11. The physical limits of electricity and heat are the biggest constraints on AI scaling.
  • Power Density: AI racks consume 40-100+ kW, compared to 10-15 kW for traditional cloud servers.
  • Liquid Cooling: Air cooling is no longer sufficient for chips like the B200; direct-to-chip liquid cooling is becoming mandatory.
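Illustrative arithmetic only (700 W per GPU and a 1.8x facility overhead for CPUs, NICs, fans, and power-conversion losses are assumptions, not quoted figures): a single rack of four 8-GPU servers already lands in the range the text describes.

```python
def rack_kw(n_gpus, gpu_watts=700, overhead=1.8):
    """Rough rack power: GPU draw times an assumed overhead factor
    for everything else in the chassis and power chain."""
    return n_gpus * gpu_watts * overhead / 1000

power = rack_kw(32)  # four 8-GPU servers: ~40 kW, past air-cooling comfort
```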
12. The massive cost of GPUs is shifting the traditional "cloud-first" calculus.
  • Cloud: Best for elasticity, burst workloads, and avoiding massive upfront CapEx.
  • On-Prem / Colocation: For continuous, 24/7 training workloads, owning the hardware can be significantly cheaper over a 3-year lifespan.
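A sketch of the break-even math with invented numbers ($30k per GPU purchase price, $2.50/hr cloud rate, $0.50/hr owned opex for power, cooling, and staff); the point is the shape of the calculation, not the figures:

```python
def breakeven_hours(capex_per_gpu, cloud_rate_hr, owned_opex_hr):
    """Hours of use after which owning beats renting: the upfront
    cost divided by the per-hour savings of owned vs. cloud."""
    return capex_per_gpu / (cloud_rate_hr - owned_opex_hr)

hours = breakeven_hours(30_000, cloud_rate_hr=2.50, owned_opex_hr=0.50)
years = hours / (24 * 365)   # ~1.7 years of 24/7 use
```

At 24/7 utilization the crossover arrives well inside a 3-year hardware lifespan, which is the case for owning; bursty workloads never reach it, which is the case for cloud.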
13. Managing expensive GPU resources requires specialized scheduling software.
  • Kubernetes for AI: K8s has been adapted with device plugins to manage GPU workloads, though specialized schedulers (like Slurm) are still used in HPC.
  • Bin Packing: Fitting multiple smaller workloads onto a single GPU (using NVIDIA's Multi-Instance GPU, MIG) to maximize utilization.
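Bin packing here is the classic first-fit heuristic applied to MIG slices (an A100/H100 exposes up to 7); the scheduler below is a toy, not Kubernetes' or Slurm's actual algorithm:

```python
def first_fit(slice_requests, gpu_capacity=7):
    """First-fit bin packing of MIG slice requests onto GPUs:
    place each request on the first GPU with room, and only
    provision a new GPU when none fits. Returns GPUs used."""
    gpus = []                       # remaining free slices per GPU
    for need in slice_requests:
        for i, free in enumerate(gpus):
            if free >= need:
                gpus[i] -= need     # pack onto an existing GPU
                break
        else:
            gpus.append(gpu_capacity - need)  # provision a new GPU
    return len(gpus)

# [3, 3] fill GPU 1 to 6/7; [2, 2, 2] go to GPU 2; [1] fits back on GPU 1.
assert first_fit([3, 3, 2, 2, 2, 1]) == 2
```

Without packing, those six workloads would each idle most of a dedicated GPU; packed, they share two.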
14. The hardware landscape is evolving rapidly to support trillion-parameter models.
  • Silicon Photonics: Using light instead of electricity to transmit data between chips, drastically reducing power consumption and latency.
  • Nuclear Power: Tech giants are investing in SMRs (Small Modular Reactors) to secure the massive, clean energy required for future gigawatt data centers.
The Bottom Line: AI infrastructure is hitting physical limits. The next frontier isn't just better chips, but innovations in power generation, liquid cooling, and optical networking.