Ch 14 — The Future of AI Infrastructure

Co-packaged optics, chiplets, wafer-scale compute, photonics, sovereign AI, and what comes next
Scaling Laws and Infrastructure Demand
Why AI infrastructure needs keep growing exponentially
The Insatiable Appetite
AI scaling laws show that model performance improves predictably with more compute, data, and parameters. This creates an exponential demand curve for infrastructure:

Training compute: Frontier training runs have grown ~4× per year since 2020. GPT-3 (2020): ~3.6×10²³ FLOPs. GPT-4 (2023): ~2×10²⁵ FLOPs. Next-gen (2026): estimated 10²⁶–10²⁷ FLOPs.

Inference compute: Growing even faster than training. As AI becomes embedded in every product, inference demand scales with users × queries × model size. ChatGPT alone serves hundreds of millions of users.

The infrastructure response: Every 10× increase in compute demand requires new infrastructure paradigms. Current GPU clusters are hitting walls in networking, power, and cooling that incremental improvements can’t solve.
Infrastructure Scaling Timeline
Era    Compute      Cluster       Bottleneck
──────────────────────────────────────────────────
2020   10²³ FLOPs   ~1K GPUs      GPU supply
2022   10²⁴ FLOPs   ~10K GPUs     Networking
2024   10²⁵ FLOPs   ~100K GPUs    Power/cooling
2026   10²⁶ FLOPs   ~500K GPUs    Grid capacity
2028   10²⁷ FLOPs   ~1M+ GPUs     Physics limits?

Investment trajectory:
2023: ~$50B in AI infrastructure
2024: ~$100B
2025: ~$200B+
2026: ~$300-500B (projected)

# Each bottleneck spawns new technology:
# GPU supply → custom chips (TPU, Trainium)
# Networking → co-packaged optics, 1.6T
# Power → liquid cooling, nuclear SMRs
# Physics → wafer-scale, photonic compute
Key insight: AI infrastructure is on a trajectory similar to the early internet — exponential demand that outpaces every prediction. The companies and countries that invest in infrastructure today will have a structural advantage for the next decade. The bottleneck shifts every 2 years, but the demand curve never flattens.
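The growth curve above can be made concrete with a small extrapolation. This is a sketch using the chapter's own figures (~4× per year from GPT-3's ~3.6×10²³ FLOPs in 2020), not an authoritative forecast; the function name is ours.

```python
# Project frontier training compute from the ~4x/year growth rate
# cited above. Baseline and rate are the chapter's figures.

GROWTH_PER_YEAR = 4.0          # ~4x per year since 2020
BASELINE_YEAR = 2020
BASELINE_FLOPS = 3.6e23        # GPT-3 training compute

def projected_training_flops(year: int) -> float:
    """Extrapolate total training compute for a given year."""
    return BASELINE_FLOPS * GROWTH_PER_YEAR ** (year - BASELINE_YEAR)

for year in (2023, 2026):
    print(f"{year}: ~{projected_training_flops(year):.1e} FLOPs")
```

Plugging in 2023 lands near GPT-4's reported ~2×10²⁵ FLOPs, and 2026 lands in the 10²⁶–10²⁷ range quoted above, which is what makes the simple exponential a useful planning model.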
Co-Packaged Optics & 1.6T Networking
Moving data at the speed of light, directly from the chip
The Interconnect Wall
Current AI clusters use electrical signaling (copper cables) for short distances and pluggable optical transceivers for longer runs. Both are hitting limits:

Electrical: Copper cables lose signal quality beyond ~3 meters at 400G speeds. At 800G+, even 1-meter runs become unreliable. Power consumption scales quadratically with speed.

Pluggable optics: Transceivers sit at the edge of the switch, converting electrical signals to optical. Each conversion wastes energy (5–15W per 400G port) and adds latency. A 128-port switch with pluggable optics consumes 1–2 kW just for transceivers.

Co-Packaged Optics (CPO) integrates optical components directly into the switch/GPU package, eliminating the electrical-to-optical conversion at the board edge.
CPO and 1.6T Advances
Lightmatter Passage L200 (March 2025):
• 3D co-packaged optics
• Bandwidth: 32-64 Tbps per package
• Total I/O: 200+ Tbps per chip
• 5-10× improvement over existing CPO

Lightmatter + Qualcomm (March 2026):
• Record 1.6 Tbps per fiber
• 16-wavelength DWDM architecture
• 8× more bandwidth than existing solutions

NVIDIA Spectrum-X / Quantum-X Photonics:
• CPO integrated into switch ASICs
• TSMC SoIC with hybrid bonding
• Sub-10 μm pitch interconnects

Impact on AI clusters:
Current: 400G per port, ~5W per port
CPO: 1.6T per port, ~2W per port
→ 4× bandwidth, 60% less power
→ Enables million-GPU clusters
Key insight: Co-packaged optics is like replacing the postal service with fiber-optic internet. Today’s pluggable transceivers are like mailboxes at the curb — you have to walk the letter from your desk to the mailbox. CPO puts the fiber-optic line directly on your desk. Less walking (power), faster delivery (bandwidth), and it scales to the whole neighborhood (cluster).
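The power arithmetic behind the "60% less power" claim is worth doing once by hand. A back-of-envelope sketch, using the per-port figures quoted above (~5 W per 400G pluggable, ~2 W per 1.6T CPO port); the 128-port count is illustrative:

```python
# Compare per-switch optics power for pluggable transceivers vs CPO,
# using the per-port wattages quoted in this section (assumptions,
# not vendor specs).

PORTS = 128

def pluggable_power_w(ports: int, w_per_port: float = 5.0) -> float:
    # 400G pluggable transceivers at the switch edge
    return ports * w_per_port

def cpo_power_w(ports: int, w_per_port: float = 2.0) -> float:
    # 1.6T co-packaged optics: 4x the bandwidth of a 400G port
    return ports * w_per_port

pluggable = pluggable_power_w(PORTS)   # 128 x 5 W = 640 W
cpo = cpo_power_w(PORTS)               # 128 x 2 W = 256 W
print(f"bandwidth: 4x, power saved: {1 - cpo / pluggable:.0%}")
```

Note the comparison understates the win: each CPO port also carries 4× the bandwidth, so per-bit the savings are closer to 10×.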
Chiplet Architectures: Beyond Monolithic Dies
Building bigger chips from smaller, specialized pieces
Why Chiplets?
Monolithic chips (one big die) are hitting manufacturing limits. The H100 die is 814 mm² — close to the maximum reticle size (~858 mm²) that lithography machines can pattern in one shot. Making dies bigger means lower yields (more defects) and exponentially higher costs.

Chiplets solve this by building a large chip from multiple smaller dies connected via advanced packaging:

1. Higher yield: Small dies have fewer defects. A defective chiplet is discarded; the rest are fine. A defect on a monolithic die wastes the entire chip.

2. Mix-and-match: Compute chiplets on cutting-edge 3nm, I/O chiplets on cheaper 7nm, memory on specialized process. Each component uses the optimal technology.

3. Scalability: Add more chiplets for more compute. AMD’s MI300X uses 13 chiplets (8 compute + 4 I/O + 1 memory controller) to create a 153B-transistor package.
Chiplet Examples in AI
AMD MI300X (2024):
• Chiplets: 13 (8 XCD + 4 IOD + 1)
• Transistors: 153 billion
• HBM3: 192 GB (8 stacks)
• Packaging: 2.5D CoWoS

Intel Gaudi 3 (2024):
• Chiplets: 2 compute dies
• Connected via EMIB bridge
• Each die: separate power domain

NVIDIA B200 / GB200 (2025):
• B200: two GPU dies, 208B transistors total
• GB200: two B200s + one Grace CPU, linked via NVLink-C2C

Future (2027+):
• UCIe (Universal Chiplet Interconnect Express)
• Standard interface for mixing chiplets from different vendors
→ "LEGO blocks" for chip design
Key insight: Chiplets are to chips what containers are to software. Instead of one monolithic application (die), you build from modular, reusable components. Each chiplet can be developed, tested, and manufactured independently. The UCIe standard will eventually let you mix NVIDIA compute chiplets with Samsung memory chiplets — like plugging LEGO bricks from different sets together.
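The yield argument in point 1 above can be sketched with the simplest defect model: under a Poisson assumption, die yield falls off as exp(−defect density × area), so splitting one big die into eight small ones dramatically raises per-die yield. The defect density below is an illustrative assumption, not a foundry figure.

```python
import math

# Why small dies yield better: simple Poisson defect model,
# yield = exp(-defect_density * area).

DEFECTS_PER_MM2 = 0.001   # assumed: 0.1 defects per cm^2

def die_yield(area_mm2: float, d0: float = DEFECTS_PER_MM2) -> float:
    """Probability a die of the given area has zero defects."""
    return math.exp(-d0 * area_mm2)

mono = die_yield(814)            # one H100-sized monolithic die
chiplet = die_yield(814 / 8)     # one of 8 chiplets covering the same area
print(f"monolithic: {mono:.1%}, per-chiplet: {chiplet:.1%}")
```

With chiplets, a failed small die costs you ~100 mm² of silicon instead of the full 814 mm², which is where the cost advantage comes from even after packaging overhead.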
Wafer-Scale Compute: One Chip to Rule Them All
Cerebras and the radical approach of using an entire silicon wafer as one chip
The Wafer-Scale Idea
Instead of cutting a wafer into hundreds of small chips, what if you used the entire wafer as one giant chip? That’s Cerebras’s approach.

Cerebras WSE-3 (2024):
• Die size: 46,225 mm² (entire 300mm wafer)
• Transistors: 4 trillion (vs 80B for H100)
• On-chip memory: 44 GB SRAM (vs 50 MB L2 on H100)
• Cores: 900,000
• Memory bandwidth: 21 PB/s (on-chip SRAM)

The key advantage: no off-chip communication. On a GPU cluster, most time is spent moving data between GPUs over NVLink/InfiniBand. On a wafer-scale chip, all data stays on-chip at SRAM speeds — 1,000× faster than HBM and 100,000× faster than network.
WSE-3 vs GPU Cluster
Metric            WSE-3         H100 (8-GPU)
──────────────────────────────────────────────
Transistors       4T            640B (8×80B)
On-chip memory    44 GB SRAM    400 MB L2
Off-chip memory   1.5-12 TB*    640 GB HBM3
Memory BW         21 PB/s       26.8 TB/s
Interconnect BW   On-chip       7.2 TB/s NVLink
Power             ~23 kW        ~10.2 kW
* External MemoryX system

Inference advantage:
• Llama 70B inference: 20× faster than GPU
• Entire model fits in on-chip SRAM
• No memory bandwidth bottleneck

Cerebras for Nations (2025):
• Sovereign AI initiative
• Countries deploy WSE-3 clusters
• Models: Jais (Arabic), Nanda (Hindi)
• OpenAI partnership: $10B+, 750 MW
Key insight: Wafer-scale compute is the “what if we just made it really, really big?” approach to AI hardware. It sounds crazy, but it eliminates the #1 bottleneck in AI: moving data between chips. If your model fits on one wafer, you never pay the communication tax. The trade-off: you need a whole new ecosystem (cooling, packaging, software) to make it work.
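The "communication tax" can be put in concrete units: the time to stream a 70B-parameter model's weights (4-bit quantized, ~35 GB) once at each tier's bandwidth. The bandwidth figures are the approximate numbers from the table above; treating one full weight pass per token is a simplification.

```python
# Time to read 70B quantized weights once at each memory/interconnect
# tier's bandwidth (figures approximated from the comparison table).

WEIGHTS_BYTES = 70e9 * 0.5          # 70B params at 4 bits/weight = 35 GB

tiers = {
    "WSE-3 on-chip SRAM": 21e15,    # 21 PB/s
    "H100 HBM3 (8 GPUs)": 26.8e12,  # 26.8 TB/s aggregate
    "NVLink (8 GPUs)":    7.2e12,   # 7.2 TB/s
}

for name, bw in tiers.items():
    print(f"{name}: {WEIGHTS_BYTES / bw * 1e6:.1f} us per weight pass")
```

On-chip SRAM reads the whole model in microseconds, while the HBM and NVLink tiers take milliseconds per pass, which is the gap behind the 20× inference speedup claimed above.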
Photonic Computing: Computing with Light
Using photons instead of electrons for matrix multiplication
Why Photonics?
Electronic computing faces fundamental limits: as transistors shrink, power density increases (hitting thermal walls), and interconnect resistance grows (slowing signals). Photonic computing uses light instead of electricity for computation, offering theoretical advantages:

Speed: Light propagates through optical waveguides at roughly two-thirds of its 3×10⁸ m/s vacuum speed, with near-zero latency and none of the RC delay of copper wires.

Parallelism: Multiple wavelengths of light can travel through the same waveguide simultaneously (wavelength-division multiplexing). One fiber can carry 100+ independent data channels.

Energy: Photonic matrix multiplication can theoretically be done with near-zero energy — light passing through a configured optical element performs the multiply-accumulate “for free.”

Reality check: Photonic computing is still early-stage. Current systems handle specific operations (matrix multiply) but need electronic components for nonlinear operations, memory, and control logic.
Photonic AI Companies
Lightmatter:
• Passage: Photonic interconnect (CPO)
• Envise: Photonic compute chip
• Approach: Hybrid photonic-electronic
• Status: Production CPO (2025), compute R&D

Luminous Computing:
• Full photonic AI accelerator
• Target: 10× energy efficiency vs GPU
• Status: Pre-production

Intel (Integrated Photonics):
• Silicon photonics in data center switches
• Co-packaged with standard CMOS
• Status: Production (interconnect only)

Timeline:
2025-2026: Photonic interconnects in production
2027-2028: Photonic accelerators in pilot
2029-2030: Photonic compute at scale (maybe)

# Photonic interconnects are real and shipping.
# Photonic compute is promising but 3-5 years
# from production deployment.
Key insight: Photonic computing is at the stage where transistors were in the 1950s — the physics works, early prototypes exist, but the manufacturing ecosystem is immature. Photonic interconnects (moving data with light) are already in production and transformative. Photonic compute (doing math with light) is the holy grail but still 3–5 years from production. Watch this space.
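The parallelism claim above reduces to one line of arithmetic: with wavelength-division multiplexing, a fiber's bandwidth is channels × per-channel rate. A minimal sketch (the 16 × 100G configuration mirrors the 1.6 Tbps-per-fiber result cited earlier; the function name is ours):

```python
# WDM in one line: independent data channels on separate wavelengths
# share one waveguide, so fiber bandwidth = channels x per-channel rate.

def fiber_bandwidth_gbps(wavelengths: int, gbps_per_lambda: float) -> float:
    return wavelengths * gbps_per_lambda

# 16-wavelength DWDM at 100G per wavelength → 1.6 Tbps per fiber
print(fiber_bandwidth_gbps(16, 100))
```

Adding channels scales bandwidth without laying more fiber, which is why DWDM is the workhorse of both telecom and the CPO designs in the previous section.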
Sovereign AI: Nations Building Their Own
Why countries are investing billions in domestic AI infrastructure
The Sovereignty Imperative
AI infrastructure is becoming a matter of national security and economic competitiveness. Countries that depend on foreign cloud providers for AI face risks:

Data sovereignty: Sensitive government, healthcare, and financial data processed on foreign infrastructure is subject to foreign laws (e.g., U.S. CLOUD Act).

Supply chain risk: GPU export controls (U.S. restrictions on China) demonstrate that hardware access can be weaponized. Countries without domestic AI infrastructure are vulnerable.

Economic value: AI is projected to add $15.7 trillion to global GDP by 2030 (PwC). Countries without AI capability risk becoming digital colonies — consuming AI services built elsewhere, with value flowing out.

Cultural preservation: LLMs trained primarily on English data underperform in other languages. Sovereign AI enables training on local languages, legal systems, and cultural contexts.
Sovereign AI Initiatives
Cerebras for Nations (2025):
• WSE-3 clusters for sovereign AI
• Models: Jais (Arabic), Nanda (Hindi), SHERKALA (Kazakh), FLOR (Spanish-Catalan)

EU AI Factories:
• €4B+ investment in AI supercomputers
• LUMI (Finland), Leonardo (Italy), MareNostrum 5 (Spain)

Saudi Arabia (NEOM):
• $100B+ AI infrastructure investment
• Partnership with NVIDIA, Cerebras
• Target: Regional AI hub

India (IndiaAI Mission):
• 10,000+ GPU national AI compute
• Focus: Hindi, regional language models
• Public-private partnership

Japan (ABCI 3.0):
• Next-gen national supercomputer
• Focus: Japanese language AI, robotics

Trend: AI infrastructure as a national asset, like roads, ports, and power grids.
Key insight: Sovereign AI is the 21st-century equivalent of building national railways in the 19th century. Countries that built railways first industrialized first. Countries that build AI infrastructure first will lead the AI economy. The investment is massive ($1–100B per country), but the cost of not investing is permanent technological dependence.
Edge AI & Distributed Inference
Moving AI compute closer to users for latency, privacy, and cost
Why Edge AI?
Not all AI needs a data center. Many applications benefit from running inference on-device or at the network edge:

Latency: A round trip to a cloud data center takes 20–100ms. On-device inference takes 5–20ms. For real-time applications (autonomous driving, AR/VR, voice assistants), this difference matters.

Privacy: Data never leaves the device. No cloud provider sees your queries. Critical for healthcare, legal, and personal AI assistants.

Cost: On-device inference is “free” after the hardware purchase. No per-token API costs. At billions of devices, this is transformative.

Offline capability: Works without internet. Essential for military, remote, and disaster-response applications.
Edge AI Hardware Landscape
Device             Compute     Memory     Models
──────────────────────────────────────────────────
Smartphone NPU     ~45 TOPS    8-16 GB    1-3B params
Apple M4 Pro       ~38 TOPS    24-48 GB   7-13B params
NVIDIA Jetson      ~275 TOPS   32-64 GB   13-70B params
Intel Core Ultra   ~34 TOPS    16-64 GB   3-13B params
Qualcomm X Elite   ~45 TOPS    16-64 GB   3-13B params

On-device model sizes (quantized):
Llama 3 8B (Q4): ~4.5 GB → Phone/laptop
Phi-3 Mini (Q4): ~2.3 GB → Phone
Gemma 2 9B (Q4): ~5.5 GB → Laptop
Llama 3 70B (Q4): ~40 GB → Workstation

Trend: Every 2 years, the model that required a data center moves to a laptop, then a phone. GPT-3.5 level (2022 cloud) → phone (2025).
Key insight: Edge AI follows the same trajectory as computing itself: mainframe → PC → laptop → phone. Today’s cloud-only AI models will run on tomorrow’s phones. The infrastructure implication: the future isn’t just bigger data centers — it’s also billions of tiny AI computers in every device, with cloud serving as the “heavy lifting” tier for the largest models.
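The model sizes in the table above follow a simple sizing rule: quantized size ≈ params × bits per weight / 8, plus some overhead for embeddings, quantization scales, and metadata. A sketch (the 10% overhead factor is our assumption, chosen to roughly match the table):

```python
# Rough on-device model sizing: params x bits/8, plus ~10% overhead
# (overhead factor is an illustrative assumption).

def quantized_size_gb(params_billions: float, bits: int = 4,
                      overhead: float = 1.10) -> float:
    """Approximate file size of a quantized model in GB."""
    return params_billions * 1e9 * bits / 8 * overhead / 1e9

for name, params in [("Llama 3 8B", 8), ("Llama 3 70B", 70)]:
    print(f"{name} (Q4): ~{quantized_size_gb(params):.1f} GB")
```

This gives ~4.4 GB for an 8B model and ~38 GB for 70B, close to the table's ~4.5 GB and ~40 GB figures, and explains directly why 8B fits a phone's 8-16 GB while 70B needs a workstation.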
The 2030 Vision: What AI Infrastructure Looks Like
Putting it all together — the data center of 2030
The 2030 AI Data Center
Compute: Next-gen GPUs at 2–5 nm, delivering 10–50× today’s FLOPS/watt. Chiplet architectures with UCIe allowing mix-and-match compute, memory, and I/O dies. Wafer-scale chips for specialized inference workloads.

Networking: Co-packaged optics at 1.6–3.2 Tbps per port. Photonic interconnects enabling flat, non-blocking fabrics across 100K+ GPUs. All-optical switching eliminating electronic bottlenecks in the network core.

Memory: HBM4E at 2+ TB/s per stack. Processing-in-memory (PIM) for attention operations. Persistent memory tiers blurring the line between storage and DRAM.

Cooling: Immersion cooling as the default. 200–350 kW per rack. PUE below 1.05. Zero-water closed-loop systems. Waste heat powering district heating.

Power: SMR nuclear providing carbon-free baseload. On-site renewable + battery for peak shaving. 500 MW+ campuses purpose-built for AI.
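The cooling and power bullets above hinge on PUE (Power Usage Effectiveness): total facility power divided by IT power. A sketch of what moving from today's typical air-cooled PUE to the sub-1.05 target means at campus scale (the IT load and the 1.4 baseline are illustrative assumptions):

```python
# PUE = total facility power / IT power. Compare a typical air-cooled
# facility with the immersion-cooled PUE < 1.05 target above.
# IT load and baseline PUE are illustrative, not measured figures.

def facility_power_mw(it_power_mw: float, pue: float) -> float:
    """Total facility draw implied by an IT load and a PUE."""
    return it_power_mw * pue

it_load = 500.0                                # MW, a 500 MW+ AI campus
today = facility_power_mw(it_load, 1.40)       # typical air-cooled PUE
future = facility_power_mw(it_load, 1.05)      # immersion-cooled target
print(f"overhead saved: {today - future:.0f} MW")
```

At a 500 MW IT load, that PUE improvement frees up well over 100 MW of grid capacity, which is why cooling efficiency and the grid-capacity bottleneck from the first section are the same problem.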
Technology Readiness Timeline
Technology               Ready     Impact
──────────────────────────────────────────────────
1.6T networking          2026      4× bandwidth
HBM4                     2026      2× memory BW
Co-packaged optics       2026-27   60% less power
Immersion cooling        2026-27   350 kW/rack
UCIe chiplets            2027-28   Modular chips
Nuclear SMRs             2028-30   Carbon-free power
Photonic compute         2029-30   10× efficiency
Wafer-scale mainstream   2028-30   No comm overhead

What this means for practitioners:
2026: Learn liquid cooling, 800G networking
2027: Adopt CPO, chiplet-based accelerators
2028: Plan for SMR-powered facilities
2029: Evaluate photonic accelerators
2030: Operate heterogeneous compute fabrics
Key insight: The AI infrastructure of 2030 will be as different from today as today’s cloud is from 2010’s on-premise servers. Every layer of the stack — compute, memory, networking, cooling, power — is being reinvented simultaneously. The practitioners who understand these trends won’t just operate infrastructure; they’ll architect the systems that make the next generation of AI possible. That’s why you just spent 14 chapters learning this. Now go build.