Ch 3 — The Accelerator Zoo: GPUs, TPUs, and Beyond

NVIDIA, AMD, Google, AWS, Intel — who makes what, and when to pick which
NVIDIA: The Undisputed Leader
~80% market share in AI accelerators, powered by CUDA ecosystem lock-in
The NVIDIA Lineup
NVIDIA dominates AI infrastructure with a lineup spanning every price point and use case:

H100 (Hopper, 2022): The workhorse. 80 GB HBM3, 3,350 GB/s bandwidth, 990 TFLOPS FP16. Trained GPT-4, Llama 3, Claude 3. Still the most widely deployed AI GPU. 700W TDP, air-coolable.

H200 (Hopper refresh, 2023): Same compute as H100 but upgraded to 141 GB HBM3e with 4,800 GB/s bandwidth. A drop-in upgrade for memory-bound inference workloads. 700W TDP.

B200 (Blackwell, 2024): The new flagship. 192 GB HBM3e, 8,000 GB/s bandwidth, ~2,250 TFLOPS FP16. Dual-die chiplet design with 208 billion transistors. Requires liquid cooling at 1,000W.

GB200 NVL72: The rack-scale monster. 72 B200 GPUs + 36 Grace CPUs in a single rack with 130 TB/s of aggregate NVLink bandwidth. NVIDIA claims up to 30x faster LLM inference than equivalent H100-based systems.
NVIDIA Specs at a Glance
                H100            H200            B200
HBM             80 GB           141 GB          192 GB
Bandwidth       3,350 GB/s      4,800 GB/s      8,000 GB/s
FP16            990 TFLOPS      990 TFLOPS      ~2,250 TFLOPS
FP8             1,979 TFLOPS    1,979 TFLOPS    ~4,500 TFLOPS
NVLink          900 GB/s        900 GB/s        1,800 GB/s
TDP             700W            700W            1,000W
Cooling         Air             Air             Liquid
Price (est.)    $25,000-30,000  $30,000-35,000  $35,000-50,000

B200 is roughly 2.5x faster at Llama 3 70B training than H100, reducing epoch time from ~100 to ~40 hours.
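One way to read the price and compute rows together is performance per dollar. The sketch below uses the FP16 numbers and the midpoints of the estimated price ranges above; the midpoints are assumptions, not quoted prices.

```python
# Back-of-envelope FP16 TFLOPS per dollar, using the spec table above.
# Midpoint prices are rough assumptions taken from the estimated ranges.
chips = {
    # name: (FP16 TFLOPS, assumed midpoint price in USD)
    "H100": (990, 27_500),
    "H200": (990, 32_500),
    "B200": (2_250, 42_500),
}

for name, (tflops, price) in chips.items():
    print(f"{name}: {tflops / price * 1000:.1f} TFLOPS per $1,000")
```

By this rough measure the B200 delivers the most compute per dollar despite its higher sticker price, which is part of why the premium can pay off for compute-bound training.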
Key insight: NVIDIA’s real moat isn’t just hardware — it’s CUDA. Nearly every AI framework (PyTorch, TensorFlow, JAX) is optimized for CUDA first. Every researcher learns CUDA. Every library targets NVIDIA GPUs. This ecosystem lock-in is worth more than any single chip advantage.
AMD Instinct: The Challenger
Competitive hardware, growing software ecosystem with ROCm
AMD’s AI Accelerator Lineup
MI300X (CDNA 3, Dec 2023): AMD’s current-gen AI GPU. 192 GB HBM3 with 5,300 GB/s bandwidth — more memory and bandwidth than the H100. 2.61 PFLOPS FP8. 750W TDP. Used by Microsoft Azure, Oracle Cloud, and Meta for inference workloads.

MI350X (CDNA 4, June 2025): Major generational leap. 288 GB HBM3e with 8,000 GB/s bandwidth. 9.2 PFLOPS at MXFP4, 4.6 PFLOPS FP8. AMD claims up to a 35x inference improvement over the MI300X. 1,000W TDP, requires liquid cooling.

MI355X (2025): Higher-clocked variant with direct liquid cooling. 10.1 PFLOPS MXFP4. 1,400W peak power.

MI400 (announced): Next-generation “Helios” rack design. Details pending.

AMD’s advantage: more memory per chip. The MI300X’s 192 GB lets it run a 70B FP16 model on a single GPU — impossible on an 80 GB H100.
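The arithmetic behind that claim is simple: FP16 stores two bytes per parameter, so the weights alone for a 70B model need about 140 GB, before counting KV cache or activations. A minimal check:

```python
def weight_gb(params_b: float, bytes_per_param: int) -> float:
    """Approximate memory for model weights alone (no KV cache, no activations)."""
    return params_b * 1e9 * bytes_per_param / 1e9  # simplifies to params_b * bytes

llama_70b_fp16 = weight_gb(70, 2)
print(llama_70b_fp16)          # 140.0 GB of weights
print(llama_70b_fp16 <= 192)   # fits on one MI300X (192 GB HBM)
print(llama_70b_fp16 <= 80)    # does not fit on one H100 (80 GB HBM)
```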
AMD vs NVIDIA Comparison
                MI300X          MI350X          H100
HBM             192 GB          288 GB          80 GB
Bandwidth       5,300 GB/s      8,000 GB/s      3,350 GB/s
FP8             2,610 TFLOPS    4,600 TFLOPS    1,979 TFLOPS
FP16            1,300 TFLOPS    2,300 TFLOPS    990 TFLOPS
TDP             750W            1,000W          700W

Software stack:
NVIDIA: CUDA (mature, universal)
AMD: ROCm (improving, gaps remain)

Key ROCm-supported frameworks:
PyTorch: ✓ (good support)
JAX: ✓ (improving)
TensorFlow: ✓ (basic)
vLLM: ✓ (inference)

AMD hardware often wins on paper. The software gap is closing but remains NVIDIA's biggest advantage.
Key insight: AMD offers more memory per dollar and competitive raw FLOPS. For inference workloads where the model just needs to fit in memory and generate tokens, AMD can be a cost-effective choice. For training, where the CUDA ecosystem and NVLink interconnects matter more, NVIDIA still leads.
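For memory-bound single-stream decoding, a useful rule of thumb is that every generated token must stream all the weights from HBM once, so bandwidth divided by model size gives a hard ceiling on tokens per second. A sketch using this chapter's numbers (70B at FP16 is roughly 140 GB):

```python
def decode_tokens_per_sec_ceiling(bandwidth_gbs: float, model_gb: float) -> float:
    """Upper bound on single-stream decode speed when generation is
    memory-bound: each token streams every weight from HBM once."""
    return bandwidth_gbs / model_gb

MODEL_GB = 140  # 70B parameters at FP16 (2 bytes each)
print(f"MI300X: {decode_tokens_per_sec_ceiling(5300, MODEL_GB):.1f} tok/s ceiling")
print(f"MI350X: {decode_tokens_per_sec_ceiling(8000, MODEL_GB):.1f} tok/s ceiling")
```

Batching raises aggregate throughput well past this per-stream ceiling, because the weight reads are amortized across requests, but the bound explains why inference-oriented parts lead with memory bandwidth rather than FLOPS.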
Google TPU: Purpose-Built for AI
Custom silicon designed from scratch for matrix operations — available only on Google Cloud
TPU Architecture
Google’s Tensor Processing Units (TPUs) take a fundamentally different approach. Instead of adapting a graphics chip for AI (like NVIDIA), Google designed a chip exclusively for matrix multiplication from the ground up.

TPU v5p (2023): 459 TFLOPS BF16/FP8 per chip, 95 GB HBM, 2,765 GB/s bandwidth. Scales to 8,960 chips in a single pod using a 3D torus interconnect topology.

Trillium / TPU v6 (2024): 4.7x peak compute vs v5e, double the HBM capacity and bandwidth, double the inter-chip interconnect (ICI) bandwidth. Scales to 256 chips per pod, with 100,000+ chips deployable on Google’s Jupiter network fabric (13 petabits/sec bisection bandwidth).

Ironwood / TPU v7 (2025): Seventh generation, in preview since November 2025. Designed for large-scale training and inference of LLMs, MoE models, and diffusion models.
TPU Unique Advantages
TPU v5p specs:
Compute: 459 TFLOPS (BF16/FP8)
HBM: 95 GB per chip
Bandwidth: 2,765 GB/s
Max pod: 8,960 chips
Topology: 3D torus

Trillium / v6 improvements:
Compute: 4.7x vs v5e
HBM: 2x capacity
ICI bandwidth: 2x
Efficiency: 67% better energy per FLOP
Training: 4x faster
Inference: 3x throughput

Pricing (v5e): ~$1.20/hr per chip, 30-40% cheaper than H100 for compatible workloads.

TPUs require JAX or TensorFlow. PyTorch support exists but is less mature than native CUDA.
Key insight: TPUs excel at large-scale training where you can use Google’s software stack (JAX/TensorFlow). Google trains all its own models (Gemini, PaLM) on TPUs. The 3D torus interconnect gives TPU pods uniquely efficient all-to-all communication. But you’re locked into Google Cloud — you can’t buy TPUs for your own data center.
AWS Trainium & Inferentia
Amazon’s custom chips — designed for cost-effective training and inference on AWS
AWS Custom Silicon
Amazon designed its own AI chips to reduce dependence on NVIDIA and offer lower-cost options to AWS customers:

Trainium (training): Available in Trn1 instances. Up to 16 chips per instance. ~$1.34/hr per chip — significantly cheaper than H100 cloud pricing. 2x better power efficiency than A100. Requires the AWS Neuron SDK for model compilation.

Trainium2 (2024+): Next generation. AWS’s Project Rainier deployment uses ~500,000 Trainium2 chips across a 1,200-acre facility — one of the largest AI clusters ever built.

Inferentia2 (inference): Optimized for inference workloads. Lower cost per inference than GPU instances for supported models. Up to 190 TOPS INT8.

The trade-off: Trainium requires porting your code to the Neuron SDK. Not all models and operations are supported. The ecosystem is much smaller than CUDA.
AWS vs Others: Cost Comparison
Training 1B tokens (estimated):
AWS Trainium: ~$10,000
Google TPU v5e: ~$8,000
Azure H100: ~$15,000

AWS Trainium advantages:
✓ 2x power efficiency vs A100
✓ Tight AWS integration
✓ Lower per-chip cost
✓ Massive scale (Project Rainier)

AWS Trainium limitations:
✗ Neuron SDK required
✗ Not all ops supported
✗ Smaller community
✗ AWS-only (no on-prem)
✗ Debugging tools less mature

Best for: organizations already deep in AWS that want to reduce GPU costs for supported models.
Key insight: AWS Trainium is a bet on vertical integration — Amazon controls the chip, the cloud, the SDK, and the pricing. For organizations running large-scale training on AWS, the cost savings can be 30–50% vs H100 instances. But the smaller ecosystem means more engineering effort to port and optimize models.
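To sanity-check per-token training costs yourself, the standard approximation is about 6 FLOPs per parameter per training token. The function below is a hedged sketch: the 40% MFU (model FLOPs utilization) and the hypothetical 7B-model-on-1B-tokens run are illustrative assumptions, while the H100 FP16 rate and hourly prices come from this chapter.

```python
def training_cost_usd(params_b: float, tokens_b: float, chip_tflops: float,
                      mfu: float, usd_per_chip_hour: float) -> float:
    """Rough training cost from the ~6 FLOPs per parameter per token rule."""
    total_flops = 6 * (params_b * 1e9) * (tokens_b * 1e9)
    sustained_flops = chip_tflops * 1e12 * mfu   # FLOPs/s actually achieved
    chip_hours = total_flops / sustained_flops / 3600
    return chip_hours * usd_per_chip_hour

# Hypothetical run: 7B model, 1B tokens, 40% MFU, H100 FP16 (990 TFLOPS).
print(round(training_cost_usd(7, 1, 990, 0.40, 4.89)))  # AWS on-demand H100
print(round(training_cost_usd(7, 1, 990, 0.40, 2.00)))  # CoreWeave low end
```

The $8,000-15,000 figures in the table above imply a much larger model and/or lower utilization than this toy run; the value of the formula is that it forces those assumptions into the open before you compare vendors.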
Intel Gaudi 3: The Open Alternative
Standard Ethernet networking, open-source software, competitive inference performance
Gaudi 3 Architecture
Intel’s Gaudi 3 (developed by Habana Labs, which Intel acquired in 2019) takes a different approach to AI acceleration:

Memory: 128 GB HBM with 3,700 GB/s bandwidth. More than H100’s 80 GB, less than MI300X’s 192 GB.

Compute: 64 Tensor Processor Cores with 8 Matrix Math Engines. FP8 inference performance competitive with the H100.

Networking: 24x 200GbE standard Ethernet ports (RoCE). This is the key differentiator — no proprietary InfiniBand required. 33% more I/O throughput than H100.

Software: Open-source stack. Migration from PyTorch requires ~3 lines of code changes (using Intel’s Habana bridge). No CUDA licensing fees.

Power: 900W TDP. Between H100 (700W) and B200 (1,000W).
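The networking spec above translates into aggregate scale-out bandwidth as follows. Note that this Ethernet figure covers scale-out traffic between accelerators and is not directly comparable to NVLink's scale-up, intra-node numbers.

```python
# Aggregate scale-out bandwidth from Gaudi 3's 24x 200GbE RoCE ports.
ports, gbits_per_port = 24, 200
total_gbits = ports * gbits_per_port   # combined line rate in Gb/s
total_gbytes = total_gbits / 8         # same figure in GB/s per accelerator
print(total_gbits, "Gb/s =", total_gbytes, "GB/s")
```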
Gaudi 3 Inference Performance
Llama 3.1 inference (FP8):
8B model (1 accelerator), 128/128 tokens: 24,535 tok/s
70B model (8 accelerators), 128/2048 tokens: 21,448 tok/s

Gaudi 3 vs H100:
Inference performance: competitive
Memory: 128 GB vs 80 GB
Networking: standard Ethernet
Software: open-source
Price: ~30% lower

Key advantages: no InfiniBand lock-in, no CUDA licensing, and standard Ethernet means cheaper networking infrastructure. Gaudi 3 is positioned as the "open" alternative to NVIDIA's proprietary stack.
Key insight: Gaudi 3’s use of standard Ethernet instead of InfiniBand is strategically important. InfiniBand switches and cables are expensive and controlled by NVIDIA (via Mellanox). Gaudi’s Ethernet approach means you can use commodity networking gear, potentially saving 20–30% on cluster networking costs.
The Big Comparison
All major accelerators side by side — specs, price, and trade-offs
Hardware Specs Comparison
Chip      HBM (GB)  BW (GB/s)  FP8 (TFLOPS)  TDP (W)
H100      80        3,350      1,979         700
H200      141       4,800      1,979         700
B200      192       8,000      ~4,500        1,000
MI300X    192       5,300      2,610         750
MI350X    288       8,000      4,600         1,000
Gaudi 3   128       3,700      ~1,800        900
TPU v5p   95        2,765      459*          ~400

* TPU v5p figure is BF16/FP8 combined; not directly comparable to GPU FP8 TFLOPS.
Cloud Pricing (per GPU-hour)
On-demand pricing (approx.):
H100 (AWS): $4.89-6.88/hr
H100 (Azure): $5.50-12.84/hr
H100 (GCP): $3.00-4.00/hr
H100 (CoreWeave): $2.00-4.25/hr
TPU v5e (GCP): $1.20/hr
Trainium (AWS): $1.34/hr
Gaudi 3 (AWS): ~$3.50/hr

Spot/preemptible: 50-75% off on-demand
Reserved (1-3 year): 30-50% off on-demand

Hidden costs (storage, egress, networking) add 20-40% to advertised GPU-hour rates.
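The line items above combine multiplicatively. A hedged sketch of an effective bill, where the 30% hidden-cost overhead (from the 20-40% range above) and the example cluster size are illustrative assumptions:

```python
def effective_cluster_cost(gpus: int, hours: float, rate_per_gpu_hr: float,
                           hidden_overhead: float = 0.30,
                           discount: float = 0.0) -> float:
    """Estimate a run's bill: advertised GPU-hour rate, an assumed hidden-cost
    overhead (storage/egress/networking), and any reserved/spot discount."""
    base = gpus * hours * rate_per_gpu_hr * (1 - discount)
    return base * (1 + hidden_overhead)

# Hypothetical run: 64 H100s for 100 hours at $4.89/hr AWS on-demand.
print(round(effective_cluster_cost(64, 100, 4.89)))                  # on-demand
print(round(effective_cluster_cost(64, 100, 4.89, discount=0.40)))   # reserved
```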
Key insight: Raw TFLOPS don’t tell the whole story. Real-world performance depends on software optimization, memory capacity (can the model fit?), memory bandwidth (can you feed the cores?), and interconnect speed (can GPUs communicate fast enough?). A chip with lower TFLOPS but more memory might be faster for your specific workload.
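The "can the model fit?" question can be made concrete: weights plus KV cache must fit in HBM. In the sketch below, the 70B-style shape (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and the 10% headroom are illustrative assumptions, not a specific model's published config.

```python
def fits_in_memory(params_b: float, bytes_per_param: int, n_layers: int,
                   n_kv_heads: int, head_dim: int, seq_len: int, batch: int,
                   kv_bytes: int, hbm_gb: float, headroom: float = 0.9) -> bool:
    """Check whether weights + KV cache fit in one accelerator's HBM.
    KV cache = 2 (K and V) * layers * kv_heads * head_dim * seq * batch * bytes."""
    weights = params_b * 1e9 * bytes_per_param
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * kv_bytes
    return (weights + kv_cache) / 1e9 <= hbm_gb * headroom

# Hypothetical 70B-class FP16 model, one 8K-token stream, FP16 KV cache.
print(fits_in_memory(70, 2, 80, 8, 128, 8192, 1, 2, 192))  # MI300X-sized HBM
print(fits_in_memory(70, 2, 80, 8, 128, 8192, 1, 2, 80))   # H100-sized HBM
```

At an 8K context the KV cache adds only a few GB for one stream, but it grows linearly with batch size and sequence length, which is what pushes serving fleets toward high-HBM parts.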
The Ecosystem Factor
Hardware specs matter less than software support and community
Software Ecosystem Maturity
The accelerator you choose determines your entire software stack. This is the most underappreciated factor in hardware selection:

NVIDIA CUDA: Universal support. PyTorch, TensorFlow, JAX, every inference engine (vLLM, TGI, TensorRT-LLM), every training framework (DeepSpeed, Megatron, FSDP). 15+ years of optimization. Millions of developers. If something works in AI, it works on CUDA first.

AMD ROCm: Growing fast. PyTorch has good support. vLLM works. But many niche libraries, custom CUDA kernels, and cutting-edge research code don’t support ROCm yet. Debugging tools are less mature.

Google TPU (JAX/XLA): Excellent for JAX-native code. Google’s own models are TPU-optimized. But PyTorch on TPU (via torch_xla) has friction.

AWS Neuron: Smallest ecosystem. Requires explicit model compilation. Limited operator support.
Framework Support Matrix
NVIDIA CUDA
PyTorch: Native, first-class
JAX: Full support
vLLM: Primary target
DeepSpeed: Full support
FlashAttention: Native
Custom kernels: Universal
Everyone Else
AMD ROCm: PyTorch good, gaps in niche libs
TPU/JAX: JAX excellent, PyTorch friction
Trainium: Limited ops, Neuron SDK required
Gaudi: PyTorch bridge, growing support
Custom kernels: Must be rewritten
Key insight: Choosing an accelerator is like choosing a programming language — the ecosystem matters more than the syntax. NVIDIA’s CUDA ecosystem is the “English” of AI compute: not necessarily the best in every dimension, but everyone speaks it. Switching away means rewriting kernels, retraining engineers, and accepting that some tools won’t work.
When to Pick What
A practical decision framework based on your workload, budget, and constraints
Decision Framework
Pick NVIDIA H100/B200 when:
✓ Training large models (>7B)
✓ You need cutting-edge frameworks
✓ Team knows CUDA
✓ Multi-GPU training is required
✓ Budget allows premium pricing

Pick AMD MI300X/MI350X when:
✓ Inference-heavy workloads
✓ You need maximum memory per GPU
✓ Cost-sensitive deployment
✓ Willing to work with ROCm
✓ Single-GPU serving of large models

Pick Google TPU when:
✓ Already on Google Cloud
✓ Team uses JAX/TensorFlow
✓ Large-scale training
✓ You want the best price/performance
✓ No on-prem requirement

Pick AWS Trainium when:
✓ Deep in the AWS ecosystem
✓ Cost is the primary concern
✓ Your model architectures are supported
✓ Large-scale training on AWS
The Practical Reality
For most organizations today: NVIDIA is the safe choice. The ecosystem advantage is real and significant. You’ll spend less time debugging, find more tutorials, and hire engineers who already know the stack.

For cost-optimized inference: AMD MI300X or Google TPU can save 20–40% with competitive performance. If your workload is inference-heavy and you have engineering capacity to handle the software differences, the savings are meaningful.

For maximum scale: Google TPU pods and AWS Trainium clusters offer the largest single-tenant deployments. Project Rainier (500K Trainium2 chips) and Google’s TPU pods (100K+ chips on Jupiter fabric) are unmatched in scale.

The trend: Competition is increasing. AMD’s ROCm is improving rapidly. Google’s TPUs keep getting better. The NVIDIA premium will likely shrink over the next 2–3 years, but CUDA’s ecosystem advantage will persist.
Key insight: The accelerator market is evolving from “NVIDIA or nothing” to a multi-vendor landscape. But switching costs are high. The best strategy for most teams: start with NVIDIA for flexibility, evaluate alternatives for specific high-volume workloads where cost savings justify the engineering investment.