Ch 12 — Cloud vs On-Prem: The Build-or-Rent Decision

TCO analysis, cloud providers, pricing models, hidden costs, and hybrid strategies
The Fundamental Trade-off
Flexibility and speed vs cost and control
Cloud Advantages
Speed to deploy: Spin up 100 GPUs in minutes. No hardware procurement (3–6 month lead times), no data center build-out (12–24 months), no hiring infrastructure engineers.

Elasticity: Scale up for training runs, scale down when done. Pay only for what you use. A 2-week training run on 1,000 GPUs doesn’t require owning 1,000 GPUs year-round.

Managed services: Networking, storage, monitoring, security — all handled by the provider. Your team focuses on ML, not infrastructure.

Latest hardware: Access to newest GPUs (H200, B200) without capital commitment. Upgrade by changing an instance type, not replacing hardware.
On-Prem Advantages
Cost at scale: Self-hosted inference can be 18× cheaper than cloud APIs over 3 years (Lenovo 2026 analysis). Break-even arrives in 3–6 months of production usage.

Control: Full control over hardware configuration, network topology, security policies, and data residency. No vendor lock-in.

Data sovereignty: Regulated industries (healthcare, finance, government) may require data to stay on-premises. GDPR, HIPAA, and sector-specific regulations often mandate physical data control.

Predictable costs: After initial CAPEX, marginal cost per inference approaches zero — dominated only by electricity and maintenance. No surprise bills from egress fees or API rate changes.
Key insight: Cloud vs on-prem is like renting vs buying a house. Renting is flexible — move anytime, no maintenance headaches. Buying is cheaper long-term but requires a down payment, maintenance, and commitment. Most organizations should start renting (cloud) and buy (on-prem) only when they know exactly what they need.
Cloud Provider Landscape
Hyperscalers, GPU clouds, and marketplaces — each with different strengths
Hyperscalers
AWS — Largest GPU fleet. P5 instances (H100), P5e (H200), custom Trainium chips. Best ecosystem (SageMaker, EKS, S3). Highest on-demand prices but deepest reserved discounts. InfiniBand networking via EFA.

Google Cloud (GCP) — TPU v5p/Trillium for training, H100/A3 instances for inference. Best for JAX/TensorFlow workloads. Competitive pricing. Strong Kubernetes (GKE) integration.

Microsoft Azure — Tight OpenAI partnership. NC H100 v5 instances. Azure ML managed service. Strong enterprise integration (Active Directory, compliance). ND H200 v5 for latest hardware.
GPU Clouds & Marketplaces
CoreWeave — GPU-native cloud. H100 at $4.25–4.76/hr. InfiniBand networking. Kubernetes-first. Strong for training workloads. Backed by NVIDIA.

Lambda Labs — Developer-friendly. H100 at $2.99/hr. Simple pricing, no hidden fees. Good for research and small teams.

Vast.ai / RunPod — GPU marketplaces. H100 from $1.49–1.99/hr. Community GPUs (less reliable) and secure cloud options. Best for cost-sensitive batch workloads.

Cudo Compute — Distributed GPU marketplace. H100 at ~$1.80/hr. Aggregates capacity from multiple providers.
H100 Pricing Comparison (2026)
Provider             On-Demand    Reserved (1yr)   Spot
──────────────────────────────────────────────────────
AWS (p5.48xlarge)    $6.88/hr*    ~$4.50/hr        ~$2.10/hr
Azure (NC H100)      $6.98/hr*    ~$4.20/hr        ~$2.80/hr
GCP (A3 High)        ~$3.50/hr    ~$2.50/hr        ~$1.40/hr
CoreWeave            $4.76/hr     ~$3.50/hr        N/A
Lambda Labs          $2.99/hr     N/A              N/A
Vast.ai              $1.49/hr     N/A              N/A
RunPod               $1.99/hr     N/A              N/A

* Per-GPU price (8-GPU node price / 8)

Price range: 4.7× ($1.49 to $6.98). Same H100 hardware, vastly different prices. The premium buys reliability, support, managed services, compliance, and networking.
Key insight: Cloud GPU pricing is like airline tickets — the same seat costs 4.7× more depending on where you buy it. Hyperscalers charge a premium for reliability, ecosystem, and enterprise support. GPU marketplaces offer bare-metal prices but less hand-holding. Choose based on what your team needs, not just the hourly rate.
Pricing Models: On-Demand, Reserved & Spot
How to pay 30–70% less for the same GPUs
On-Demand
Pay by the second/hour with no commitment. Full price, full flexibility. Spin up and tear down instantly.

Best for: Experimentation, prototyping, unpredictable workloads, short training runs (<1 week).

Cost: Highest per-hour rate. An 8× H100 node on AWS: ~$55/hr ($40K/month if running 24/7).
Reserved Instances
Commit to 1–3 years of usage in exchange for 30–50% discount. Pay whether you use the GPUs or not.

Best for: Production inference (steady-state traffic), long training campaigns, known capacity needs.

Risk: If your needs change (different GPU, less capacity), you’re locked in. Some providers allow resale of unused reservations.

AWS 3-year reserved: ~$2.97/hr per H100 (57% off on-demand). Requires upfront payment or monthly commitment.
Spot/Preemptible Instances
Use spare cloud capacity at 60–80% discount. The catch: your instance can be terminated with 30–120 seconds notice when demand rises.

Best for: Fault-tolerant training (with checkpointing), batch inference, hyperparameter sweeps, data preprocessing.

Not suitable for: Production inference (SLA requirements), interactive workloads, jobs that can’t checkpoint quickly.
Pricing Strategy by Workload
Workload                 Best Pricing      Savings
──────────────────────────────────────────────────
Experimentation          On-demand         0% (baseline)
Training (<1 week)       On-demand/Spot    0–70%
Training (>1 month)      Reserved          30–50%
Prod inference           Reserved          30–50%
Batch inference          Spot              60–80%
Hyperparameter search    Spot              60–80%

Optimization strategies:
• Per-second billing: saves 15–30% for bursty work
• Right-sizing: saves 10–20% (don't over-provision)
• Scheduling: saves 10–15% (off-peak hours)
• Combined: saves 30–60% total

# Real example: $100K/month cloud GPU bill
# After optimization: $40-60K/month
# Savings: $480K-720K/year
Key insight: Cloud GPU pricing is like electricity rates — there’s a peak rate, an off-peak rate, and a long-term contract rate. Nobody pays peak rate for everything. The teams that save 50%+ aren’t using different hardware — they’re using different pricing models for different workloads.
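The strategy table above can be turned into a quick estimator. A minimal sketch, assuming illustrative midpoint discounts (not provider quotes) and a hypothetical workload mix:

```python
# Sketch: estimate a blended GPU bill from per-workload pricing models.
# Discount rates are illustrative midpoints of the ranges above, not quotes.

DISCOUNTS = {
    "on_demand": 0.0,   # baseline
    "reserved": 0.40,   # 30-50% off on-demand
    "spot": 0.70,       # 60-80% off on-demand
}

def blended_cost(monthly_on_demand_usd: float, mix: dict) -> float:
    """mix maps pricing model -> fraction of spend shifted to that model."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9, "mix must sum to 1"
    return sum(
        monthly_on_demand_usd * share * (1 - DISCOUNTS[model])
        for model, share in mix.items()
    )

# A $100K/month on-demand bill, re-mixed across pricing models:
mix = {"on_demand": 0.2, "reserved": 0.5, "spot": 0.3}
print(f"${blended_cost(100_000, mix):,.0f}/month")  # $20K + $30K + $9K = $59,000/month
```

Moving half the spend to reserved and a third to spot lands in the $40–60K/month range quoted above; the exact figure depends entirely on how much of the workload tolerates preemption.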
Hidden Costs: What the Hourly Rate Doesn’t Tell You
Egress fees, storage, networking, and the 20–40% surcharge nobody budgets for
Data Egress
Cloud providers charge for data leaving their network. AWS: $0.09/GB for inter-region or internet egress. Moving a 15 TB training dataset out of AWS costs $1,350. Downloading model checkpoints regularly adds up fast.

This creates vendor lock-in: the more data you store in a cloud, the more expensive it is to leave. Some providers (Cloudflare R2, GCP) offer free or reduced egress to compete.
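Egress math is simple but worth automating before a migration. A minimal sketch assuming a flat per-GB rate (real AWS internet egress is tiered by volume; $0.09/GB is the headline rate used above):

```python
# Sketch: egress cost at a flat per-GB rate. Real AWS internet egress
# is tiered by volume; $0.09/GB is the headline rate used above.

def egress_cost(terabytes: float, usd_per_gb: float = 0.09) -> float:
    return terabytes * 1_000 * usd_per_gb

print(f"${egress_cost(15):,.0f}")  # 15 TB out of AWS: $1,350
```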
Storage Costs
Instance storage: Often included but ephemeral (lost on termination). Not suitable for persistent data.

Block storage (EBS): $0.08–0.16/GB-month for SSD. An 8 TB volume for model weights: $640–1,280/month.

Object storage (S3): $0.023/GB-month. Cheap for archival but slow for training.

File systems (FSx): $0.14/GB-month for Lustre. A 10 TB training volume: $1,400/month.

Storage costs often surprise teams: a 1,000-GPU training run might need 50+ TB of fast storage, costing $7,000+/month on top of GPU costs.
The Full Cost Picture
Advertised: 8× H100 on AWS = $55/hr

Actual monthly cost (24/7 usage):
  GPU instances: $55 × 730 = $40,150
  EBS storage (8 TB):          $1,280
  FSx Lustre (10 TB):          $1,400
  S3 storage (50 TB):          $1,150
  Data transfer:                 $500
  Networking (EFA):          Included
  CloudWatch/logs:               $200
  Load balancer:                 $150
  Total:                      $44,830
  Hidden overhead: ~12%

At larger scale (64× H100):
  GPU instances:             $321,200
  Storage + network:   $25,000–40,000
  Hidden overhead: ~8–12%

Engineering overhead (not on the bill):
  DevOps/MLOps FTE:      $150-250K/yr
  Monitoring tools:         $5-20K/yr
  Security/compliance:     $10-50K/yr
Key insight: Cloud GPU pricing is like budget airline tickets. The base fare looks great, but then you add baggage (storage), seat selection (networking), food (egress), and priority boarding (support). The final bill is 20–40% higher than the advertised rate. Always budget for the total cost, not just the GPU hourly rate.
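The worked example above folds into a small estimator. A minimal sketch reusing this section's line items (rates are illustrative, not current list prices):

```python
# Sketch: monthly cloud bill beyond the GPU hourly rate.
# Line items mirror the worked example above; treat rates as illustrative.

HOURS_PER_MONTH = 730

def monthly_cloud_cost(gpu_node_hr: float, storage: dict, misc: dict):
    """Return (total monthly cost, hidden overhead as % of GPU spend)."""
    gpu = gpu_node_hr * HOURS_PER_MONTH
    extras = sum(storage.values()) + sum(misc.values())
    return gpu + extras, 100 * extras / gpu

storage = {"ebs_8tb": 1_280, "fsx_10tb": 1_400, "s3_50tb": 1_150}
misc = {"egress": 500, "cloudwatch": 200, "load_balancer": 150}

total, overhead = monthly_cloud_cost(55, storage, misc)
print(f"${total:,.0f}/month, hidden overhead ~{overhead:.0f}%")
# $44,830/month, hidden overhead ~12%
```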
On-Prem Economics: Building Your Own
Hardware costs, colocation, staffing, and the true cost of ownership
Hardware Costs
8-GPU H100 node (DGX H100):
  GPUs (8× H100 SXM):     ~$200,000
  CPU, RAM, NVMe:          ~$30,000
  Networking (IB NDR):     ~$25,000
  Chassis, PSU, cooling:   ~$28,000
  Total per node:         ~$283,000

16-GPU cluster (2 nodes):
  2× DGX H100:             $566,000
  InfiniBand switch:        $15,000
  ToR Ethernet switch:       $8,000
  Storage (NAS, 50 TB):     $25,000
  Cabling + misc:            $5,000
  Total CAPEX:            ~$619,000

128-GPU cluster (16 nodes):
  16× DGX H100:           $4,528,000
  IB fabric (leaf-spine):   $180,000
  Storage (Lustre, 200 TB): $200,000
  Total CAPEX:           ~$4,908,000
Operating Costs (Annual)
8-GPU cluster (colocation):
  Colocation (10 kW):         $24,000/yr
  Electricity ($0.10/kWh):     $8,760/yr
  Network (1 Gbps):            $6,000/yr
  Support contract:           $15,000/yr
  Part-time admin (0.2 FTE):  $40,000/yr
  Total OPEX:                ~$93,760/yr

128-GPU cluster (colocation):
  Colocation (200 kW):       $480,000/yr
  Electricity:               $175,200/yr
  Network (100 Gbps):         $60,000/yr
  Support contracts:         $120,000/yr
  Staff (2 FTEs):            $400,000/yr
  Spare parts/repairs:        $50,000/yr
  Total OPEX:             ~$1,285,200/yr

# Colocation + electricity is ~35-50% of OPEX
# Staff is ~30-45% of OPEX
Key insight: On-prem hardware is a depreciating asset. GPUs lose 30–50% of their value per year as new generations launch. An H100 bought for $25K in 2023 is worth ~$10K in 2025 with B200 available. Factor depreciation into your TCO — a 3-year amortization is standard, but the GPU may be obsolete in 2.
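The depreciation effect is easy to model. A minimal sketch assuming a flat annual depreciation rate (real resale values move in steps as new generations launch, so treat this as a bound):

```python
# Sketch: GPU resale value under flat annual depreciation.
# The 30-50%/yr range comes from the text; real markets move in
# steps around each new GPU generation.

def resale_value(purchase_usd: float, annual_depreciation: float,
                 years: float) -> float:
    return purchase_usd * (1 - annual_depreciation) ** years

# H100 bought for $25K, 37% annual depreciation, after 2 years:
print(f"${resale_value(25_000, 0.37, 2):,.0f}")  # ~$9.9K, near the ~$10K above
```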
TCO Comparison: Real Numbers
Side-by-side analysis for training and inference workloads
Inference TCO (3-Year, 8× H100)
Cloud (AWS Reserved, 3yr):
  GPU: 8 × $2.97/hr × 8,760 × 3 = $625,450
  Storage + network (12%):          $75,054
  Total 3-year:                    $700,504
  Monthly:                          $19,458

On-Prem (colocation):
  Hardware CAPEX:                  $283,000
  OPEX (3 years):                  $281,280
  Total 3-year:                    $564,280
  Monthly:                          $15,674

On-Prem savings: 19% ($136K over 3yr)

# But on-prem requires:
# - $283K upfront capital
# - 3-6 month procurement lead time
# - Staff to manage hardware
# - Risk of GPU depreciation
Training TCO (Single 2-Week Run, 128 GPUs)
Cloud (AWS On-Demand, 2 weeks):
  128 × $6.88/hr × 336 hrs =   $295,895
  Storage + network:            $15,000
  Total:                       $310,895

Cloud (Spot, 2 weeks):
  128 × $2.10/hr × 336 hrs =    $90,317
  Checkpoint overhead (~20%):   $18,063
  Storage + network:            $15,000
  Total:                       $123,380

On-Prem (if you already own it):
  Electricity only:              $5,400
  Marginal cost:                 $5,400

On-Prem (amortized CAPEX + OPEX):
  2-week share of annual cost: $237,800
  (Only makes sense if the cluster runs year-round)

Verdict: Cloud wins for occasional training. On-prem wins if the cluster is utilized >60%.
Key insight: The break-even between cloud and on-prem depends almost entirely on utilization. At 100% utilization, on-prem is 40–60% cheaper. At 30% utilization, cloud is cheaper because you only pay for what you use. The question isn’t “which is cheaper?” but “how consistently will you use the GPUs?”
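The utilization break-even can be computed directly: on-prem costs a fixed annual amount whether the GPUs are busy or idle, while cloud scales with GPU-hours actually used. A minimal sketch using this chapter's 8-GPU numbers (all rates are assumptions carried over from earlier sections):

```python
# Sketch: utilization at which on-prem and cloud cost the same per year.
# On-prem: fixed annual cost (amortized CAPEX + OPEX), busy or idle.
# Cloud: pay only for GPU-hours actually used.

HOURS_PER_YEAR = 8_760

def breakeven_utilization(capex: float, amort_years: float,
                          annual_opex: float, cloud_gpu_hr: float,
                          n_gpus: int) -> float:
    annual_onprem = capex / amort_years + annual_opex
    annual_cloud_at_full_use = cloud_gpu_hr * n_gpus * HOURS_PER_YEAR
    return annual_onprem / annual_cloud_at_full_use

# 8x H100 node ($283K CAPEX over 3 yrs, $93,760/yr OPEX) vs AWS rates:
for label, rate in [("on-demand $6.88/hr", 6.88),
                    ("3-yr reserved $2.97/hr", 2.97)]:
    u = breakeven_utilization(283_000, 3, 93_760, rate, 8)
    print(f"vs {label}: break-even at ~{u:.0%} utilization")
```

Against on-demand rates the break-even comes early (~39% utilization); against deep reserved discounts it only arrives near full utilization (~90%). The 60–70% rule of thumb used elsewhere in this chapter sits between those bounds.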
The Hybrid Approach
Own your baseline, rent your peaks
Hybrid Strategy
Most mature AI organizations use a hybrid approach that combines the cost efficiency of on-prem with the flexibility of cloud:

Own the baseline: Purchase enough GPUs to handle your steady-state workload (the minimum you’ll always need). This is your inference fleet, your always-on training capacity.

Rent the peaks: Use cloud GPUs for burst training runs, experimentation, new model evaluations, and traffic spikes. Pay on-demand or spot rates for temporary capacity.

Cloud for dev, on-prem for prod: Develop and iterate on cloud (fast, flexible). Deploy proven models to on-prem (cheap, controlled). This separates the “exploration” budget from the “exploitation” budget.
Hybrid Architecture Example
Scenario: AI company with 3 production models and active research.

On-prem (colocation): 64 GPUs for production inference. Runs 24/7 at 70%+ utilization. Handles 90% of inference traffic.

Cloud (reserved): 16 GPUs on GCP for overflow inference and A/B testing new models. Handles traffic spikes and canary deployments.

Cloud (spot): 128–512 GPUs on AWS spot for training runs. Spun up for 1–4 weeks, then terminated. Checkpoints saved to S3.
Hybrid Cost Savings
All-Cloud approach (annual):
  Inference (64 GPU, reserved):  $1,400K
  Training (burst, on-demand):     $600K
  Storage + egress:                $200K
  Total:                         $2,200K

Hybrid approach (annual):
  On-prem inference (64 GPU):      $750K
    (CAPEX amortized + OPEX)
  Cloud training (spot):           $200K
  Cloud overflow (16 GPU, res):    $220K
  Storage + egress:                 $80K
  Total:                         $1,250K

Savings: $950K/yr (43%)

# The hybrid approach requires more engineering
# effort (multi-cloud orchestration, data sync,
# deployment pipelines) but the cost savings
# justify 1-2 additional engineers.
Key insight: Hybrid infrastructure is like owning a car and occasionally renting a truck. You drive your car daily (on-prem inference) because it’s cheaper per mile. When you need to move furniture (training run), you rent a truck for the weekend. Owning a truck for occasional use would be wasteful; renting a car for daily commuting would be expensive.
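The car-and-truck intuition can be sketched as a cost function over an hourly demand profile. A minimal sketch: owned GPUs cost their amortized rate whether busy or idle, and overflow rents cloud capacity on demand. The ~$2.68/GPU-hr owned rate is derived from this chapter's 8-GPU CAPEX + OPEX figures; the demand profile is hypothetical:

```python
# Sketch: cost of serving a fluctuating GPU demand profile.
# Owned GPUs cost their amortized hourly rate whether busy or idle;
# overflow rents cloud capacity at on-demand rates. The ~$2.68/GPU-hr
# owned rate is derived from this chapter's 8-GPU CAPEX + OPEX figures.

def hybrid_cost(demand: list, owned_gpus: int,
                owned_hr: float, cloud_hr: float) -> float:
    owned = owned_gpus * owned_hr * len(demand)
    burst = sum(max(0, d - owned_gpus) * cloud_hr for d in demand)
    return owned + burst

# 24h of demand: steady 60 GPUs with an 8-hour peak of 120.
demand = [60] * 16 + [120] * 8
all_cloud = hybrid_cost(demand, 0, 0.0, 6.88)
hybrid = hybrid_cost(demand, 64, 2.68, 6.88)
print(f"all-cloud ${all_cloud:,.0f}/day vs hybrid ${hybrid:,.0f}/day")
# all-cloud $13,210/day vs hybrid $7,199/day (~45% cheaper)
```

Sizing the owned fleet just above the steady-state (64 vs 60) keeps it near-fully utilized while the peak spills to cloud, which is exactly the own-the-baseline, rent-the-peaks strategy.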
Decision Framework: When to Use What
A practical guide for choosing cloud, on-prem, or hybrid
Decision Tree
Start with cloud if:
• You’re in the experimentation phase
• Workloads are unpredictable or bursty
• Team is <5 ML engineers (no infra capacity)
• Budget is OPEX-only (no CAPEX approval)
• Need latest hardware immediately

Move to on-prem when:
• GPU utilization consistently exceeds 60–70%
• Monthly cloud bill exceeds $50K+ for 6+ months
• Data sovereignty requirements mandate it
• You have (or can hire) infrastructure expertise
• Workloads are stable and predictable

Go hybrid when:
• Steady-state + burst workloads coexist
• Different workloads have different requirements
• You want cost optimization without full commitment
• Team has both ML and infra expertise
Quick Reference by Company Stage
Stage        GPUs     Strategy        Monthly Cost
──────────────────────────────────────────────────
Startup/POC  1-8      Cloud OD        $2-10K
Growth       8-64     Cloud Res       $10-80K
Scale        64-256   Hybrid          $50-200K
Enterprise   256+     On-prem+Cloud   $200K+

Key metrics to track:
• GPU utilization: target >70% for on-prem
• Cost per token: compare cloud vs self-hosted
• Time to deploy: cloud in minutes, on-prem in months
• Engineering cost: $150-250K/FTE/yr

Red flags (time to reconsider):
• Cloud bill growing >20%/month for 3+ months
• GPU utilization <30% on on-prem
• Spending >$100K/month on cloud with stable load
• On-prem hardware >2 generations behind
Key insight: The best infrastructure strategy evolves with your organization. Start on cloud (learn fast, fail cheap), migrate steady workloads to on-prem (optimize cost), and keep burst capacity on cloud (stay flexible). The companies that get this wrong either overspend on cloud or over-commit to on-prem too early. Review your strategy every 6 months.
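The decision tree above can be encoded as a small helper. This is a deliberately crude sketch: the thresholds (60% utilization, $50K/month for 6+ months) come from this section, but collapsing the decision to five inputs is a simplification, not a substitute for judgment:

```python
# Sketch: rule-of-thumb strategy picker encoding the tree above.
# Thresholds (60% utilization, $50K/month for 6+ months) are the
# chapter's; everything else is a deliberate simplification.

def recommend(utilization: float, monthly_cloud_usd: float,
              months_at_spend: int, has_infra_team: bool,
              bursty: bool) -> str:
    steady_and_heavy = (utilization >= 0.60
                        and monthly_cloud_usd >= 50_000
                        and months_at_spend >= 6)
    if steady_and_heavy and has_infra_team:
        return "hybrid" if bursty else "on-prem"
    return "cloud"  # default: learn fast, fail cheap

print(recommend(0.75, 80_000, 9, True, True))   # hybrid
print(recommend(0.20, 5_000, 2, False, True))   # cloud
```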