Ch 13 — Orchestration & Scheduling: Managing GPU Fleets

Kubernetes, Slurm, Ray, GPU utilization, multi-tenancy, and autoscaling
Why Orchestration Matters
GPUs are too expensive to waste — scheduling is the difference between 30% and 90% utilization
The Utilization Crisis
Industry-average GPU utilization is 30–50%. That means on a 1,000-GPU cluster costing $21.9M/year, $11–15M of compute sits idle. The GPUs aren’t broken — they’re waiting. Waiting for jobs to be scheduled, for data to load, for other jobs to finish, for someone to notice they’re free.

Well-managed clusters achieve 85–95% utilization. The difference is orchestration: the software that decides which job runs on which GPU, when, and for how long.

The scheduling problem: Multiple teams want GPUs simultaneously. Training jobs need many GPUs for days. Inference needs few GPUs but with strict latency SLAs. Experimentation needs GPUs for hours. How do you allocate a shared, expensive resource fairly and efficiently?
What an Orchestrator Does
Core responsibilities:
1. Resource allocation: Match GPU requests to available hardware; respect topology (NVLink, same-node, same-rack)
2. Job scheduling: Queue, prioritize, preempt, and gang-schedule; ensure fairness across teams/users
3. Lifecycle management: Start, monitor, restart, and clean up jobs; handle failures (GPU errors, OOM, timeouts)
4. Multi-tenancy: Isolate workloads (security, resource limits); manage quotas per team/project
5. Autoscaling: Add/remove nodes based on demand; scale inference replicas on traffic

Cost of bad scheduling (1,000 GPUs):
30% utilization: $15.3M/yr wasted
85% utilization: $3.3M/yr wasted
Difference: $12M/yr
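The waste figures above follow directly from the fleet cost quoted earlier ($21.9M/yr for 1,000 GPUs); a short calculation reproduces them:

```python
def wasted_cost(fleet_cost_per_year: float, utilization: float) -> float:
    """Annual spend on idle GPU time: the fraction of the fleet not doing useful work."""
    return fleet_cost_per_year * (1.0 - utilization)

FLEET_COST = 21_900_000  # 1,000-GPU cluster at $21.9M/yr (figure from the text)

low = wasted_cost(FLEET_COST, 0.30)   # waste at 30% utilization
high = wasted_cost(FLEET_COST, 0.85)  # waste at 85% utilization
print(f"30% util wastes ${low/1e6:.1f}M/yr; 85% wastes ${high/1e6:.1f}M/yr; "
      f"orchestration is worth ${(low - high)/1e6:.1f}M/yr")
```

Running this yields roughly $15.3M, $3.3M, and a $12M/yr difference, matching the numbers in the table.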
Key insight: GPU orchestration is like air traffic control. Without it, planes (jobs) circle the runway (wait in queue) while gates (GPUs) sit empty. A good controller lands planes faster, uses every gate, and prevents collisions. The difference between a busy airport and an efficient one isn’t more runways — it’s better scheduling.
Kubernetes for GPU Workloads
The cloud-native approach to GPU orchestration
How K8s Manages GPUs
Kubernetes treats GPUs as extended resources via the NVIDIA Device Plugin. The plugin discovers GPUs on each node and registers them with the kubelet. Pods request GPUs in their resource spec, and the scheduler places them on nodes with available GPUs.

Key components:
NVIDIA Device Plugin: Exposes GPUs as `nvidia.com/gpu` resources
NVIDIA GPU Operator: Automates driver, toolkit, and plugin installation
Node selectors/affinity: Target specific GPU types (H100 vs A100)
Topology-aware scheduling: Place multi-GPU jobs on the same node for NVLink
MIG support: Partition A100/H100 into smaller GPU instances
K8s GPU Pod Example
# Request 4 GPUs for a training job
apiVersion: batch/v1
kind: Job
metadata:
  name: llama-finetune
spec:
  template:
    spec:
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.03-py3
        resources:
          limits:
            nvidia.com/gpu: 4
            memory: "256Gi"
      nodeSelector:
        gpu-type: "h100"
      restartPolicy: OnFailure
K8s Strengths & Limitations for AI
Strengths:
• Ecosystem: Helm charts, operators, monitoring (Prometheus/Grafana)
• Multi-tenancy: Namespaces, RBAC, resource quotas
• Autoscaling: HPA for inference, Karpenter for node provisioning
• Portability: Same manifests across cloud providers
• Service mesh: Load balancing, traffic routing for inference

Limitations:
• No native gang scheduling (need KAI Scheduler or Volcano)
• GPU topology-unaware by default (need custom schedulers)
• No built-in job queuing with fair-share policies
• Complex networking for multi-node training (MPI, NCCL)
• Overhead: etcd, API server, controller manager add latency
KAI Scheduler (NVIDIA, 2025)
NVIDIA’s KAI Scheduler adds AI-native scheduling to Kubernetes:

Gang scheduling: All pods in a training job start together or not at all
GPU sharing: Multiple small jobs on one GPU
Hierarchical queues: Fair-share with DRF (Dominant Resource Fairness)
Topology-aware: Respects NVLink, NVSwitch, and rack locality
Preemption: Low-priority jobs yield to high-priority training
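Gang scheduling, the first feature above, can be illustrated with a toy admission check (a simplified sketch for intuition, not the KAI Scheduler's actual algorithm): a job's workers are placed only if all of them fit at once, which prevents a half-placed job from holding GPUs while deadlocked waiting for the rest.

```python
def gang_schedule(free_gpus_per_node: dict, job_workers: int, gpus_per_worker: int):
    """All-or-nothing placement: return a worker -> node plan only if the
    whole gang fits; otherwise place nothing."""
    plan, free = [], dict(free_gpus_per_node)
    for _ in range(job_workers):
        # Greedily pick the node with the most free GPUs for this worker.
        node = max(free, key=free.get)
        if free[node] < gpus_per_worker:
            return None  # the gang can't fit -> admit nothing
        free[node] -= gpus_per_worker
        plan.append(node)
    return plan

cluster = {"node-a": 8, "node-b": 4}
print(gang_schedule(cluster, job_workers=3, gpus_per_worker=4))  # 12 GPUs needed: fits
print(gang_schedule(cluster, job_workers=4, gpus_per_worker=4))  # 16 needed: None
```

Without the all-or-nothing rule, the second job would grab 12 GPUs and block forever waiting for its fourth worker, which is exactly the failure mode gang scheduling eliminates.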
Key insight: Kubernetes is the Swiss Army knife of orchestration — it does everything, but nothing perfectly for AI out of the box. You need additional tools (KAI Scheduler, GPU Operator, custom metrics) to make it AI-native. The ecosystem is converging fast, but expect 3–6 months of setup to get K8s right for GPU workloads.
Slurm: The HPC Workhorse
Battle-tested job scheduling for large-scale GPU clusters
What Is Slurm?
Slurm (Simple Linux Utility for Resource Management) is the dominant job scheduler in HPC and AI research. It manages job queues, allocates resources, and handles multi-node job execution. Meta, NVIDIA, and most national labs use Slurm for training clusters.

Key concepts:
Partitions: Groups of nodes (e.g., `gpu-h100`, `gpu-a100`, `cpu-only`)
Jobs: Submitted via `sbatch` scripts with resource requests
Accounts: Organizational units with quotas and priorities
Fair-share: Balances GPU time across accounts based on allocation
GRES (Generic Resources): GPU-aware scheduling (`--gres=gpu:h100:8`)
Slurm Job Example
#!/bin/bash
#SBATCH --job-name=llama-train
#SBATCH --partition=gpu-h100
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:h100:8
#SBATCH --time=72:00:00
#SBATCH --account=ml-research

module load cuda/12.4 nccl/2.21

srun torchrun --nproc_per_node=8 \
    --nnodes=4 \
    --rdzv_backend=c10d \
    train.py --model llama-70b \
    --batch_size 32 --lr 1e-4
Slurm vs Kubernetes for AI
Feature            Slurm              Kubernetes
──────────────────────────────────────────────────
Primary use        HPC/Training       Cloud-native/Inference
Gang scheduling    Native             Needs plugin
Fair-share         Built-in           Needs plugin
Multi-node MPI     Native (srun)      Complex (MPI Operator)
GPU topology       GRES + topology    Needs custom scheduler
Autoscaling        Limited            Native (HPA/Karpenter)
Service mesh       None               Istio/Envoy
Containerization   Optional           Required
Learning curve     Moderate           Steep
Ecosystem          HPC tools          Cloud-native tools
Project Slinky: Slurm on Kubernetes
Project Slinky (2025) bridges the two worlds by running Slurm jobs on Kubernetes infrastructure. Users submit familiar `sbatch` scripts, but execution happens in Kubernetes pods. This gives HPC users their preferred interface while leveraging K8s infrastructure, autoscaling, and cloud integration.

Rafay’s GPU PaaS extends this with self-service, multi-tenant Slurm clusters on Kubernetes — each user gets an isolated Slurm environment backed by shared GPU infrastructure.
Key insight: Slurm and Kubernetes aren’t competitors — they’re complementary. Slurm excels at batch training jobs (gang scheduling, fair-share, MPI). Kubernetes excels at serving workloads (autoscaling, service mesh, rolling updates). The trend is convergence: Slurm on Kubernetes gives you both.
Ray: Distributed AI Framework
From training to serving to data processing — one framework for the full AI pipeline
What Is Ray?
Ray is an open-source framework for building distributed AI applications. Unlike Slurm (job scheduler) or Kubernetes (container orchestrator), Ray is a programming framework that handles distribution transparently.

Ray ecosystem:
Ray Train: Distributed training (wraps PyTorch DDP, FSDP, DeepSpeed)
Ray Serve: Model serving with autoscaling and traffic routing
Ray Data: Distributed data preprocessing and loading
Ray Tune: Hyperparameter optimization at scale
RayCluster (K8s): Deploy Ray on Kubernetes with auto-scaling

Ray runs on top of Kubernetes or Slurm. It doesn’t replace them — it provides a higher-level abstraction for AI workloads.
Ray Train Example
import ray
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig

def train_func():
    # Standard PyTorch training code
    model = LlamaForCausalLM.from_pretrained(...)
    model = ray.train.torch.prepare_model(model)
    # Training loop...

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=32,
        use_gpu=True,
        resources_per_worker={"GPU": 1}
    )
)
result = trainer.fit()

# Ray handles: distribution, fault tolerance,
# checkpointing, and resource management.
# The same code runs on a laptop (1 GPU) or
# a cluster (1,000 GPUs).
When to Use Ray
Use Ray when: You need a unified framework for training + serving + data processing. When your team is Python-first and wants to avoid YAML/infrastructure complexity. When you need elastic training (add/remove GPUs mid-job).

Skip Ray when: You only need batch job scheduling (use Slurm). When you need fine-grained infrastructure control. When your team already has deep K8s/Slurm expertise.
Key insight: Ray is to distributed AI what React is to web development — a framework that handles the hard parts (distribution, state management, scaling) so you can focus on the logic. You still need infrastructure underneath (K8s, Slurm), but Ray makes the distributed parts feel like writing single-machine code.
GPU Utilization: Measuring and Improving
The metrics that matter and the levers that move them
Utilization Metrics
GPU Compute Utilization: Percentage of time the GPU’s SMs are active. Reported by `nvidia-smi`. Can be misleading — a GPU at 100% utilization might be running a poorly optimized kernel that uses 10% of peak FLOPS.

GPU Memory Utilization: Percentage of HBM in use. High memory utilization with low compute utilization indicates memory-bound workloads (common in inference).

SM Occupancy: Percentage of warps active on SMs. Better indicator of actual compute efficiency than utilization percentage.

MFU (Model FLOPS Utilization): Actual FLOPS achieved divided by theoretical peak. The gold standard for training efficiency. Good training runs achieve 40–60% MFU. World-class: 55–65%.
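The MFU definition can be made concrete with the standard estimate that training a dense transformer costs roughly 6 FLOPs per parameter per token (the model size, throughput, GPU count, and per-GPU peak below are illustrative assumptions, not measurements from the text):

```python
def mfu(params: float, tokens_per_sec: float, num_gpus: int, peak_flops: float) -> float:
    """Model FLOPS Utilization: achieved training FLOPs divided by theoretical peak.
    Uses the ~6 * params FLOPs-per-token estimate for dense transformers."""
    achieved = 6 * params * tokens_per_sec
    return achieved / (num_gpus * peak_flops)

# Illustrative: a 70B-parameter model on 512 GPUs, ~989 TFLOPS peak per GPU,
# sustaining 600K tokens/sec across the whole job
u = mfu(params=70e9, tokens_per_sec=6e5, num_gpus=512, peak_flops=989e12)
print(f"MFU = {u:.1%}")
```

With these assumed numbers the run lands near 50% MFU, inside the 40-60% band the text describes as a good training run.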
Common Utilization Killers
Problem                     Impact   Fix
──────────────────────────────────────────────────
Job queue gaps              10-30%   Better scheduling
Data loading stalls         5-20%    Prefetch, local NVMe
Communication overhead      10-25%   Overlap compute+comm
Unbalanced parallelism      5-15%    Tune TP/PP/DP ratios
Memory fragmentation        5-10%    PagedAttention, GC
Cold start / model loading  5-15%    Pre-pull, warm pools
Checkpointing stalls        2-10%    Async checkpointing
Debugging / idle sessions   10-30%   Timeout policies

Monitoring stack:
nvidia-smi / DCGM: GPU metrics
Prometheus: Time-series collection
Grafana: Dashboards
Custom exporters: MFU, queue depth, wait time

Target utilization by workload:
Training (large): 85-95% (well-scheduled)
Training (small): 60-80% (queue gaps)
Inference: 40-70% (traffic-dependent)
Development: 20-40% (interactive use)
Key insight: GPU utilization is like a restaurant’s table turnover rate. A full restaurant (100% utilization) doesn’t mean every table is profitable — some might have campers nursing one coffee for hours. The real metric is revenue per table-hour (MFU): how much useful work each GPU does per unit time, not just whether it’s busy.
Multi-Tenancy: Sharing GPUs Fairly
Quotas, priorities, fair-share, and isolation for shared GPU clusters
The Sharing Problem
Most organizations have multiple teams competing for GPU resources: research wants GPUs for experiments, production needs GPUs for inference, data science needs GPUs for fine-tuning. Without governance, the loudest team gets the most GPUs.

Fair-share scheduling allocates GPU time proportionally to each team’s quota. If Team A has a 60% quota and Team B has 40%, over time they’ll receive roughly that ratio of GPU-hours — even if demand fluctuates.
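This proportional split with borrowing can be sketched as a small water-filling allocator (an illustrative model, not any scheduler's actual code): each team gets GPUs in proportion to its quota, capped at what it actually wants, and leftover capacity is re-split among teams that still have demand.

```python
def fair_share(total_gpus: int, quotas: dict, demand: dict) -> dict:
    """Split GPUs proportionally to quota, capped at each team's demand;
    unused share is redistributed to still-hungry teams."""
    alloc = {t: 0 for t in quotas}
    active = {t for t in quotas if demand[t] > 0}
    remaining = total_gpus
    while remaining > 0 and active:
        total_q = sum(quotas[t] for t in active)
        gave = 0
        for t in list(active):
            share = int(remaining * quotas[t] / total_q)
            take = min(share, demand[t] - alloc[t])
            alloc[t] += take
            gave += take
            if alloc[t] >= demand[t]:
                active.discard(t)  # demand satisfied: stop claiming GPUs
        if gave == 0:
            break  # nothing more can be placed
        remaining -= gave
    return alloc

# Team A: 60% quota, Team B: 40% -- but B only wants 20 of 100 GPUs,
# so A borrows B's unused share.
print(fair_share(100, {"A": 0.6, "B": 0.4}, {"A": 100, "B": 20}))
```

With the 60/40 quotas above, B takes only the 20 GPUs it asked for and A absorbs the slack (80 GPUs), which is exactly the borrow-then-reclaim behavior described for hierarchical queues.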
Multi-Tenancy Patterns
1. Namespace isolation (K8s): Each team gets a namespace with resource quotas. Simple but coarse-grained — can’t share unused quota across teams.

2. Hierarchical queues (Slurm/KAI): Organization → Department → Team → User. Unused quota cascades up and can be borrowed by other teams. Reclaimed when the owner needs it (preemption).

3. Priority classes: Critical (production inference), High (training deadlines), Normal (research), Low (experimentation). Higher priority preempts lower priority with configurable grace periods.

4. GPU time banking: Teams accumulate “credits” for unused allocation. Can spend credits on burst usage later. Prevents use-it-or-lose-it hoarding.
Fair-Share Configuration Example
# Slurm fair-share configuration
Organization: AI Company (1,000 GPUs)

Account hierarchy:
research/      40% share (400 GPUs)
├── nlp/       50% of research (200 GPUs)
├── vision/    30% of research (120 GPUs)
└── speech/    20% of research (80 GPUs)
production/    40% share (400 GPUs)
├── inference/ 70% of prod (280 GPUs)
└── training/  30% of prod (120 GPUs)
exploration/   20% share (200 GPUs)

Preemption policy: production > research > exploration
Grace period: 5 min (checkpoint + exit)

Borrowing: If research is idle, production can use its GPUs.
Reclaimed within 5 min when research submits a job.
Key insight: Multi-tenancy for GPUs is like managing a shared kitchen in a co-living space. Without rules, one person monopolizes the stove all evening. Fair-share is a sign-up sheet: everyone gets their time slot, but if you don’t show up, others can use your slot. Preemption is the rule that dinner prep takes priority over midnight snacks.
Autoscaling: Matching Capacity to Demand
Scaling inference replicas, training workers, and infrastructure nodes
Three Levels of Autoscaling
1. Pod autoscaling (HPA): Scale inference replicas based on metrics (request rate, latency, GPU utilization). Kubernetes HPA with custom metrics from the inference engine. Reacts in seconds.

2. Node autoscaling (Karpenter/CA): Add or remove GPU nodes based on pending pods. Karpenter provisions the right instance type (H100 vs A100) based on pod requirements. Reacts in minutes (cloud) or not at all (on-prem).

3. Cluster autoscaling: Scale entire clusters or regions based on aggregate demand. Used by hyperscalers to shift workloads between data centers. Reacts in hours.
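The pod-level decision (level 1) follows Kubernetes' standard HPA rule, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured bounds. A minimal sketch with a pending-requests metric (the target of 10 per replica and the 2-16 bounds are illustrative):

```python
import math

def desired_replicas(current: int, metric: float, target: float,
                     min_r: int = 2, max_r: int = 16) -> int:
    """Kubernetes HPA scaling rule: scale so the per-pod metric approaches
    the target, clamped to [minReplicas, maxReplicas]."""
    want = math.ceil(current * metric / target)
    return max(min_r, min(max_r, want))

# 4 replicas averaging 25 pending requests each, target 10 per replica:
print(desired_replicas(current=4, metric=25, target=10))  # -> 10
```

Note the formula scales proportionally, not incrementally: a replica set that is 2.5x over target jumps straight to 2.5x the replicas (subject to the clamp), rather than adding one pod at a time.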
Inference Autoscaling Example
# K8s HPA for vLLM inference
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-inference
  minReplicas: 2
  maxReplicas: 16
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_pending_requests
      target:
        type: AverageValue
        averageValue: "10"
Autoscaling Challenges
Cold start latency: Loading a 70B model takes 30–120 seconds. New replicas can’t serve traffic during loading. Mitigation: keep warm spare replicas, use predictive scaling.

GPU node provisioning: Cloud GPU nodes take 2–5 minutes to provision (vs seconds for CPU). Plan for this lag in scaling policies.

Cost spikes: Aggressive autoscaling can cause unexpected cost spikes. Set hard limits on maximum replicas and implement cost alerts.

Scale-to-zero: Unloading models when traffic drops to zero saves cost but introduces cold start on the next request. Acceptable for internal tools, not for customer-facing APIs.
Scaling Strategies by Workload
Workload         Scaling Metric     Target
──────────────────────────────────────────────────
Chat inference   Pending requests   <10 per replica
Batch inference  Queue depth        <100 pending
Training         Not autoscaled     Fixed allocation
Fine-tuning      Job queue length   <5 waiting jobs

Predictive scaling:
Use historical traffic patterns
Pre-scale 15 min before expected peak
Reduces cold starts by 80-90%
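Predictive pre-scaling amounts to sizing the fleet for the traffic expected a cold-start interval from now, rather than the traffic arriving this instant. A toy sketch using yesterday's per-minute request rates (the 15-minute lead, per-replica capacity, and synthetic traffic are all illustrative assumptions):

```python
import math

def prescale_target(history_rps: list, now_min: int, lead_min: int = 15,
                    rps_per_replica: float = 50.0, min_r: int = 2) -> int:
    """Size the fleet for the traffic expected `lead_min` minutes from now,
    using yesterday's per-minute request rates, so new replicas finish
    loading before the rush arrives."""
    future = history_rps[(now_min + lead_min) % len(history_rps)]
    return max(min_r, math.ceil(future / rps_per_replica))

# Synthetic "yesterday": quiet (40 rps) until minute 20, then a 500-rps rush
yesterday = [40] * 20 + [500] * 20
print(prescale_target(yesterday, now_min=5))  # rush expected at minute 20 -> scale now
print(prescale_target(yesterday, now_min=0))  # minute 15 is still quiet -> stay small
```

At minute 5 the scaler already targets 10 replicas because the rush lands inside its 15-minute lookahead, which is how pre-scaling hides the model-loading cold start described above.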
Key insight: Autoscaling GPUs is like staffing a restaurant. You need enough chefs (replicas) for the dinner rush but don’t want to pay them during the 3 PM lull. The trick is predicting the rush: look at yesterday’s reservations (historical traffic) and start prepping (pre-scaling) before the first customer arrives.
Building an AI Platform
Putting it all together: the reference architecture for GPU orchestration
Reference Platform Stack
Layer 5: User Interface
  JupyterHub, CLI, Web UI, API
  Job submission, monitoring, model registry
Layer 4: AI Frameworks
  Ray Train/Serve, DeepSpeed, Megatron-LM
  vLLM, TensorRT-LLM, SGLang
Layer 3: Orchestration
  Kubernetes + KAI Scheduler (inference)
  Slurm or Project Slinky (training)
  Karpenter (node autoscaling)
Layer 2: Infrastructure
  NVIDIA GPU Operator, Device Plugin
  InfiniBand / RoCEv2 networking
  Lustre / GPFS storage, S3
Layer 1: Hardware
  DGX H100/H200, GB200 NVL72
  InfiniBand switches, NVMe storage
  Power, cooling, physical security
Platform Maturity Levels
Level 1 — Manual (most startups):
  SSH into GPU machines, run scripts
  No scheduling, no monitoring
  Utilization: 20-30%
Level 2 — Basic Orchestration:
  Kubernetes + GPU plugin, or Slurm
  Basic job queuing, manual scaling
  Utilization: 40-60%
Level 3 — Managed Platform:
  Fair-share scheduling, autoscaling
  Monitoring dashboards, cost tracking
  Multi-tenancy with quotas
  Utilization: 60-80%
Level 4 — Optimized Platform:
  Topology-aware scheduling
  Predictive autoscaling
  Automated failure recovery
  Cost optimization (spot, preemption)
  Utilization: 80-95%

Each level requires ~6 months and 1-2 FTEs.
Key insight: An AI platform is never “done” — it’s a living system that evolves with your organization. Start at Level 1 (just get GPUs working), invest in Level 2 when you hit 10+ GPUs, and build toward Level 3–4 as your fleet grows. The ROI of each level is clear: every 10% utilization improvement on 100 GPUs saves ~$200K/year.