Reference Platform Stack
Layer 5: User Interface
JupyterHub, CLI, Web UI, API
Job submission, monitoring, model registry
Layer 4: AI Frameworks
Ray Train/Serve, DeepSpeed, Megatron-LM
vLLM, TensorRT-LLM, SGLang
Layer 3: Orchestration
Kubernetes + KAI Scheduler (inference)
Slurm or Project Slinky (training)
Karpenter (node autoscaling)
Layer 2: Infrastructure
NVIDIA GPU Operator, Device Plugin
InfiniBand / RoCEv2 networking
Lustre / GPFS storage, S3
Layer 1: Hardware
DGX H100/H200, GB200 NVL72
InfiniBand switches, NVMe storage
Power, cooling, physical security
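
For readers who want the stack in machine-readable form, the layers above can be sketched as plain data. This is a minimal illustration, not part of any listed product: the names Layer, STACK, SCHEDULER_BY_WORKLOAD, scheduler_for, and layers_below are hypothetical, and the only facts encoded are the ones the stack states, including the Layer 3 pairing of inference with Kubernetes + KAI Scheduler and training with Slurm or Project Slinky.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """One layer of the reference stack (hypothetical helper type)."""
    number: int
    name: str
    components: tuple[str, ...]


# Transcribed from the stack above, top layer first.
# The data layout is an assumption made for illustration only.
STACK = (
    Layer(5, "User Interface", ("JupyterHub", "CLI", "Web UI", "API")),
    Layer(4, "AI Frameworks", ("Ray Train/Serve", "DeepSpeed", "Megatron-LM",
                               "vLLM", "TensorRT-LLM", "SGLang")),
    Layer(3, "Orchestration", ("Kubernetes + KAI Scheduler",
                               "Slurm or Project Slinky", "Karpenter")),
    Layer(2, "Infrastructure", ("NVIDIA GPU Operator", "Device Plugin",
                                "InfiniBand / RoCEv2", "Lustre / GPFS", "S3")),
    Layer(1, "Hardware", ("DGX H100/H200", "GB200 NVL72",
                          "InfiniBand switches", "NVMe storage")),
)

# Workload-to-scheduler pairing exactly as annotated in Layer 3.
SCHEDULER_BY_WORKLOAD = {
    "inference": "Kubernetes + KAI Scheduler",
    "training": "Slurm or Project Slinky",
}


def scheduler_for(workload: str) -> str:
    """Return the Layer 3 scheduler the stack assigns to a workload type."""
    try:
        return SCHEDULER_BY_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"unknown workload type: {workload!r}") from None


def layers_below(number: int) -> list[str]:
    """Names of the layers that a given layer sits on, nearest first."""
    return [layer.name for layer in STACK if layer.number < number]
```

For example, scheduler_for("training") returns "Slurm or Project Slinky", and layers_below(3) lists Infrastructure and then Hardware, the two layers that Orchestration depends on.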