Ch 8 — Training Infrastructure & Tools

HuggingFace ecosystem, Axolotl, Unsloth, cloud platforms, experiment tracking, and managed services
The HuggingFace Ecosystem
The standard stack for LLM fine-tuning
Core Libraries
Transformers: Model loading, tokenization, and inference. Supports 200K+ models on the Hub. The foundation of the ecosystem.

PEFT: Parameter-Efficient Fine-Tuning. LoRA, QLoRA, DoRA, IA3, prefix tuning. Integrates seamlessly with Transformers.

TRL: Transformer Reinforcement Learning. SFTTrainer, DPOTrainer, ORPOTrainer, KTOTrainer, PPOTrainer, RewardTrainer. The alignment library.

Datasets: Data loading, processing, and streaming. Supports Arrow format for memory-efficient processing of large datasets.

Accelerate: Distributed training abstraction. DeepSpeed, FSDP, multi-GPU, multi-node. One config file to switch backends.
The HuggingFace Hub
Model Hub: 700K+ models. Download any model with one line: AutoModelForCausalLM.from_pretrained("model-name").

Dataset Hub: 100K+ datasets. Load any dataset: load_dataset("dataset-name").

Spaces: Host demos and apps. Gradio and Streamlit integration.

Model cards: Documentation, benchmarks, and usage instructions for every model.
The HuggingFace stack is the de facto standard. Nearly every open-source fine-tuning project uses Transformers + PEFT + TRL. Learning this stack is the single most valuable investment for LLM fine-tuning. All the code examples in this course use these libraries.
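Putting the pieces together, a LoRA fine-tune with this stack fits in one short script. A minimal sketch — the model and dataset names are placeholders, the hyperparameters are illustrative, and the exact SFTTrainer signature shifts slightly between TRL versions:

```python
def format_example(example):
    """Flatten one instruction/response record into a single training string."""
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )

def train():
    # Heavy imports live inside the function so the formatter above can be
    # read and tested without a GPU environment.
    from datasets import load_dataset
    from peft import LoraConfig
    from trl import SFTConfig, SFTTrainer

    ds = load_dataset("my-org/my-sft-data", split="train")  # placeholder dataset
    ds = ds.map(lambda ex: {"text": format_example(ex)})

    trainer = SFTTrainer(
        model="meta-llama/Llama-3.1-8B",  # any causal LM on the Hub
        train_dataset=ds,
        args=SFTConfig(
            output_dir="./sft-out",
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,
            num_train_epochs=1,
            learning_rate=2e-4,
            report_to="wandb",  # experiment tracking in one argument
        ),
        peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
    )
    trainer.train()

# Call train() on a GPU machine; the formatter can be checked anywhere:
sample = format_example({"instruction": "Say hi", "response": "Hi!"})
```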
Training Frameworks
Higher-level tools that simplify the training workflow
Axolotl
What: A YAML-driven fine-tuning framework built on top of HuggingFace. Define your entire training run in a single YAML config file.

Strengths: Supports SFT, DPO, ORPO, LoRA, QLoRA, full fine-tuning, DeepSpeed, FSDP, Flash Attention, sequence packing, multi-dataset mixing. Very popular in the open-source community.

Best for: Practitioners who want a battle-tested, config-driven workflow without writing Python training scripts. Used to train many models near the top of the Open LLM Leaderboard.
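For example, a QLoRA run can be described in one short config. A sketch — field names follow common Axolotl examples (check the docs for your version), and the model and dataset paths are placeholders:

```yaml
base_model: meta-llama/Llama-3.1-8B
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
datasets:
  - path: my-org/my-sft-data   # placeholder dataset on the Hub
    type: alpaca
sequence_len: 2048
sample_packing: true
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 1
learning_rate: 0.0002
output_dir: ./outputs/qlora-run
```

Launched with `accelerate launch -m axolotl.cli.train config.yml` (newer releases also ship an `axolotl train` entry point).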
Unsloth
What: A speed-optimized fine-tuning library. Claims 2x faster training and 60% less memory through hand-written Triton kernels and a manually derived backpropagation pass.

Strengths: Fastest LoRA/QLoRA training available. Free tier for single GPU. Supports Llama, Mistral, Gemma, Phi, Qwen. Notebook-friendly (great for Colab/Kaggle).

Best for: Single-GPU training where speed and memory efficiency matter most. Great for prototyping on consumer hardware or free cloud notebooks.
LLaMA-Factory
What: A unified fine-tuning framework with a web UI. Supports 100+ models, all major training methods, and includes a visual training dashboard.

Strengths: Web-based configuration (no YAML or code needed). Built-in evaluation. Supports SFT, RLHF, DPO, LoRA, QLoRA, full FT. Good for teams with mixed technical levels.

Best for: Teams that want a visual interface for fine-tuning without deep Python expertise.
Quick comparison:
Axolotl — YAML config driven. Battle-tested at scale. Multi-GPU/node. Community favorite.
Unsloth — Speed optimized. 2x faster, 60% less memory. Single-GPU focus. Great for prototyping.
LLaMA-Factory — Web UI. 100+ models. No code needed. Good for teams.
Raw TRL/HF — Maximum control. Python scripts. Most flexible. For experts.
Cloud GPU Providers
Where to rent GPUs for fine-tuning
GPU Cloud Providers
Lambda Labs: Simple pricing, good availability. A100/H100 instances. Popular for ML research. $1.10/hr for A100 80GB.

RunPod: On-demand and spot GPUs. Community cloud (cheaper) and secure cloud. Good API. $0.44/hr for A100 80GB (community).

Together AI: Fine-tuning API + GPU cloud. Can fine-tune through their API without managing infrastructure. Also offers raw GPU access.

Vast.ai: Marketplace for GPU rentals. Cheapest option but variable quality. Good for experimentation, not production.

Modal: Serverless GPU compute. Pay per second. Great for burst workloads and CI/CD. No idle costs.
Hyperscalers
AWS (SageMaker): Most mature ML platform. SageMaker Training Jobs handle infrastructure. Expensive but reliable. Spot instances available.

Google Cloud (Vertex AI): Good TPU access. Vertex AI Training for managed jobs. GKE for custom setups.

Azure (Azure ML): Best for enterprise. Azure ML managed training. Good integration with OpenAI ecosystem.

Pricing comparison (A100, on-demand, per GPU):
Lambda: ~$1.10/hr (80 GB)
RunPod: ~$1.64/hr secure, ~$0.44/hr community (80 GB)
AWS: ~$4.10/hr (p4d.24xlarge, billed as a full 8-GPU node; note these are 40 GB A100s — 80 GB requires p4de)
GCP: ~$3.67/hr (a2-highgpu-1g, 40 GB A100)
For most fine-tuning: Use Lambda Labs or RunPod for simplicity and cost. Use spot/preemptible instances (50-70% cheaper) and save checkpoints frequently. Only use hyperscalers if you need enterprise features (compliance, VPC, IAM) or are already in their ecosystem.
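The rates above make run budgeting simple arithmetic. A small sketch using those figures — they are snapshots that drift over time, so always check current pricing:

```python
# Per-GPU-hour, on-demand rates from the comparison above (illustrative snapshots).
RATES = {
    "lambda_a100_80gb": 1.10,
    "runpod_secure_a100_80gb": 1.64,
    "runpod_community_a100_80gb": 0.44,
}

def run_cost(rate, gpus, hours, spot_discount=0.0):
    """Total cost of a training run; spot_discount=0.6 models ~60% spot savings."""
    return rate * gpus * hours * (1.0 - spot_discount)

# A 20-hour, single-GPU QLoRA run on RunPod community cloud:
community = run_cost(RATES["runpod_community_a100_80gb"], gpus=1, hours=20)  # $8.80
# The same run on-demand on Lambda:
lambda_cost = run_cost(RATES["lambda_a100_80gb"], gpus=1, hours=20)          # $22.00
```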
Experiment Tracking & Monitoring
Logging metrics, comparing runs, and debugging training
Weights & Biases (WandB)
The most popular experiment tracker for LLM training. Automatic logging of loss, learning rate, GPU utilization, and custom metrics. Compare runs side-by-side. Share results with team.

Integration: One line in TrainingArguments: report_to="wandb". HuggingFace Trainer logs everything automatically.

Key features: Loss curves, gradient norms, learning rate schedules, system metrics (GPU memory, utilization), hyperparameter sweeps, model artifact tracking.
TensorBoard
Free, open-source alternative. Built into PyTorch and HuggingFace. Less polished UI than WandB but no account needed. Good for local development.

Integration: report_to="tensorboard" in TrainingArguments. View with tensorboard --logdir ./runs.
What to Track
Training metrics:
- train/loss (should decrease smoothly)
- eval/loss (should decrease, then plateau)
- learning_rate (verify schedule is correct)
- grad_norm (should be stable, not exploding)

For DPO/alignment:
- rewards/chosen and rewards/rejected
- rewards/margins and rewards/accuracies

System metrics:
- GPU memory utilization (aim for 80-95%)
- GPU compute utilization (aim for >70%)
- Throughput (tokens/second or samples/second)
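These expectations can be encoded as quick sanity checks over logged metrics. A sketch — the field names mirror HF Trainer logs, and the thresholds are illustrative:

```python
def check_run(history, grad_norm_max=10.0):
    """Flag common training pathologies.

    history: list of per-logging-step dicts like {'loss': ..., 'grad_norm': ...}.
    Returns a list of warning strings (empty for a healthy-looking run).
    """
    warnings = []
    losses = [h["loss"] for h in history]
    if losses and losses[-1] >= losses[0]:
        warnings.append("train/loss did not decrease over the run")
    if any(h["grad_norm"] > grad_norm_max for h in history):
        warnings.append("grad_norm spiked above threshold (possible instability)")
    return warnings

steps = [
    {"loss": 2.1, "grad_norm": 1.3},
    {"loss": 1.7, "grad_norm": 1.1},
    {"loss": 1.4, "grad_norm": 0.9},
]
healthy = check_run(steps)  # loss falls, gradients stable -> no warnings
```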
Always use experiment tracking. Even for quick experiments. You will forget hyperparameters, lose track of which run was best, and waste time re-running experiments. WandB free tier is sufficient for individual use. TensorBoard is fine for local development.
Managed Fine-Tuning Services
Fine-tune through an API without managing infrastructure
OpenAI Fine-Tuning
Models: GPT-4o, GPT-4o-mini, GPT-3.5-turbo
Method: Upload JSONL, configure epochs, launch via API
Cost: GPT-4o-mini: $3/1M training tokens. GPT-4o: $25/1M training tokens
Pros: Simplest possible workflow. No GPU management. Good for production.
Cons: Little control over the training method (no custom LoRA configuration; you get whatever options OpenAI exposes). Black box. Model stays on OpenAI servers. Limited to their models.
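The uploaded JSONL uses OpenAI's documented chat format: one JSON object per line, each holding a `messages` list. A sketch (the content itself is made up):

```python
import json

# One training example for OpenAI chat fine-tuning. The upload file is
# JSONL: one object like this per line.
example = {
    "messages": [
        {"role": "system", "content": "You are a concise support assistant."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Open Settings > Security and choose Reset password."},
    ]
}
line = json.dumps(example)  # append one such line per example to train.jsonl
```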
Together AI Fine-Tuning
Models: Llama 3, Mistral, Qwen, and other open-source models
Method: Upload data, select model, configure via API or UI
Cost: Varies by model size. Competitive with self-hosted.
Pros: Open-source models. More control than OpenAI. Can download the fine-tuned model.
Cons: Less mature than OpenAI. Limited customization.
Other Managed Services
Fireworks AI: Fast fine-tuning and inference. Good for LoRA adapters. Supports adapter serving (multiple LoRAs on one base model).

Anyscale: Ray-based training platform. Good for large-scale distributed training.

AWS SageMaker JumpStart: One-click fine-tuning for popular models. Integrated with AWS ecosystem.

Google Vertex AI: Fine-tune Gemini models through the API. Also supports open-source models on GKE.
When to use managed services: (1) You don't want to manage GPUs. (2) You need production reliability. (3) Your team lacks ML infrastructure expertise. (4) You're fine-tuning closed models (GPT-4o). When to self-host: (1) You need full control. (2) Cost optimization matters. (3) Data privacy requirements. (4) You need DPO/RLHF/custom methods.
Local & Consumer Hardware
Fine-tuning on your own machine
Consumer GPUs for Fine-Tuning
NVIDIA RTX 4090 (24 GB): Best consumer GPU for fine-tuning. QLoRA on 7B-13B models. Can handle 70B with aggressive quantization. ~$1,600.

NVIDIA RTX 3090 (24 GB): Previous gen but still capable. QLoRA on 7B models; slower than the 4090 but the same 24 GB of VRAM, and as an Ampere card it supports bf16. ~$800 used.

NVIDIA RTX 4080 (16 GB): QLoRA on 7B models (tight). Not recommended for larger models.

Apple M-series (unified memory): MLX framework supports fine-tuning. M2 Ultra (192 GB) can fine-tune 70B models. Slower than NVIDIA, but the large unified memory pool removes the usual VRAM ceiling. Good for experimentation.
Free Cloud Options
Google Colab (free): T4 GPU (16 GB). QLoRA on 7B models. Limited to ~12 hours per session. Good for learning.

Google Colab Pro ($10/mo): Access to better GPUs, including A100 40GB when available (billed via compute units). Much better for fine-tuning. Longer sessions.

Kaggle Notebooks (free): 2x T4 GPUs (16 GB each). 30 hours/week. Good for competitions and experimentation.

Lightning AI Studios (free tier): Free GPU access with persistent storage. Good for development.
The QLoRA revolution: Before QLoRA (2023), fine-tuning 7B models required 80+ GB of GPU memory. Now, QLoRA fits on a 16 GB consumer GPU. This democratized fine-tuning. You can prototype on Colab, iterate on a 4090, and scale to cloud A100s for production training.
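The arithmetic behind that claim is straightforward. A ballpark sketch — the bytes-per-parameter figures are rough and ignore framework-specific overhead:

```python
def weights_gb(n_params_billion, bytes_per_param):
    """GB needed to hold model weights at a given precision."""
    return n_params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes-per-GB

# Full fine-tuning in mixed precision: ~16 bytes/param once fp16 weights and
# gradients plus fp32 master weights and Adam moments are counted.
full_ft_7b = weights_gb(7, 16.0)   # ~112 GB -> needs multiple 80 GB GPUs

fp16_7b = weights_gb(7, 2.0)       # ~14 GB just to load weights in fp16
nf4_7b = weights_gb(7, 0.5)        # ~3.5 GB for 4-bit (NF4) base weights

# QLoRA trains only small LoRA adapters on top of the frozen 4-bit base, so
# adding a few GB for adapters, activations, and CUDA overhead still fits
# comfortably inside a 16 GB consumer GPU.
```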
Choosing Your Stack
Recommendations based on your situation
Beginner / Learning
Hardware: Google Colab (free) or Kaggle
Framework: Unsloth (fastest, notebook-friendly)
Tracking: WandB free tier
Cost: $0
Individual Practitioner
Hardware: RTX 4090 (local) or RunPod (cloud)
Framework: TRL + PEFT (maximum control) or Axolotl (config-driven)
Tracking: WandB
Cost: $0-$50/month
Startup / Small Team
Hardware: Lambda Labs or RunPod (multi-GPU)
Framework: Axolotl or raw TRL + Accelerate
Tracking: WandB Teams
Cost: $200-$2,000/month
Enterprise
Hardware: AWS SageMaker, Azure ML, or GCP Vertex AI
Framework: Custom TRL scripts or managed fine-tuning
Tracking: WandB Enterprise or MLflow
Cost: $2,000+/month
Just Want It Done
Closed models: OpenAI fine-tuning API (simplest)
Open models: Together AI fine-tuning API
Tracking: Built into the platform
Cost: Pay per training token
Start simple, scale up. Begin with Colab + Unsloth to learn. Move to Axolotl or TRL when you need more control. Use cloud GPUs when local hardware is insufficient. Use managed services when you don't want to manage infrastructure. The tools are mature enough that the bottleneck is usually data quality, not infrastructure.