Ch 8 — Training Infrastructure & Tools

HuggingFace ecosystem, Axolotl, Unsloth, cloud platforms, experiment tracking, and managed services
The HuggingFace Ecosystem
The standard stack for LLM fine-tuning
Core Libraries
Transformers: Model loading, tokenization, and inference. Supports 200K+ models on the Hub. The foundation of the ecosystem.

PEFT: Parameter-Efficient Fine-Tuning. LoRA, QLoRA, DoRA, IA3, prefix tuning. Integrates seamlessly with Transformers.

TRL: Transformer Reinforcement Learning. SFTTrainer, DPOTrainer, ORPOTrainer, KTOTrainer, PPOTrainer, RewardTrainer. The alignment library.

Datasets: Data loading, processing, and streaming. Supports Arrow format for memory-efficient processing of large datasets.

Accelerate: Distributed training abstraction. DeepSpeed, FSDP, multi-GPU, multi-node. One config file to switch backends.
The HuggingFace Hub
Model Hub: 700K+ models. Download any model with one line: AutoModelForCausalLM.from_pretrained("model-name").

Dataset Hub: 100K+ datasets. Load any dataset: load_dataset("dataset-name").

Spaces: Host demos and apps. Gradio and Streamlit integration.

Model cards: Documentation, benchmarks, and usage instructions for every model.
The HuggingFace stack is the de facto standard. Nearly every open-source fine-tuning project uses Transformers + PEFT + TRL. Learning this stack is the single most valuable investment for LLM fine-tuning. All the code examples in this course use these libraries.
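Putting the pieces together, a LoRA fine-tune with this stack fits in one short script. A minimal sketch — the model and dataset names are placeholders, the hyperparameters are illustrative, and the exact SFTTrainer signature shifts slightly between TRL versions:

```python
def format_example(example):
    """Flatten one instruction/response record into a single training string."""
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )

def train():
    # Heavy imports live inside the function so the formatter above can be
    # read and tested without a GPU environment.
    from datasets import load_dataset
    from peft import LoraConfig
    from trl import SFTConfig, SFTTrainer

    ds = load_dataset("my-org/my-sft-data", split="train")  # placeholder dataset
    ds = ds.map(lambda ex: {"text": format_example(ex)})

    trainer = SFTTrainer(
        model="meta-llama/Llama-3.1-8B",  # any causal LM on the Hub
        train_dataset=ds,
        args=SFTConfig(
            output_dir="./sft-out",
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,
            num_train_epochs=1,
            learning_rate=2e-4,
            report_to="wandb",  # experiment tracking in one argument
        ),
        peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
    )
    trainer.train()

# Call train() on a GPU machine; the formatter can be checked anywhere:
sample = format_example({"instruction": "Say hi", "response": "Hi!"})
```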
Training Frameworks
Higher-level tools that simplify the training workflow
Axolotl
What: A YAML-driven fine-tuning framework built on top of HuggingFace. Define your entire training run in a single YAML config file.

Strengths: Supports SFT, DPO, ORPO, LoRA, QLoRA, full fine-tuning, DeepSpeed, FSDP, Flash Attention, sequence packing, multi-dataset mixing. Very popular in the open-source community.

Best for: Practitioners who want a battle-tested, config-driven workflow without writing Python training scripts. Used to train many models near the top of the Open LLM Leaderboard.
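For example, a QLoRA run can be described in one short config. A sketch — field names follow common Axolotl examples (check the docs for your version), and the model and dataset paths are placeholders:

```yaml
base_model: meta-llama/Llama-3.1-8B
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
datasets:
  - path: my-org/my-sft-data   # placeholder dataset on the Hub
    type: alpaca
sequence_len: 2048
sample_packing: true
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 1
learning_rate: 0.0002
output_dir: ./outputs/qlora-run
```

Launched with `accelerate launch -m axolotl.cli.train config.yml` (newer releases also ship an `axolotl train` entry point).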
Unsloth
What: A speed-optimized fine-tuning library. Claims 2x faster training and 60% less memory through hand-written Triton kernels and a manually derived backpropagation pass.

Strengths: Fastest LoRA/QLoRA training available. Free tier for single GPU. Supports Llama, Mistral, Gemma, Phi, Qwen. Notebook-friendly (great for Colab/Kaggle).

Best for: Single-GPU training where speed and memory efficiency matter most. Great for prototyping on consumer hardware or free cloud notebooks.
LLaMA-Factory
What: A unified fine-tuning framework with a web UI. Supports 100+ models, all major training methods, and includes a visual training dashboard.

Strengths: Web-based configuration (no YAML or code needed). Built-in evaluation. Supports SFT, RLHF, DPO, LoRA, QLoRA, full FT. Good for teams with mixed technical levels.

Best for: Teams that want a visual interface for fine-tuning without deep Python expertise.
Quick comparison:
Axolotl — YAML config driven. Battle-tested at scale. Multi-GPU/node. Community favorite.
Unsloth — Speed optimized. 2x faster, 60% less memory. Single-GPU focus. Great for prototyping.
LLaMA-Factory — Web UI. 100+ models. No code needed. Good for teams.
Raw TRL/HF — Maximum control. Python scripts. Most flexible. For experts.
Cloud GPU Providers
Where to rent GPUs for fine-tuning
GPU Cloud Providers
Lambda Labs: Simple pricing, good availability. A100/H100 instances. Popular for ML research. $1.10/hr for A100 80GB.

RunPod: On-demand and spot GPUs. Community cloud (cheaper) and secure cloud. Good API. $0.44/hr for A100 80GB (community).

Together AI: Fine-tuning API + GPU cloud. Can fine-tune through their API without managing infrastructure. Also offers raw GPU access.

Vast.ai: Marketplace for GPU rentals. Cheapest option but variable quality. Good for experimentation, not production.

Modal: Serverless GPU compute. Pay per second. Great for burst workloads and CI/CD. No idle costs.
Hyperscalers
AWS (SageMaker): Most mature ML platform. SageMaker Training Jobs handle infrastructure. Expensive but reliable. Spot instances available.

Google Cloud (Vertex AI): Good TPU access. Vertex AI Training for managed jobs. GKE for custom setups.

Azure (Azure ML): Best for enterprise. Azure ML managed training. Good integration with OpenAI ecosystem.

Pricing comparison (A100, on-demand, per GPU):
Lambda: ~$1.10/hr (80 GB)
RunPod: ~$1.64/hr secure, ~$0.44/hr community (80 GB)
AWS: ~$4.10/hr (p4d.24xlarge, billed as a full 8-GPU node; note these are 40 GB A100s — 80 GB requires p4de)
GCP: ~$3.67/hr (a2-highgpu-1g, 40 GB A100)
For most fine-tuning: Use Lambda Labs or RunPod for simplicity and cost. Use spot/preemptible instances (50-70% cheaper) and save checkpoints frequently. Only use hyperscalers if you need enterprise features (compliance, VPC, IAM) or are already in their ecosystem.
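The rates above make run budgeting simple arithmetic. A small sketch using those figures — they are snapshots that drift over time, so always check current pricing:

```python
# Per-GPU-hour, on-demand rates from the comparison above (illustrative snapshots).
RATES = {
    "lambda_a100_80gb": 1.10,
    "runpod_secure_a100_80gb": 1.64,
    "runpod_community_a100_80gb": 0.44,
}

def run_cost(rate, gpus, hours, spot_discount=0.0):
    """Total cost of a training run; spot_discount=0.6 models ~60% spot savings."""
    return rate * gpus * hours * (1.0 - spot_discount)

# A 20-hour, single-GPU QLoRA run on RunPod community cloud:
community = run_cost(RATES["runpod_community_a100_80gb"], gpus=1, hours=20)  # $8.80
# The same run on-demand on Lambda:
lambda_cost = run_cost(RATES["lambda_a100_80gb"], gpus=1, hours=20)          # $22.00
```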
Experiment Tracking & Monitoring
Logging metrics, comparing runs, and debugging training
Weights & Biases (WandB)
The most popular experiment tracker for LLM training. Automatic logging of loss, learning rate, GPU utilization, and custom metrics. Compare runs side-by-side. Share results with team.

Integration: One line in TrainingArguments: report_to="wandb". HuggingFace Trainer logs everything automatically.

Key features: Loss curves, gradient norms, learning rate schedules, system metrics (GPU memory, utilization), hyperparameter sweeps, model artifact tracking.
TensorBoard
Free, open-source alternative. Built into PyTorch and HuggingFace. Less polished UI than WandB but no account needed. Good for local development.

Integration: report_to="tensorboard" in TrainingArguments. View with tensorboard --logdir ./runs.
What to Track
Training metrics:
- train/loss (should decrease smoothly)
- eval/loss (should decrease, then plateau)
- learning_rate (verify schedule is correct)
- grad_norm (should be stable, not exploding)

For DPO/alignment:
- rewards/chosen and rewards/rejected
- rewards/margins and rewards/accuracies

System metrics:
- GPU memory utilization (aim for 80-95%)
- GPU compute utilization (aim for >70%)
- Throughput (tokens/second or samples/second)
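These expectations can be encoded as quick sanity checks over logged metrics. A sketch — the field names mirror HF Trainer logs, and the thresholds are illustrative:

```python
def check_run(history, grad_norm_max=10.0):
    """Flag common training pathologies.

    history: list of per-logging-step dicts like {'loss': ..., 'grad_norm': ...}.
    Returns a list of warning strings (empty for a healthy-looking run).
    """
    warnings = []
    losses = [h["loss"] for h in history]
    if losses and losses[-1] >= losses[0]:
        warnings.append("train/loss did not decrease over the run")
    if any(h["grad_norm"] > grad_norm_max for h in history):
        warnings.append("grad_norm spiked above threshold (possible instability)")
    return warnings

steps = [
    {"loss": 2.1, "grad_norm": 1.3},
    {"loss": 1.7, "grad_norm": 1.1},
    {"loss": 1.4, "grad_norm": 0.9},
]
healthy = check_run(steps)  # loss falls, gradients stable -> no warnings
```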
Always use experiment tracking. Even for quick experiments. You will forget hyperparameters, lose track of which run was best, and waste time re-running experiments. WandB free tier is sufficient for individual use. TensorBoard is fine for local development.
Managed Fine-Tuning Services
Fine-tune through an API without managing infrastructure
OpenAI Fine-Tuning
Models: GPT-4o, GPT-4o-mini, GPT-3.5-turbo
Method: Upload JSONL, configure epochs, launch via API
Cost: GPT-4o-mini: $3/1M training tokens. GPT-4o: $25/1M training tokens
Pros: Simplest possible workflow. No GPU management. Good for production.
Cons: Little control over the training method (no custom LoRA configuration; you get whatever options OpenAI exposes). Black box. Model stays on OpenAI servers. Limited to their models.
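The uploaded JSONL uses OpenAI's documented chat format: one JSON object per line, each holding a `messages` list. A sketch (the content itself is made up):

```python
import json

# One training example for OpenAI chat fine-tuning. The upload file is
# JSONL: one object like this per line.
example = {
    "messages": [
        {"role": "system", "content": "You are a concise support assistant."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Open Settings > Security and choose Reset password."},
    ]
}
line = json.dumps(example)  # append one such line per example to train.jsonl
```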
Together AI Fine-Tuning
Models: Llama 3, Mistral, Qwen, and other open-source models
Method: Upload data, select model, configure via API or UI
Cost: Varies by model size. Competitive with self-hosted.
Pros: Open-source models. More control than OpenAI. Can download the fine-tuned model.
Cons: Less mature than OpenAI. Limited customization.
Other Managed Services
Fireworks AI: Fast fine-tuning and inference. Good for LoRA adapters. Supports adapter serving (multiple LoRAs on one base model).

Anyscale: Ray-based training platform. Good for large-scale distributed training.

AWS SageMaker JumpStart: One-click fine-tuning for popular models. Integrated with AWS ecosystem.

Google Vertex AI: Fine-tune Gemini models through the API. Also supports open-source models on GKE.
When to use managed services: (1) You don't want to manage GPUs. (2) You need production reliability. (3) Your team lacks ML infrastructure expertise. (4) You're fine-tuning closed models (GPT-4o). When to self-host: (1) You need full control. (2) Cost optimization matters. (3) Data privacy requirements. (4) You need DPO/RLHF/custom methods.
Local & Consumer Hardware
Fine-tuning on your own machine
Consumer GPUs for Fine-Tuning
NVIDIA RTX 4090 (24 GB): Best consumer GPU for fine-tuning. QLoRA on 7B-13B models. Can handle 70B with aggressive quantization. ~$1,600.

NVIDIA RTX 3090 (24 GB): Previous gen but still capable. QLoRA on 7B models; slower than the 4090 but the same 24 GB of VRAM, and as an Ampere card it supports bf16. ~$800 used.

NVIDIA RTX 4080 (16 GB): QLoRA on 7B models (tight). Not recommended for larger models.

Apple M-series (unified memory): MLX framework supports fine-tuning. M2 Ultra (192 GB) can fine-tune 70B models. Slower than NVIDIA, but the large unified memory pool removes the usual VRAM ceiling. Good for experimentation.
Free Cloud Options
Google Colab (free): T4 GPU (16 GB). QLoRA on 7B models. Limited to ~12 hours per session. Good for learning.

Google Colab Pro ($10/mo): Access to better GPUs, including A100 40GB when available (billed via compute units). Much better for fine-tuning. Longer sessions.

Kaggle Notebooks (free): 2x T4 GPUs (16 GB each). 30 hours/week. Good for competitions and experimentation.

Lightning AI Studios (free tier): Free GPU access with persistent storage. Good for development.
The QLoRA revolution: Before QLoRA (2023), fine-tuning 7B models required 80+ GB of GPU memory. Now, QLoRA fits on a 16 GB consumer GPU. This democratized fine-tuning. You can prototype on Colab, iterate on a 4090, and scale to cloud A100s for production training.
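The arithmetic behind that claim is straightforward. A ballpark sketch — the bytes-per-parameter figures are rough and ignore framework-specific overhead:

```python
def weights_gb(n_params_billion, bytes_per_param):
    """GB needed to hold model weights at a given precision."""
    return n_params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes-per-GB

# Full fine-tuning in mixed precision: ~16 bytes/param once fp16 weights and
# gradients plus fp32 master weights and Adam moments are counted.
full_ft_7b = weights_gb(7, 16.0)   # ~112 GB -> needs multiple 80 GB GPUs

fp16_7b = weights_gb(7, 2.0)       # ~14 GB just to load weights in fp16
nf4_7b = weights_gb(7, 0.5)        # ~3.5 GB for 4-bit (NF4) base weights

# QLoRA trains only small LoRA adapters on top of the frozen 4-bit base, so
# adding a few GB for adapters, activations, and CUDA overhead still fits
# comfortably inside a 16 GB consumer GPU.
```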
Choosing Your Stack
Recommendations based on your situation
Beginner / Learning
Hardware: Google Colab (free) or Kaggle
Framework: Unsloth (fastest, notebook-friendly)
Tracking: WandB free tier
Cost: $0
Individual Practitioner
Hardware: RTX 4090 (local) or RunPod (cloud)
Framework: TRL + PEFT (maximum control) or Axolotl (config-driven)
Tracking: WandB
Cost: $0-$50/month
Startup / Small Team
Hardware: Lambda Labs or RunPod (multi-GPU)
Framework: Axolotl or raw TRL + Accelerate
Tracking: WandB Teams
Cost: $200-$2,000/month
Enterprise
Hardware: AWS SageMaker, Azure ML, or GCP Vertex AI
Framework: Custom TRL scripts or managed fine-tuning
Tracking: WandB Enterprise or MLflow
Cost: $2,000+/month
Just Want It Done
Closed models: OpenAI fine-tuning API (simplest)
Open models: Together AI fine-tuning API
Tracking: Built into the platform
Cost: Pay per training token
Start simple, scale up. Begin with Colab + Unsloth to learn. Move to Axolotl or TRL when you need more control. Use cloud GPUs when local hardware is insufficient. Use managed services when you don't want to manage infrastructure. The tools are mature enough that the bottleneck is usually data quality, not infrastructure.