Ch 2 — Experiment Tracking

MLflow, Weights & Biases, experiment logging, hyperparameter tracking, and reproducibility
Why Track Experiments?
The spreadsheet-of-results anti-pattern
The Problem
Every data scientist has been here: you run 50 training experiments over two weeks. You tweak learning rates, architectures, data preprocessing. Results go into a spreadsheet, a Slack message, or worse — nowhere. Three months later, your manager asks: “Can you reproduce the model we shipped?” And you can’t. You don’t remember which hyperparameters, which data version, which code commit produced that model. Experiment tracking solves this by automatically logging every training run with its parameters, metrics, code version, data version, and output artifacts.
Without vs With Tracking
// Without experiment tracking
Run 1: lr=0.01,  acc=0.82   // in Slack
Run 2: lr=0.001, acc=0.87   // in notebook
Run 3: lr=???,   acc=0.91   // lost
// 3 months later: "which was Run 3?"

// With experiment tracking
Run 3:
  params:   {lr: 0.0005, batch: 64, epochs: 50}
  metrics:  {acc: 0.91, loss: 0.23, f1: 0.89}
  code:     git commit a3f7b2c
  data:     v2.3 (DVC hash: 8e4f...)
  artifact: model.pt (sha256: c9d1...)
  env:      Python 3.11, PyTorch 2.3
Key insight: Experiment tracking is the foundation of reproducibility. If you can’t reproduce a result, you can’t debug it, improve it, or trust it. It’s the single most impactful MLOps practice to adopt first.
What to Log
The five pillars of experiment metadata
Five Pillars
A complete experiment record captures five things: (1) Parameters — hyperparameters, model architecture choices, preprocessing settings. (2) Metrics — training loss, validation accuracy, F1, latency, throughput — logged over time, not just final values. (3) Artifacts — model weights, plots, confusion matrices, sample predictions. (4) Code version — the exact Git commit (or notebook snapshot) that produced the run. (5) Environment — Python version, library versions, hardware (GPU type, memory). Missing any one of these makes reproduction unreliable.
Logging Checklist
// What to log for every experiment
1. Parameters: learning_rate, batch_size, epochs, model_type,
   hidden_dim, dropout, optimizer, scheduler, seed
2. Metrics (over time): train_loss, val_loss, val_accuracy,
   val_f1, val_auc, inference_latency_ms
3. Artifacts: model.pt, confusion_matrix.png,
   predictions_sample.csv, training_curves.png
4. Code: git_commit_hash, git_branch, git_diff
5. Environment: python_version, torch_version, cuda_version,
   gpu_type, requirements.txt
Key insight: Log metrics over time (every epoch or every N steps), not just the final value. Training curves reveal overfitting, learning rate issues, and convergence problems that a single number hides.
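Before any tracking tool is involved, the five pillars can be assembled as a plain dictionary, which is essentially what every tracker stores under the hood. A minimal sketch (`build_run_record` is a hypothetical helper; the commit hash is passed in rather than detected so the snippet runs outside a Git repository):

```python
import json
import platform
import sys


def build_run_record(params, metrics_history, artifacts, git_commit):
    """Assemble the five pillars of a complete experiment record."""
    return {
        "parameters": params,                  # pillar 1: hyperparameters
        "metrics": metrics_history,            # pillar 2: values over time
        "artifacts": artifacts,                # pillar 3: output file paths
        "code": {"git_commit": git_commit},    # pillar 4: code version
        "environment": {                       # pillar 5: runtime environment
            "python_version": platform.python_version(),
            "platform": sys.platform,
        },
    }


record = build_run_record(
    params={"lr": 0.001, "batch_size": 64},
    metrics_history={"val_acc": [0.71, 0.84, 0.91]},  # one value per epoch
    artifacts=["model.pt", "confusion_matrix.png"],
    git_commit="a3f7b2c",
)
print(json.dumps(record, indent=2))
```

Writing the record down this way makes the "missing any one pillar" failure mode concrete: drop any key and a rerun becomes guesswork.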
MLflow
The open-source standard for experiment tracking
Overview
MLflow (by Databricks) is the most widely adopted open-source experiment tracking platform. It has four components: MLflow Tracking (log parameters, metrics, artifacts), MLflow Projects (package code for reproducibility), MLflow Models (standard model packaging format), and MLflow Model Registry (stage and promote models). MLflow is free, self-hostable, and integrates with every major ML framework. Its API is straightforward: mlflow.log_param(), mlflow.log_metric(), mlflow.log_artifact(). The UI is functional but basic compared to W&B.
MLflow Example
import mlflow
import subprocess

mlflow.set_experiment("fraud-detection")

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("lr", 0.001)
    mlflow.log_param("batch_size", 64)
    mlflow.log_param("model", "resnet50")

    # Log metrics over time
    for epoch in range(50):
        train(model)
        loss, acc = evaluate(model)
        mlflow.log_metric("val_loss", loss, step=epoch)
        mlflow.log_metric("val_acc", acc, step=epoch)

    # Log model artifact
    mlflow.pytorch.log_model(model, "model")

    # Tag the run with the current git commit
    sha = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
    mlflow.set_tag("mlflow.source.git.commit", sha)
Key insight: MLflow’s autolog() feature can automatically capture parameters and metrics for PyTorch, TensorFlow, scikit-learn, and XGBoost with a single line: mlflow.autolog().
Weights & Biases (W&B)
Best-in-class visualization and collaboration
Overview
Weights & Biases is a SaaS-first experiment tracking platform with the best visualization UI in the industry. Key features: real-time dashboards with automatic training curve plots, W&B Sweeps for hyperparameter optimization (grid, random, Bayesian), W&B Tables for logging and comparing model predictions alongside inputs and ground truth, and W&B Artifacts for dataset and model versioning with full lineage tracking. The free tier is generous for individuals and small teams. The trade-off: it’s SaaS-first, so your data goes to W&B servers (a self-hosted option exists but requires significant infrastructure).
W&B Example
import wandb

wandb.init(
    project="fraud-detection",
    config={
        "lr": 0.001,
        "batch_size": 64,
        "model": "resnet50",
    },
)

for epoch in range(50):
    train(model)
    loss, acc = evaluate(model)
    wandb.log({
        "val_loss": loss,
        "val_acc": acc,
        "epoch": epoch,
    })

# Log a predictions table (rows are added with table.add_data)
table = wandb.Table(columns=["input", "pred", "true"])
wandb.log({"predictions": table})

wandb.finish()
Key insight: W&B’s killer feature is W&B Sweeps — define a search space in YAML, and W&B will launch and coordinate hyperparameter searches across multiple machines using Bayesian optimization.
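The sweep definition W&B reads from YAML can equivalently be built as a Python dict. A minimal sketch of a Bayesian search space (key names follow the W&B sweep schema as commonly documented; the launch calls are left as comments because they require a W&B account, and `train_fn` is a hypothetical training function):

```python
# Sweep definition mirroring the YAML form; "bayes" selects
# Bayesian optimization over the parameter space.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_acc", "goal": "maximize"},
    "parameters": {
        "lr": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-1},
        "batch_size": {"values": [16, 32, 64, 128]},
        "dropout": {"min": 0.1, "max": 0.5},
    },
}

# To launch (requires a W&B account and a train_fn that calls wandb.init):
# sweep_id = wandb.sweep(sweep_config, project="fraud-detection")
# wandb.agent(sweep_id, function=train_fn, count=50)
```

Each agent pulls the next suggested configuration from the sweep server, which is what lets the search coordinate across multiple machines.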
MLflow vs. W&B: Choosing
Open-source flexibility vs. SaaS polish
Decision Framework
Choose MLflow if: you need self-hosting for data privacy, you’re in the Databricks ecosystem, you want full control over infrastructure, or budget is a hard constraint. Choose W&B if: you prioritize visualization and team collaboration, you want hyperparameter sweeps built in, you’re comfortable with SaaS, or you want minimal setup. Other options: Neptune.ai (strong metadata management), Comet ML (good code diffing), ClearML (open-source with orchestration). Many teams use both — MLflow for the model registry and W&B for visualization.
Comparison
// MLflow vs W&B
              MLflow         W&B
License:      Open-source    Freemium SaaS
Hosting:      Self-hosted    Cloud (+ self)
UI:           Functional     Best-in-class
Sweeps:       No             Yes (Bayesian)
Tables:       No             Yes
Autolog:      Yes            Yes
Registry:     Yes            Yes (Artifacts)
Cost:         Free           Free tier, paid
Ecosystem:    Databricks     Framework-agnostic
Setup:        Medium         Low
Key insight: The best experiment tracker is the one your team actually uses. Start with one tool, enforce it as a team standard, and don’t let anyone go back to spreadsheets or Slack messages for tracking results.
Hyperparameter Optimization
Systematic search beats random guessing
Search Strategies
Hyperparameter tuning is a core part of experimentation. Grid search tries every combination — exhaustive but exponentially expensive. Random search (Bergstra & Bengio, 2012) samples randomly and is surprisingly effective because most hyperparameters have low “effective dimensionality” — only 1–2 matter most. Bayesian optimization builds a probabilistic model of the objective function and intelligently picks the next point to evaluate. Optuna (open-source) and W&B Sweeps are the most popular tools. For LLMs, hyperparameter tuning is less common — prompt engineering and RAG configuration are the new “tuning.”
Optuna Example
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    batch = trial.suggest_categorical("batch", [16, 32, 64, 128])
    dropout = trial.suggest_float("dropout", 0.1, 0.5)
    model = build_model(lr, batch, dropout)
    acc = train_and_evaluate(model)
    return acc

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

print(study.best_params)
# {'lr': 0.00037, 'batch': 64, 'dropout': 0.2}
Key insight: Random search with 60 trials will find a configuration within the top 5% of the search space with 95% probability. Start with random search before investing in Bayesian optimization.
Reproducibility
Making experiments deterministic and repeatable
The Reproducibility Checklist
Full reproducibility requires controlling five sources of randomness: (1) Random seeds — set seeds for Python, NumPy, PyTorch, and CUDA. (2) Data ordering — shuffle with a fixed seed; version the dataset. (3) Code version — pin to a Git commit. (4) Environment — use Docker or conda lock files to freeze all dependencies. (5) Hardware — GPU floating-point operations are non-deterministic by default; use torch.use_deterministic_algorithms(True) for exact reproducibility (at a performance cost). In practice, “close enough” reproducibility (within 0.1% of metrics) is usually sufficient.
Reproducibility Setup
import os, random, numpy, torch

def set_seed(seed=42):
    random.seed(seed)
    numpy.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# For exact reproducibility (slower); on CUDA >= 10.2 this also
# requires CUBLAS_WORKSPACE_CONFIG, set before the CUDA context exists
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False

# Pin environment
# pip freeze > requirements.txt
# OR use conda-lock / poetry.lock
# OR use Docker with pinned base image

# Version data
# dvc add data/train.csv
# git add data/train.csv.dvc
# git commit -m "data v2.3"
Key insight: Perfect bit-for-bit reproducibility across different GPUs is nearly impossible due to non-deterministic CUDA operations. Aim for “statistical reproducibility” — results within a small tolerance across runs.
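That "statistical reproducibility" criterion is easy to make concrete: compare a rerun's metrics to the original within a relative tolerance. A minimal sketch (`is_reproduced` is a hypothetical helper; the default 1e-3 tolerance is an assumption and should be tuned per metric):

```python
def is_reproduced(original, rerun, rel_tol=1e-3):
    """True if every metric in the rerun is within a relative
    tolerance of the original run's value."""
    return all(
        abs(rerun[name] - value) <= rel_tol * max(abs(value), 1e-12)
        for name, value in original.items()
    )


baseline = {"val_acc": 0.9100, "val_loss": 0.2300}
rerun = {"val_acc": 0.9101, "val_loss": 0.2302}
print(is_reproduced(baseline, rerun))  # True
```

A check like this can run in CI after a scheduled retraining job, turning a silent reproducibility drift into a failing build.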
Experiment Tracking Best Practices
Lessons from teams that do it well
Best Practices
1. Track from day one — don’t wait until you have “real” experiments.
2. Use autologging — MLflow and W&B both support automatic parameter/metric capture.
3. Tag experiments — add tags like “baseline,” “production,” “ablation” for easy filtering.
4. Log negative results — failed experiments are valuable data; they prevent others from repeating mistakes.
5. Review as a team — hold weekly experiment review meetings where the team looks at the tracking dashboard together.
6. Connect to CI — automated training runs should log to the same tracker as manual experiments.
Team Workflow
// Experiment tracking team workflow
1. Naming Convention:
   {project}/{experiment}/{run_name}
   fraud/baseline/lr-sweep-v2
2. Required Tags:
   type: [baseline|ablation|sweep|prod]
   owner: [name]
   dataset_version: [v2.3]
3. Weekly Review:
   Dashboard walkthrough (15 min)
   Top 3 runs → discuss
   Failed runs → document why
4. Promotion Flow:
   Experiment → best run → register model →
   staging → validation → production
Key insight: The experiment tracker becomes the team’s “lab notebook.” When a new team member joins, they can browse the full history of what was tried, what worked, and what didn’t — invaluable institutional knowledge.
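A naming convention and required tags only stick if they are enforced in code at run-creation time rather than by convention alone. A minimal sketch (`make_run_name` and `validate_tags` are hypothetical helpers, not part of any tracker's API):

```python
REQUIRED_TAGS = {"type", "owner", "dataset_version"}
RUN_TYPES = {"baseline", "ablation", "sweep", "prod"}


def make_run_name(project, experiment, run_name):
    """Build a run name following {project}/{experiment}/{run_name}."""
    return f"{project}/{experiment}/{run_name}"


def validate_tags(tags):
    """Reject a run whose tags are missing or use an unknown type."""
    missing = REQUIRED_TAGS - tags.keys()
    if missing:
        raise ValueError(f"missing required tags: {sorted(missing)}")
    if tags["type"] not in RUN_TYPES:
        raise ValueError(f"unknown run type: {tags['type']}")
    return True


print(make_run_name("fraud", "baseline", "lr-sweep-v2"))
# fraud/baseline/lr-sweep-v2
validate_tags({"type": "baseline", "owner": "alice", "dataset_version": "v2.3"})
```

Calling a guard like this in a thin wrapper around the tracker's run-creation API means a mistagged run fails loudly instead of quietly polluting the team's shared history.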