Ch 2 — Experiment Tracking

MLflow, Weights & Biases, experiment logging, hyperparameter tracking, and reproducibility
Why Track Experiments?
The spreadsheet-of-results anti-pattern
The Problem
Every data scientist has been here: you run 50 training experiments over two weeks. You tweak learning rates, architectures, data preprocessing. Results go into a spreadsheet, a Slack message, or worse — nowhere. Three months later, your manager asks: “Can you reproduce the model we shipped?” And you can’t. You don’t remember which hyperparameters, which data version, which code commit produced that model. Experiment tracking solves this by automatically logging every training run with its parameters, metrics, code version, data version, and output artifacts.
Without vs With Tracking
// Without experiment tracking
Run 1: lr=0.01,  acc=0.82   // in Slack
Run 2: lr=0.001, acc=0.87   // in notebook
Run 3: lr=???,   acc=0.91   // lost
// 3 months later: "which was Run 3?"

// With experiment tracking
Run 3:
  params:   {lr: 0.0005, batch: 64, epochs: 50}
  metrics:  {acc: 0.91, loss: 0.23, f1: 0.89}
  code:     git commit a3f7b2c
  data:     v2.3 (DVC hash: 8e4f...)
  artifact: model.pt (sha256: c9d1...)
  env:      Python 3.11, PyTorch 2.3
Key insight: Experiment tracking is the foundation of reproducibility. If you can’t reproduce a result, you can’t debug it, improve it, or trust it. It’s the single most impactful MLOps practice to adopt first.
What to Log
The five pillars of experiment metadata
Five Pillars
A complete experiment record captures five things: (1) Parameters — hyperparameters, model architecture choices, preprocessing settings. (2) Metrics — training loss, validation accuracy, F1, latency, throughput — logged over time, not just final values. (3) Artifacts — model weights, plots, confusion matrices, sample predictions. (4) Code version — the exact Git commit (or notebook snapshot) that produced the run. (5) Environment — Python version, library versions, hardware (GPU type, memory). Missing any one of these makes reproduction unreliable.
Logging Checklist
// What to log for every experiment
1. Parameters: learning_rate, batch_size, epochs, model_type,
   hidden_dim, dropout, optimizer, scheduler, seed
2. Metrics (over time): train_loss, val_loss, val_accuracy,
   val_f1, val_auc, inference_latency_ms
3. Artifacts: model.pt, confusion_matrix.png,
   predictions_sample.csv, training_curves.png
4. Code: git_commit_hash, git_branch, git_diff
5. Environment: python_version, torch_version, cuda_version,
   gpu_type, requirements.txt
Key insight: Log metrics over time (every epoch or every N steps), not just the final value. Training curves reveal overfitting, learning rate issues, and convergence problems that a single number hides.
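Before any tracking tool is involved, the five pillars can be assembled as a plain dictionary, which is essentially what every tracker stores under the hood. A minimal sketch (`build_run_record` is a hypothetical helper; the commit hash is passed in rather than detected so the snippet runs outside a Git repository):

```python
import json
import platform
import sys


def build_run_record(params, metrics_history, artifacts, git_commit):
    """Assemble the five pillars of a complete experiment record."""
    return {
        "parameters": params,                  # pillar 1: hyperparameters
        "metrics": metrics_history,            # pillar 2: values over time
        "artifacts": artifacts,                # pillar 3: output file paths
        "code": {"git_commit": git_commit},    # pillar 4: code version
        "environment": {                       # pillar 5: runtime environment
            "python_version": platform.python_version(),
            "platform": sys.platform,
        },
    }


record = build_run_record(
    params={"lr": 0.001, "batch_size": 64},
    metrics_history={"val_acc": [0.71, 0.84, 0.91]},  # one value per epoch
    artifacts=["model.pt", "confusion_matrix.png"],
    git_commit="a3f7b2c",
)
print(json.dumps(record, indent=2))
```

Writing the record down this way makes the "missing any one pillar" failure mode concrete: drop any key and a rerun becomes guesswork.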
MLflow
The open-source standard for experiment tracking
Overview
MLflow (by Databricks) is the most widely adopted open-source experiment tracking platform. It has four components: MLflow Tracking (log parameters, metrics, artifacts), MLflow Projects (package code for reproducibility), MLflow Models (standard model packaging format), and MLflow Model Registry (stage and promote models). MLflow is free, self-hostable, and integrates with every major ML framework. Its API is straightforward: mlflow.log_param(), mlflow.log_metric(), mlflow.log_artifact(). The UI is functional but basic compared to W&B.
MLflow Example
import mlflow
import subprocess

mlflow.set_experiment("fraud-detection")

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("lr", 0.001)
    mlflow.log_param("batch_size", 64)
    mlflow.log_param("model", "resnet50")

    # Log metrics over time
    for epoch in range(50):
        train(model)
        loss, acc = evaluate(model)
        mlflow.log_metric("val_loss", loss, step=epoch)
        mlflow.log_metric("val_acc", acc, step=epoch)

    # Log model artifact
    mlflow.pytorch.log_model(model, "model")

    # Tag the run with the current git commit
    sha = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
    mlflow.set_tag("mlflow.source.git.commit", sha)
Key insight: MLflow’s autolog() feature can automatically capture parameters and metrics for PyTorch, TensorFlow, scikit-learn, and XGBoost with a single line: mlflow.autolog().
Weights & Biases (W&B)
Best-in-class visualization and collaboration
Overview
Weights & Biases is a SaaS-first experiment tracking platform with the best visualization UI in the industry. Key features: real-time dashboards with automatic training curve plots, W&B Sweeps for hyperparameter optimization (grid, random, Bayesian), W&B Tables for logging and comparing model predictions alongside inputs and ground truth, and W&B Artifacts for dataset and model versioning with full lineage tracking. The free tier is generous for individuals and small teams. The trade-off: it’s SaaS-first, so your data goes to W&B servers (a self-hosted option exists but requires significant infrastructure).
W&B Example
import wandb

wandb.init(
    project="fraud-detection",
    config={
        "lr": 0.001,
        "batch_size": 64,
        "model": "resnet50",
    },
)

for epoch in range(50):
    train(model)
    loss, acc = evaluate(model)
    wandb.log({
        "val_loss": loss,
        "val_acc": acc,
        "epoch": epoch,
    })

# Log a predictions table (rows are added with table.add_data)
table = wandb.Table(columns=["input", "pred", "true"])
wandb.log({"predictions": table})

wandb.finish()
Key insight: W&B’s killer feature is W&B Sweeps — define a search space in YAML, and W&B will launch and coordinate hyperparameter searches across multiple machines using Bayesian optimization.
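The sweep definition W&B reads from YAML can equivalently be built as a Python dict. A minimal sketch of a Bayesian search space (key names follow the W&B sweep schema as commonly documented; the launch calls are left as comments because they require a W&B account, and `train_fn` is a hypothetical training function):

```python
# Sweep definition mirroring the YAML form; "bayes" selects
# Bayesian optimization over the parameter space.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_acc", "goal": "maximize"},
    "parameters": {
        "lr": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-1},
        "batch_size": {"values": [16, 32, 64, 128]},
        "dropout": {"min": 0.1, "max": 0.5},
    },
}

# To launch (requires a W&B account and a train_fn that calls wandb.init):
# sweep_id = wandb.sweep(sweep_config, project="fraud-detection")
# wandb.agent(sweep_id, function=train_fn, count=50)
```

Each agent pulls the next suggested configuration from the sweep server, which is what lets the search coordinate across multiple machines.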
MLflow vs. W&B: Choosing
Open-source flexibility vs. SaaS polish
Decision Framework
Choose MLflow if: you need self-hosting for data privacy, you’re in the Databricks ecosystem, you want full control over infrastructure, or budget is a hard constraint. Choose W&B if: you prioritize visualization and team collaboration, you want hyperparameter sweeps built in, you’re comfortable with SaaS, or you want minimal setup. Other options: Neptune.ai (strong metadata management), Comet ML (good code diffing), ClearML (open-source with orchestration). Many teams use both — MLflow for the model registry and W&B for visualization.
Comparison
// MLflow vs W&B
              MLflow         W&B
License:      Open-source    Freemium SaaS
Hosting:      Self-hosted    Cloud (+ self)
UI:           Functional     Best-in-class
Sweeps:       No             Yes (Bayesian)
Tables:       No             Yes
Autolog:      Yes            Yes
Registry:     Yes            Yes (Artifacts)
Cost:         Free           Free tier, paid
Ecosystem:    Databricks     Framework-agnostic
Setup:        Medium         Low
Key insight: The best experiment tracker is the one your team actually uses. Start with one tool, enforce it as a team standard, and don’t let anyone go back to spreadsheets or Slack messages for tracking results.
Hyperparameter Optimization
Systematic search beats random guessing
Search Strategies
Hyperparameter tuning is a core part of experimentation. Grid search tries every combination — exhaustive but exponentially expensive. Random search (Bergstra & Bengio, 2012) samples randomly and is surprisingly effective because most hyperparameters have low “effective dimensionality” — only 1–2 matter most. Bayesian optimization builds a probabilistic model of the objective function and intelligently picks the next point to evaluate. Optuna (open-source) and W&B Sweeps are the most popular tools. For LLMs, hyperparameter tuning is less common — prompt engineering and RAG configuration are the new “tuning.”
Optuna Example
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    batch = trial.suggest_categorical("batch", [16, 32, 64, 128])
    dropout = trial.suggest_float("dropout", 0.1, 0.5)
    model = build_model(lr, batch, dropout)
    acc = train_and_evaluate(model)
    return acc

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

print(study.best_params)
# {'lr': 0.00037, 'batch': 64, 'dropout': 0.2}
Key insight: Random search with 60 trials will find a configuration within the top 5% of the search space with 95% probability. Start with random search before investing in Bayesian optimization.
Reproducibility
Making experiments deterministic and repeatable
The Reproducibility Checklist
Full reproducibility requires controlling five sources of randomness: (1) Random seeds — set seeds for Python, NumPy, PyTorch, and CUDA. (2) Data ordering — shuffle with a fixed seed; version the dataset. (3) Code version — pin to a Git commit. (4) Environment — use Docker or conda lock files to freeze all dependencies. (5) Hardware — GPU floating-point operations are non-deterministic by default; use torch.use_deterministic_algorithms(True) for exact reproducibility (at a performance cost). In practice, “close enough” reproducibility (within 0.1% of metrics) is usually sufficient.
Reproducibility Setup
import os, random, numpy, torch

def set_seed(seed=42):
    random.seed(seed)
    numpy.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# For exact reproducibility (slower); on CUDA >= 10.2 this also
# requires CUBLAS_WORKSPACE_CONFIG, set before the CUDA context exists
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False

# Pin environment
# pip freeze > requirements.txt
# OR use conda-lock / poetry.lock
# OR use Docker with pinned base image

# Version data
# dvc add data/train.csv
# git add data/train.csv.dvc
# git commit -m "data v2.3"
Key insight: Perfect bit-for-bit reproducibility across different GPUs is nearly impossible due to non-deterministic CUDA operations. Aim for “statistical reproducibility” — results within a small tolerance across runs.
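That "statistical reproducibility" criterion is easy to make concrete: compare a rerun's metrics to the original within a relative tolerance. A minimal sketch (`is_reproduced` is a hypothetical helper; the default 1e-3 tolerance is an assumption and should be tuned per metric):

```python
def is_reproduced(original, rerun, rel_tol=1e-3):
    """True if every metric in the rerun is within a relative
    tolerance of the original run's value."""
    return all(
        abs(rerun[name] - value) <= rel_tol * max(abs(value), 1e-12)
        for name, value in original.items()
    )


baseline = {"val_acc": 0.9100, "val_loss": 0.2300}
rerun = {"val_acc": 0.9101, "val_loss": 0.2302}
print(is_reproduced(baseline, rerun))  # True
```

A check like this can run in CI after a scheduled retraining job, turning a silent reproducibility drift into a failing build.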
Experiment Tracking Best Practices
Lessons from teams that do it well
Best Practices
1. Track from day one — don’t wait until you have “real” experiments.
2. Use autologging — MLflow and W&B both support automatic parameter/metric capture.
3. Tag experiments — add tags like “baseline,” “production,” “ablation” for easy filtering.
4. Log negative results — failed experiments are valuable data; they prevent others from repeating mistakes.
5. Review as a team — hold weekly experiment review meetings where the team looks at the tracking dashboard together.
6. Connect to CI — automated training runs should log to the same tracker as manual experiments.
Team Workflow
// Experiment tracking team workflow
1. Naming Convention:
   {project}/{experiment}/{run_name}
   fraud/baseline/lr-sweep-v2
2. Required Tags:
   type: [baseline|ablation|sweep|prod]
   owner: [name]
   dataset_version: [v2.3]
3. Weekly Review:
   Dashboard walkthrough (15 min)
   Top 3 runs → discuss
   Failed runs → document why
4. Promotion Flow:
   Experiment → best run → register model →
   staging → validation → production
Key insight: The experiment tracker becomes the team’s “lab notebook.” When a new team member joins, they can browse the full history of what was tried, what worked, and what didn’t — invaluable institutional knowledge.
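A naming convention and required tags only stick if they are enforced in code at run-creation time rather than by convention alone. A minimal sketch (`make_run_name` and `validate_tags` are hypothetical helpers, not part of any tracker's API):

```python
REQUIRED_TAGS = {"type", "owner", "dataset_version"}
RUN_TYPES = {"baseline", "ablation", "sweep", "prod"}


def make_run_name(project, experiment, run_name):
    """Build a run name following {project}/{experiment}/{run_name}."""
    return f"{project}/{experiment}/{run_name}"


def validate_tags(tags):
    """Reject a run whose tags are missing or use an unknown type."""
    missing = REQUIRED_TAGS - tags.keys()
    if missing:
        raise ValueError(f"missing required tags: {sorted(missing)}")
    if tags["type"] not in RUN_TYPES:
        raise ValueError(f"unknown run type: {tags['type']}")
    return True


print(make_run_name("fraud", "baseline", "lr-sweep-v2"))
# fraud/baseline/lr-sweep-v2
validate_tags({"type": "baseline", "owner": "alice", "dataset_version": "v2.3"})
```

Calling a guard like this in a thin wrapper around the tracker's run-creation API means a mistagged run fails loudly instead of quietly polluting the team's shared history.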