Packaging Options
A “model” in production is more than just weights: it also includes preprocessing logic, tokenizers, configuration, and dependencies. Common packaging formats:
- MLflow Model: framework-agnostic; an MLmodel descriptor plus conda.yaml
- ONNX (Open Neural Network Exchange): converts PyTorch/TensorFlow models to a portable format for optimized inference
- TorchScript: serialized PyTorch models that can run without a Python interpreter
- SavedModel: TensorFlow’s native format
- Docker containers: package everything, including the runtime
For LLMs, models are typically served via specialized inference servers (vLLM, TGI) rather than generic packaging.
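To make the “more than just weights” point concrete, here is a minimal, framework-free sketch of bundling weights, a preprocessor, and config metadata into one directory. All names here (save_bundle, load_bundle, the metadata keys) are hypothetical illustrations, not part of any real packaging format.

```python
# Hypothetical sketch: a "model" bundle = weights + preprocessing + metadata.
import json
import pickle
from pathlib import Path

def save_bundle(path, weights, preprocessor, metadata):
    """Write weights, preprocessing state, and config side by side."""
    p = Path(path)
    p.mkdir(parents=True, exist_ok=True)
    (p / "weights.pkl").write_bytes(pickle.dumps(weights))
    (p / "preprocessor.pkl").write_bytes(pickle.dumps(preprocessor))
    (p / "metadata.json").write_text(json.dumps(metadata))

def load_bundle(path):
    """Reload everything needed to serve, not just the weights."""
    p = Path(path)
    return (
        pickle.loads((p / "weights.pkl").read_bytes()),
        pickle.loads((p / "preprocessor.pkl").read_bytes()),
        json.loads((p / "metadata.json").read_text()),
    )

# Usage: package a toy "model" and reload it elsewhere.
save_bundle("fraud-detector-toy",
            weights={"w": [0.1, 0.2]},
            preprocessor={"scale": 10.0},
            metadata={"framework": "none", "version": "0.1"})
weights, prep, meta = load_bundle("fraud-detector-toy")
```

Real formats like MLflow do essentially this, plus a standard descriptor so tooling can load the bundle without knowing who wrote it.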
MLflow Model Format
# MLflow model directory structure
fraud-detector/
├── MLmodel               # metadata + flavors
├── conda.yaml            # environment
├── requirements.txt      # pip deps
├── python_model.pkl      # or model.pt
└── artifacts/
    └── preprocessor.pkl

# MLmodel file:
artifact_path: model
flavors:
  python_function:
    loader_module: mlflow.pytorch
    python_version: 3.11.0
  pytorch:
    model_data: model.pt
    pytorch_version: 2.3.0

# Load anywhere:
# mlflow.pyfunc.load_model("path/to/model")
Key insight: MLflow’s “flavors” system lets you save a model in its native format (PyTorch, sklearn, etc.) while also providing a generic pyfunc interface. This means any MLflow model can be loaded and served the same way, regardless of framework.
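The flavor pattern can be sketched without MLflow at all: two “native” models with different APIs served through one generic wrapper. GenericModel and both native classes below are illustrative names for the pattern, not MLflow’s actual classes.

```python
# Sketch of the "flavors" idea: any native model, one predict() interface.
class SklearnStyleModel:
    """Native flavor exposing sklearn-style predict()."""
    def predict(self, rows):
        return [sum(r) for r in rows]

class TorchStyleModel:
    """Native flavor exposing a torch-style forward()."""
    def forward(self, rows):
        return [max(r) for r in rows]

class GenericModel:
    """Uniform pyfunc-like interface regardless of the native API."""
    def __init__(self, native):
        self._native = native

    def predict(self, rows):
        # Dispatch to whatever entry point the native flavor exposes.
        if hasattr(self._native, "predict"):
            return self._native.predict(rows)
        return self._native.forward(rows)

# Both models are served identically through the generic interface.
for native in (SklearnStyleModel(), TorchStyleModel()):
    print(GenericModel(native).predict([[1, 2], [3, 4]]))
# → [3, 7] then [2, 4]
```

This is why downstream serving code only needs to know the generic interface: the framework-specific loading logic lives behind the wrapper.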