Ch 4 — Data Pipelines & Feature Stores

ETL for ML, Feast, Tecton, feature engineering at scale, and data quality checks
High Level
Raw Data → Transform → Validate → Features → Store → Serve
Data Pipelines for ML
ETL is different when the consumer is a model
ML-Specific ETL
Traditional ETL (Extract, Transform, Load) moves data for dashboards and reports. ML pipelines have different requirements: they need point-in-time correctness (features must reflect what was known at prediction time, not the future), reproducibility (same input → same features every time), dual serving (same features for training and inference), and freshness (real-time features for online predictions). Common orchestrators: Apache Airflow (the standard for batch pipelines), Prefect (modern Python-native alternative), Dagster (asset-based, great for data-aware orchestration). For streaming: Apache Kafka + Flink or Spark Structured Streaming.
ML Pipeline Architecture
// ML data pipeline

Sources:
  Databases (PostgreSQL, MongoDB)
  APIs (REST, webhooks)
  Streams (Kafka, Kinesis)
  Files (S3, GCS)

Orchestrator: Airflow / Prefect / Dagster
  Schedule: daily, hourly, or event-triggered
  DAG: extract → validate → transform → load

Transform: dbt / Spark / Pandas
  Raw → cleaned → features
  Point-in-time joins (critical!)

Output:
  Offline store → training (Parquet/BigQuery)
  Online store  → serving (Redis/DynamoDB)
Key insight: The #1 ML pipeline bug is data leakage through time. If your training features include information from the future (relative to the prediction timestamp), your model will look great in testing but fail in production.
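The point-in-time join that prevents this leakage can be sketched in a few lines of plain Python (a toy illustration, not a library API; entity names and integer timestamps are invented): for each prediction timestamp, use only the latest feature value observed at or before it.

```python
from bisect import bisect_right

# Feature history per entity: (observed_at, value), sorted by timestamp.
# Timestamps are plain integers (e.g. epoch days) for simplicity.
history = {
    "user_1": [(1, 100.0), (5, 120.0), (9, 90.0)],
}

def point_in_time_value(entity, prediction_ts):
    """Return the latest feature value known at prediction_ts.

    Using values observed *after* prediction_ts would leak the
    future into the training features.
    """
    rows = history[entity]
    timestamps = [ts for ts, _ in rows]
    i = bisect_right(timestamps, prediction_ts)  # rows known by prediction_ts
    return rows[i - 1][1] if i > 0 else None

print(point_in_time_value("user_1", 6))  # uses the value observed at ts=5
print(point_in_time_value("user_1", 0))  # nothing known yet -> None
```

A naive join against the latest row in `history` would instead hand the model the ts=9 value for a ts=6 prediction — exactly the future-information bug described above.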
Training-Serving Skew
The silent killer of ML systems
The Problem
Training-serving skew occurs when the features used during training differ from those used during inference. This happens when: training uses a Python script but serving uses a Java service (different implementations of the same logic), training uses historical data but serving uses real-time data (different data sources), or training computes features in batch but serving computes them on-the-fly (different computation paths). Even tiny differences — like a different rounding method or timezone handling — can cause significant prediction errors. Feature stores solve this by providing a single source of truth for features in both training and serving.
Skew Examples
// Training-serving skew examples

1. Code skew:
   Training: Python → np.mean(values)
   Serving:  Java   → average(values)
   // Different NaN handling!

2. Data skew:
   Training: user_age from data warehouse
   Serving:  user_age from user profile API
   // Warehouse updates daily, API is live

3. Time skew:
   Training: avg_spend_30d at batch time
   Serving:  avg_spend_30d at request time
   // Different windows!

Solution: feature store
  One definition → offline + online
  Same code, same data, same result
Key insight: Training-serving skew is insidious because the model still returns predictions — they’re just wrong. There’s no error message. The only way to catch it is monitoring prediction distributions and comparing them to training distributions.
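The code-skew case is easy to reproduce in miniature (a toy illustration — these two functions stand in for the Python and Java implementations; neither is a real library call): two "mean" functions that differ only in NaN handling produce different feature values from the same input, with no error raised anywhere.

```python
import math

values = [10.0, 20.0, float("nan"), 30.0]

def mean_propagate_nan(xs):
    # Naive sum/len: a single NaN poisons the result.
    return sum(xs) / len(xs)

def mean_skip_nan(xs):
    # A "skip missing values" implementation in another service.
    clean = [x for x in xs if not math.isnan(x)]
    return sum(clean) / len(clean)

training_feature = mean_propagate_nan(values)  # nan
serving_feature = mean_skip_nan(values)        # 20.0

# Same input, same "mean" — different feature values, no exception.
print(training_feature, serving_feature)
```

Neither function is wrong in isolation; the bug only exists because training and serving disagree — which is why a single shared feature definition is the fix.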
What Is a Feature Store?
A centralized system for managing ML features
Feature Store Architecture
A feature store is a centralized system that manages the lifecycle of ML features: definition, computation, storage, and serving. It has three layers: (1) Transformation layer — defines how features are computed from raw data (SQL, Spark, or Python). (2) Storage layer — an offline store (data warehouse/lake for training) and an online store (low-latency key-value store for serving). (3) Serving layer — APIs to retrieve features for both training (batch) and inference (real-time). The key guarantee: the same feature definition produces the same value in both training and serving, eliminating training-serving skew.
Feature Store Architecture
// Feature store architecture
┌──────────────────────────────┐
│     Feature Definitions      │
│    (SQL / Spark / Python)    │
└──────────────┬───────────────┘
               ▼
┌──────────────────────────────┐
│        Offline Store         │
│  BigQuery / Snowflake / S3   │
│  → Training data (batch)     │
├──────────────────────────────┤
│         Online Store         │
│ Redis / DynamoDB / Cassandra │
│  → Serving data (real-time)  │
└──────────────┬───────────────┘
               ▼
┌──────────────────────────────┐
│         Serving API          │
│   get_features(entity_id)    │
│  → Same result, both paths   │
└──────────────────────────────┘
Key insight: The offline store holds the full history (for training with point-in-time joins), while the online store holds only the latest values (for low-latency serving). Both are computed from the same feature definitions.
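The offline/online split can be sketched as a toy class (illustrative only — the class and method names are invented, not a real feature-store API): one write path appends to the full offline history and overwrites the latest online value, so both stores are always derived from the same definition.

```python
from collections import defaultdict

class ToyFeatureStore:
    """Toy sketch: one write path feeds both stores."""

    def __init__(self):
        self.offline = defaultdict(list)  # full history, for training joins
        self.online = {}                  # latest value only, for serving

    def write(self, entity_id, feature, value, ts):
        self.offline[(entity_id, feature)].append((ts, value))
        self.online[(entity_id, feature)] = value

    def get_online_features(self, entity_id, feature):
        # O(1) lookup: what a Redis/DynamoDB-backed store optimizes for.
        return self.online[(entity_id, feature)]

    def get_historical_features(self, entity_id, feature):
        # Full timestamped history: what point-in-time joins need.
        return sorted(self.offline[(entity_id, feature)])

fs = ToyFeatureStore()
fs.write("user_1", "avg_spend_30d", 100.0, ts=1)
fs.write("user_1", "avg_spend_30d", 120.0, ts=2)

print(fs.get_online_features("user_1", "avg_spend_30d"))      # latest only
print(fs.get_historical_features("user_1", "avg_spend_30d"))  # full history
```

Real systems add materialization jobs, TTLs, and point-in-time join logic on top, but the storage asymmetry — history offline, latest-value online — is exactly this.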
Feast: Open-Source Feature Store
The most widely adopted open-source feature store
Feast Overview
Feast (Feature Store) is an open-source feature store originally developed at Gojek and now a community-maintained project, with Tecton among its major contributors. It provides: feature definitions in Python (declarative), offline serving via point-in-time joins against your data warehouse, online serving via Redis/DynamoDB/SQLite for low-latency lookups, and materialization (syncing features from the offline store to the online store). Feast is lightweight — it doesn’t run its own compute; it integrates with your existing data infrastructure. You define features; Feast handles storage, serving, and consistency.
Feast Example
# feature_repo/features.py
from datetime import timedelta

from feast import Entity, FeatureView, Field
from feast.types import Float32, Int64

user = Entity(name="user_id", join_keys=["user_id"])

user_features = FeatureView(
    name="user_features",
    entities=[user],
    schema=[
        Field(name="avg_spend_30d", dtype=Float32),
        Field(name="txn_count_7d", dtype=Int64),
        Field(name="days_since_signup", dtype=Int64),
    ],
    source=bigquery_source,  # a BigQuerySource defined elsewhere in the repo
    ttl=timedelta(days=1),
)

# Training: point-in-time join
training_df = store.get_historical_features(
    entity_df=entities_with_timestamps,
    features=["user_features:avg_spend_30d"],
).to_df()

# Serving: real-time lookup
features = store.get_online_features(
    features=["user_features:avg_spend_30d"],
    entity_rows=[{"user_id": 123}],
).to_dict()
Key insight: Feast’s get_historical_features() performs point-in-time joins automatically — for each entity at each timestamp, it returns the feature values that were known at that time, preventing data leakage.
Tecton: Managed Feature Platform
Enterprise-grade feature store with built-in compute
Tecton vs Feast
Tecton is a managed feature platform (commercial, founded by the team behind Uber’s Michelangelo ML platform; the company has also been a primary steward of Feast) that goes beyond Feast in several ways: built-in compute (Tecton runs the transformations; Feast relies on external compute), real-time features (stream processing from Kafka/Kinesis with sub-second latency), an aggregation engine (optimized time-windowed aggregations like “average spend in last 30 days”), and feature monitoring (drift detection on feature values). Tecton is the right choice for teams that need real-time streaming features and enterprise SLAs and don’t want to manage infrastructure. Feast is better for teams that want open-source flexibility and already have compute infrastructure.
Feast vs Tecton
// Feast vs Tecton
              Feast            Tecton
License:      Open-source      Commercial
Compute:      External         Built-in
Streaming:    Limited          Native (Kafka)
Aggregation:  Manual           Optimized engine
Monitoring:   No               Yes (drift)
Infra:        Self-managed     Fully managed
Cost:         Free             $$$ (enterprise)
Best for:     Small teams      Enterprise ML

// Other options:
//   Databricks Feature Store (Databricks)
//   Vertex AI Feature Store (GCP)
//   SageMaker Feature Store (AWS)
Key insight: If you’re on a major cloud platform, consider their built-in feature store first (Databricks, Vertex AI, SageMaker). They integrate tightly with the rest of the ML stack and reduce operational overhead.
Data Quality & Validation
Catching bad data before it reaches the model
Data Validation Tools
Bad data is the #1 cause of ML failures in production. Data validation catches issues before they reach the model: Great Expectations (the most popular open-source data validation framework — define “expectations” like “column X should never be null” or “values should be between 0 and 1”), Pandera (lightweight schema validation for Pandas DataFrames), TensorFlow Data Validation (TFDV) (detects anomalies and schema drift), and dbt tests (SQL-based data quality checks in your transformation pipeline). Run these checks at every stage: ingestion, transformation, and before training.
Great Expectations Example
import great_expectations as gx

context = gx.get_context()
ds = context.sources.add_pandas("train")
asset = ds.add_dataframe_asset("data")

# Define expectations (API shown is GX's fluent interface;
# exact names vary across Great Expectations versions)
suite = context.add_expectation_suite("train_suite")
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="user_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="age", min_value=0, max_value=120
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnMeanToBeBetween(
        column="spend", min_value=10, max_value=500
    )
)

# Run validation → pass/fail
# (assumes a checkpoint named "train_check" is configured)
result = context.run_checkpoint("train_check")
assert result.success
Key insight: Add data validation as a gate in your pipeline. If validation fails, the pipeline stops and alerts the team. Never let unvalidated data reach the model — garbage in, garbage out.
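The "validation as a gate" pattern needs no framework to understand (a minimal stand-in for Great Expectations, not its API; the rules mirror the three expectations above): check each rule and raise on the first failure, so the pipeline halts before bad data reaches training.

```python
def validate(rows):
    """Raise ValueError if any data-quality rule fails.

    rows: list of dicts, e.g. [{"user_id": 1, "age": 34, "spend": 120.0}]
    """
    for i, row in enumerate(rows):
        if row.get("user_id") is None:
            raise ValueError(f"row {i}: user_id is null")
        if not (0 <= row["age"] <= 120):
            raise ValueError(f"row {i}: age {row['age']} out of range")
    mean_spend = sum(r["spend"] for r in rows) / len(rows)
    if not (10 <= mean_spend <= 500):
        raise ValueError(f"mean spend {mean_spend} outside expected range")
    return rows  # downstream steps only ever see validated data

good = [{"user_id": 1, "age": 34, "spend": 120.0}]
bad = [{"user_id": None, "age": 34, "spend": 120.0}]

validate(good)  # passes silently
try:
    validate(bad)  # gate trips: pipeline stops here, alert fires
except ValueError as e:
    print("pipeline halted:", e)
```

Frameworks like Great Expectations add persistence, reporting, and drift-aware checks on top, but the control flow — validate, and hard-stop on failure — is this.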
Real-Time vs. Batch Features
Choosing the right computation pattern
Feature Computation Patterns
Features fall into three categories by freshness: Batch features (computed daily/hourly from the data warehouse — e.g., “average spend in last 30 days”). Streaming features (computed in real-time from event streams — e.g., “number of transactions in last 5 minutes”). On-demand features (computed at request time from the input — e.g., “is this IP address in a known VPN list?”). Most ML systems use a mix: 80% batch features (cheap, easy) + 15% streaming features (for freshness-sensitive signals) + 5% on-demand features (for request-specific context). Start with batch; add streaming only when freshness is proven to improve model performance.
Feature Types
// Feature computation patterns

Batch (daily/hourly):
  Source:  Data warehouse (BigQuery, Snowflake)
  Compute: Spark, dbt, SQL
  Latency: hours
  Example: avg_spend_30d, lifetime_value
  Cost:    $ (cheapest)

Streaming (seconds/minutes):
  Source:  Kafka, Kinesis
  Compute: Flink, Spark Streaming
  Latency: seconds
  Example: txn_count_5min, velocity_score
  Cost:    $$$ (expensive infra)

On-demand (at request time):
  Source:  Request payload + lookups
  Compute: Application code
  Latency: milliseconds
  Example: is_vpn, device_fingerprint
  Cost:    $$ (per-request compute)
Key insight: Don’t build streaming features unless you have evidence they improve model performance. Batch features cover 80% of use cases and are 10x cheaper to build and maintain.
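The state a streaming feature like txn_count_5min maintains can be sketched with a sliding window (a toy version of Flink-style windowed state, assuming event timestamps arrive in order; timestamps are seconds):

```python
from collections import deque

class SlidingWindowCount:
    """Count events within the trailing `window` seconds."""

    def __init__(self, window=300):  # 300 s = 5 minutes
        self.window = window
        self.events = deque()  # event timestamps, oldest first

    def observe(self, ts):
        self.events.append(ts)

    def value(self, now):
        # Evict events that have fallen out of the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events)

txn_count_5min = SlidingWindowCount(window=300)
for ts in [10, 50, 200, 290]:
    txn_count_5min.observe(ts)

print(txn_count_5min.value(now=300))  # all 4 events still inside the window
print(txn_count_5min.value(now=400))  # events at 10 and 50 have expired
```

The "$$$" in the table comes from running this state for millions of entities, continuously, with exactly-once delivery — not from the window logic itself.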
Feature Reuse & Discovery
The feature store as a shared asset
Feature Sharing
One of the biggest ROI drivers of a feature store is feature reuse. Without a feature store, every team independently computes “average spend in last 30 days” — with slightly different implementations. With a feature store, the feature is defined once, computed once, and shared across all models. This reduces: duplicated compute (one pipeline instead of five), inconsistency (one definition, one result), and onboarding time (new models can browse and reuse existing features). A feature catalog (like a data catalog but for features) lets data scientists search for existing features before building new ones.
Feature Reuse Impact
// Without feature store
Fraud team: avg_spend_30d (Python, daily)
Risk team:  avg_spend_30d (SQL, hourly)
Marketing:  avg_spend_30d (Spark, weekly)
// 3 implementations, 3 different values!

// With feature store
Feature:    user_features:avg_spend_30d
Definition: one SQL query
Compute:    once, daily
Used by:    fraud, risk, marketing models
// 1 implementation, 1 value, 3 consumers

// Feature catalog:
//   "I need a user spending feature"
//   → Search → Found! Already exists
//   → Reuse instead of rebuild
Key insight: The feature store pays for itself when the second team reuses a feature. At scale, organizations report 60–80% of features in new models come from the existing feature store, dramatically reducing time-to-production.
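At its core, "define once, reuse everywhere" is a registry keyed by feature name (a toy catalog sketch; the feature name matches the example above, everything else is invented): all consumers resolve the same definition, and a second, conflicting definition is rejected rather than silently coexisting.

```python
# Toy feature catalog: one definition, many consumers.
registry = {}

def register(name, compute_fn):
    if name in registry:
        raise ValueError(f"{name} already exists — reuse it, don't rebuild it")
    registry[name] = compute_fn

def resolve(name):
    return registry[name]

register("user_features:avg_spend_30d",
         lambda txns: sum(txns) / len(txns))

# Fraud, risk, and marketing all resolve the same definition...
fraud = resolve("user_features:avg_spend_30d")
risk = resolve("user_features:avg_spend_30d")
assert fraud is risk  # one implementation, one value

# ...and a duplicate registration of the "same" feature is rejected.
try:
    register("user_features:avg_spend_30d",
             lambda txns: sum(txns) / max(len(txns), 1))
except ValueError as e:
    print(e)
```

A production feature catalog adds ownership, lineage, and search on top, but the invariant it enforces is this one: a feature name maps to exactly one definition.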