Ch 5 — Explainability & Interpretability

SHAP, LIME, model cards, the accuracy-interpretability trade-off, and the right to an explanation
High Level: Why → Intrinsic → SHAP → LIME → Cards → Rights
Why Explainability Matters
The black box problem and the need for transparency
The Problem
Modern ML models — deep neural networks, gradient-boosted ensembles, LLMs — are black boxes. They make predictions, but we can't easily understand why. This matters because:

Trust — doctors won't follow an AI diagnosis they can't understand.
Debugging — if you can't explain why the model made a mistake, you can't fix it.
Fairness — you can't detect bias if you can't see what features drive decisions.
Regulation — the EU AI Act requires explainability for high-risk AI; GDPR grants a "right to explanation" for automated decisions.
Accountability — when an AI denies someone a loan, they deserve to know why.

Two related concepts: interpretability (the model is inherently understandable) and explainability (a separate method explains a black-box model).
Interpretability Spectrum
// Interpretability spectrum

Inherently interpretable:
  Linear regression: coefficients
  Decision trees: rules
  Rule-based systems: if-then
  // You can read the model

Partially interpretable:
  Small random forests: feature importance
  Logistic regression: odds ratios
  GAMs: shape functions
  // Some insight into the model

Black box (needs explanation):
  Deep neural networks
  Gradient-boosted ensembles (XGBoost)
  Large language models
  // Need external explanation methods

Two approaches:
  1. Use interpretable models (simpler)
  2. Explain black-box models (SHAP, LIME)

// Trade-off: accuracy vs. transparency
Key insight: The accuracy-interpretability trade-off is often overstated. For tabular data, interpretable models (logistic regression, GAMs, small trees) often perform within 1–2% of black-box models. Always try an interpretable model first.
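The "try an interpretable model first" advice can be sketched with scikit-learn; the synthetic dataset and the particular model pair below are illustrative, not a benchmark:

```python
# Compare an interpretable model against a black-box ensemble on
# synthetic tabular data before committing to the black box.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
boosted = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

acc_linear = accuracy_score(y_te, linear.predict(X_te))
acc_boosted = accuracy_score(y_te, boosted.predict(X_te))
print(f"logistic: {acc_linear:.3f}  boosted: {acc_boosted:.3f}")

# If the gap is small, prefer the model you can read: each logistic
# regression coefficient is a direct, global feature effect.
```

If the boosted model wins by only a point or two, the transparent model is usually the better trade on tabular data.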
SHAP
SHapley Additive exPlanations — game-theory-based feature attribution
How SHAP Works
SHAP (Lundberg & Lee, 2017) uses Shapley values from cooperative game theory to assign each feature a contribution to the prediction. The Shapley value of a feature is the average marginal contribution of that feature across all possible feature combinations. Key properties: consistency (if a feature’s contribution increases, its SHAP value increases), local accuracy (SHAP values sum to the difference between the prediction and the average prediction), and missingness (missing features get zero attribution). SHAP provides both local explanations (why this specific prediction) and global explanations (which features matter most overall). Variants: TreeSHAP (fast for tree models), DeepSHAP (for neural networks), KernelSHAP (model-agnostic).
SHAP Example
# SHAP: explain a prediction
import shap

# For tree-based models (fast)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Local explanation (one prediction)
shap.force_plot(
    explainer.expected_value,
    shap_values[0],
    X_test.iloc[0],
)
# → "Income pushed prediction UP by 0.3"
# → "Age pushed prediction DOWN by 0.1"

# Global explanation (all predictions)
shap.summary_plot(shap_values, X_test)
# → Feature importance + direction

# For any model (slower)
explainer = shap.KernelExplainer(
    model.predict, X_train[:100])
shap_values = explainer.shap_values(X_test)
Key insight: SHAP is the gold standard for feature attribution because of its strong theoretical guarantees (consistency, local accuracy). Use TreeSHAP for tree models (instant) and KernelSHAP for any other model (slower but universal).
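To make "average marginal contribution across all possible feature combinations" concrete, here is a brute-force Shapley computation for a toy 3-feature linear model, independent of the shap library. Substituting background-mean values is one common way to "remove" a feature; the model and numbers are illustrative:

```python
from itertools import combinations
from math import factorial

def model(x):                        # toy linear model
    return 3 * x[0] + 2 * x[1] - 1 * x[2]

background = [1.0, 1.0, 1.0]         # average feature values
x = [2.0, 0.0, 3.0]                  # instance to explain
n = 3

def value(subset):
    """Model output with features outside `subset` set to background."""
    z = [x[i] if i in subset else background[i] for i in range(n)]
    return model(z)

def shapley(i):
    """Average marginal contribution of feature i over all subsets."""
    phi = 0.0
    others = [j for j in range(n) if j != i]
    for k in range(n):
        for S in combinations(others, k):
            w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += w * (value(set(S) | {i}) - value(set(S)))
    return phi

phis = [shapley(i) for i in range(n)]
print(phis)  # approximately [3.0, -2.0, -2.0]

# Local accuracy: contributions sum to f(x) - f(background)
assert abs(sum(phis) - (model(x) - model(background))) < 1e-9
```

The exponential subset loop is why the library's TreeSHAP and KernelSHAP approximations exist.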
LIME
Local Interpretable Model-Agnostic Explanations
How LIME Works
LIME (Ribeiro et al., 2016) explains individual predictions by creating a local approximation of the black-box model. For a given prediction: (1) generate perturbed samples around the input, (2) get the black-box model’s predictions for these samples, (3) fit a simple, interpretable model (linear regression) to these local predictions, (4) use the interpretable model’s coefficients as the explanation. LIME is model-agnostic (works with any model) and provides local explanations (specific to one prediction). Variants: LimeTabular (structured data), LimeText (NLP — highlights important words), LimeImage (computer vision — highlights important regions). Limitation: explanations can be inconsistent — running LIME twice on the same input may give different explanations (65–75% feature ranking overlap).
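The four steps above can be sketched from scratch in a few lines. The `black_box` function stands in for any model's predict method, and the kernel width is an illustrative choice, not the actual LIME implementation:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy "black box": a nonlinear function standing in for model.predict_proba
def black_box(X):
    return 1 / (1 + np.exp(-(2 * X[:, 0] - X[:, 1] ** 2)))

rng = np.random.default_rng(0)
x0 = np.array([1.0, 0.5])                        # instance to explain

# (1) generate perturbed samples around the input
Z = x0 + rng.normal(scale=0.3, size=(500, 2))
# (2) get the black-box model's predictions for them
y = black_box(Z)
# (3) fit a simple linear surrogate, weighted by proximity to x0
weights = np.exp(-np.sum((Z - x0) ** 2, axis=1) / 0.5)
surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
# (4) the surrogate's coefficients are the local explanation
print(dict(zip(["x1", "x2"], surrogate.coef_.round(3))))
```

Because the perturbations are random, rerunning without a fixed seed changes the coefficients slightly, which is exactly the consistency limitation noted above.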
LIME Example
# LIME: explain a prediction
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=feature_names,
    class_names=["rejected", "approved"],
)

# Explain one prediction
exp = explainer.explain_instance(
    X_test.iloc[0].values,
    model.predict_proba,
    num_features=5,
)
exp.show_in_notebook()
# → "income > 50K: +0.25"
# → "credit_score > 700: +0.18"
# → "employment < 2yr: -0.12"

# For text
from lime.lime_text import LimeTextExplainer
# → Highlights important words
Key insight: LIME is faster and simpler than SHAP but less theoretically grounded. Use LIME for quick, one-off explanations (especially for text and images). Use SHAP when you need consistent, rigorous feature attribution.
SHAP vs. LIME
When to use which
Comparison
SHAP advantages: theoretically grounded (Shapley values); consistent explanations (same input → same explanation); provides both local and global explanations; fast for tree models (TreeSHAP).

LIME advantages: faster for non-tree models; works well for text and images; simpler to understand conceptually; lighter-weight implementation.

Choose SHAP when: you need consistent, reproducible explanations; you're using tree-based models; you need global feature importance; regulatory compliance requires rigorous attribution.

Choose LIME when: you need quick, one-off explanations; you're explaining text or image models; you want intuitive, human-readable explanations; speed matters more than theoretical rigor.
Head-to-Head
// SHAP vs LIME comparison
                SHAP                      LIME
Theory:         Shapley values            Perturbation
Consistency:    High (same)               Medium (varies)
Scope:          Local + Global            Local only
Speed:          Fast (trees),             Fast (general)
                slow (kernel)
Data:           Tabular best              Text/Image best
Global:         Yes (summary)             No
Regulation:     Strong                    Weaker

Use SHAP for:
  ✓ Tree models (XGBoost, RF)
  ✓ Regulatory compliance
  ✓ Global feature importance
  ✓ Reproducible explanations

Use LIME for:
  ✓ Text classification
  ✓ Image classification
  ✓ Quick one-off explanations
  ✓ Non-technical audiences
Key insight: In practice, use both. SHAP for systematic analysis (global importance, bias detection, compliance) and LIME for individual case explanations (customer-facing, debugging). They complement each other.
Model Cards
Standardized documentation for ML models
What Are Model Cards?
Model cards (Mitchell et al., 2019, Google) are standardized documentation for ML models. They serve as a “nutrition label” for AI, communicating what the model does, how it was built, and its limitations. A model card includes: Model details (architecture, training data, version), Intended use (what it’s designed for and what it’s NOT designed for), Performance metrics (accuracy, disaggregated by demographic group), Limitations (known failure modes, edge cases), Ethical considerations (potential harms, bias analysis), and Training data (source, size, known biases). Model cards are required by the EU AI Act for high-risk AI systems and are considered best practice by Google, Hugging Face, and the ML community.
Model Card Template
// Model card structure
1. Model Details:
   Name, version, architecture
   Training date, framework
   License, contact
2. Intended Use:
   Primary use cases
   Out-of-scope uses (important!)
   Users and stakeholders
3. Performance:
   Overall metrics (accuracy, F1)
   Disaggregated by group:
     Gender: M 95%, F 87%
     Race: W 94%, B 82%
   Benchmark comparisons
4. Limitations:
   Known failure modes
   Edge cases
   Data gaps
5. Ethical Considerations:
   Bias analysis results
   Potential harms
   Mitigation steps taken
6. Training Data:
   Source, size, date range
   Known biases in data
Key insight: The most important section of a model card is “Limitations” and “Out-of-scope uses.” Being honest about what the model can’t do prevents misuse and builds trust. Hugging Face makes model cards easy with their template.
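As a sketch, the template above can be rendered programmatically so every release ships a card. All field values below are placeholders for a hypothetical model, not a real system:

```python
# Fill a minimal model-card skeleton; sections mirror the template above.
CARD_TEMPLATE = """\
# Model Card: {name} (v{version})

## Intended Use
{intended_use}

## Out-of-Scope Uses
{out_of_scope}

## Performance
{performance}

## Limitations
{limitations}
"""

card = CARD_TEMPLATE.format(
    name="credit-risk-gbm",       # hypothetical model name
    version="1.2",
    intended_use="Rank consumer loan applications for manual review.",
    out_of_scope="Automated final decisions without human review.",
    performance="AUC 0.81 overall; report metrics per demographic group.",
    limitations="Trained on 2019-2023 data; may degrade under drift.",
)
print(card)
```

Generating the card in the training pipeline keeps it versioned alongside the model instead of going stale in a wiki.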
The Right to Explanation
Legal requirements for AI transparency
Legal Landscape
GDPR Article 22: individuals have the right not to be subject to decisions based solely on automated processing that significantly affect them. They can request “meaningful information about the logic involved.” EU AI Act: high-risk AI systems must be “sufficiently transparent to enable users to interpret the system’s output and use it appropriately.” Penalties: up to €35M or 7% of global turnover. US: no federal AI explainability law yet, but state laws are emerging (California SB-1001 requires disclosure of AI use). Equal Credit Opportunity Act: lenders must provide specific reasons for credit denials — “the algorithm said no” is not sufficient. The trend is clear: explainability is moving from best practice to legal requirement.
Regulatory Requirements
// Explainability regulations
GDPR (EU, 2018):
  Art 22: right to human review
  Art 13-14: "meaningful information
    about the logic involved"
  Penalty: €20M or 4% turnover

EU AI Act (2024):
  High-risk AI: must be transparent
  "Enable users to interpret output"
  Model documentation required
  Penalty: €35M or 7% turnover

US ECOA:
  Credit denials: specific reasons
  "Algorithm said no" = not enough
  Must explain key factors

California SB-1001:
  Disclose AI use to consumers
  Right to opt out of AI decisions

// Trend: explainability is becoming
// a legal requirement, not just
// a nice-to-have
Key insight: The EU AI Act’s high-risk AI obligations (fully enforceable August 2026) will require explainability for AI in healthcare, hiring, credit, education, and law enforcement. Organizations should start implementing explainability now.
Explaining LLMs
The unique challenge of explaining generative AI
LLM Explainability
LLMs present unique explainability challenges: they have billions of parameters, process text in complex ways, and generate free-form output. Current approaches: Attention visualization — show which input tokens the model “attends to” when generating each output token. Intuitive but research shows attention weights don’t always correspond to importance. Chain-of-thought prompting — ask the model to “explain its reasoning step by step.” The explanation is human-readable but may not reflect the model’s actual computation (it’s a post-hoc rationalization). Mechanistic interpretability — reverse-engineer what individual neurons and circuits in the model do. Cutting-edge research (Anthropic, OpenAI) but not yet practical for most teams. RAG attribution — for retrieval-augmented generation, cite the source documents that informed the answer.
LLM Explanation Methods
// Explaining LLM outputs
1. Chain-of-Thought:
   "Let me think step by step..."
   ✓ Human-readable reasoning
   ✗ May not reflect actual computation
   ✗ Post-hoc rationalization
2. Attention Visualization:
   Show attention weights per token
   ✓ Intuitive visualization
   ✗ Attention ≠ importance (debated)
3. RAG Attribution:
   "Based on [Source 1] and [Source 2]"
   ✓ Verifiable citations
   ✓ Users can check sources
   ✗ Only works for RAG systems
4. Mechanistic Interpretability:
   Reverse-engineer neurons/circuits
   ✓ True understanding of model
   ✗ Cutting-edge research only
   ✗ Not practical yet

// Best practice: combine CoT + RAG
// for practical LLM explainability
Key insight: Chain-of-thought explanations are useful for users but should not be trusted as ground truth about the model’s reasoning. The model generates plausible-sounding explanations, but they may not reflect its actual computation. Treat them as helpful, not authoritative.
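For RAG attribution specifically, here is a minimal sketch: number the sources in the prompt so the model can cite them, then resolve the bracketed citations in the answer. The `ask_llm` call is a hypothetical stand-in, so the answer string is hard-coded for illustration:

```python
import re

def build_prompt(question, sources):
    """Number the retrieved sources so the model can cite them."""
    numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    return (f"Answer using only the sources below and cite them "
            f"like [1].\n\nSources:\n{numbered}\n\nQ: {question}\nA:")

def cited_sources(answer, sources):
    """Resolve bracketed citation numbers back to source texts."""
    ids = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return [sources[i - 1] for i in sorted(ids) if 1 <= i <= len(sources)]

sources = ["Refunds allowed within 30 days.", "Shipping takes 5 days."]
prompt = build_prompt("What is the refund policy?", sources)
# answer = ask_llm(prompt)  # hypothetical LLM call
answer = "You can get a refund within 30 days [1]."
print(cited_sources(answer, sources))
```

Resolving citations back to source texts lets the UI show users exactly which passages the answer rests on, which is the verifiable part of the explanation.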
Practical Guidelines
Building explainability into your ML workflow
Implementation Guide
Step 1: Start with an interpretable model (logistic regression, decision tree, GAM). Only use a black box if the accuracy gain justifies the transparency loss.
Step 2: For black-box models, add SHAP explanations. Use TreeSHAP for tree models, KernelSHAP for others. Generate both local (per-prediction) and global (feature importance) explanations.
Step 3: Create a model card documenting the model, its performance (disaggregated), limitations, and ethical considerations.
Step 4: For user-facing systems, provide explanations in plain language ("Your application was declined primarily because of insufficient credit history").
Step 5: For LLMs, use chain-of-thought + RAG attribution.
Step 6: Log explanations alongside predictions for audit trails.
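Step 4 can be sketched by mapping per-feature attribution scores (e.g. one row of SHAP values for a rejected application) to canned plain-language reasons. The feature names, scores, and phrasings below are illustrative:

```python
# Map the most negative feature attributions to ECOA-style reason codes.
REASONS = {
    "credit_history_months": "insufficient credit history",
    "debt_to_income": "high debt-to-income ratio",
    "recent_inquiries": "several recent credit inquiries",
}

def top_reasons(contributions, n=2):
    """Return phrases for the n features pushing hardest toward rejection."""
    most_negative = sorted(contributions.items(), key=lambda kv: kv[1])[:n]
    return [REASONS[name] for name, value in most_negative if value < 0]

# Illustrative attribution scores for one prediction
contribs = {"credit_history_months": -0.31,
            "debt_to_income": -0.12,
            "recent_inquiries": 0.05}
print("Declined primarily because of: " + "; ".join(top_reasons(contribs)))
```

Keeping the phrasing table separate from the model means compliance teams can review the customer-facing language without touching the attribution code.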
Explainability Checklist
// Explainability implementation checklist
Before training:
  □ Try interpretable model first
  □ Document intended use and limits
  □ Define explanation requirements
During development:
  □ Add SHAP/LIME explanations
  □ Generate global feature importance
  □ Test explanations for consistency
  □ Create model card
At deployment:
  □ User-facing explanations (plain language)
  □ Log explanations with predictions
  □ Appeal mechanism for affected users
  □ Audit trail for regulators
For LLMs:
  □ Chain-of-thought reasoning
  □ Source attribution (RAG)
  □ Confidence indicators
  □ "I don't know" capability
Ongoing:
  □ Monitor explanation quality
  □ Update model card with findings
  □ Review for regulatory compliance
Key insight: Explainability is not just a technical feature — it’s a design choice. The best explanation is one that helps the specific audience (user, developer, regulator, auditor) make better decisions. Different audiences need different explanations.