Ch 3 — Fairness Definitions & Metrics

Demographic parity, equalized odds, calibration, the impossibility theorem, and choosing the right metric
What Does “Fair” Mean?
Fairness has multiple, conflicting definitions
The Core Challenge
“Fairness” seems intuitive, but it has multiple formal definitions that conflict with each other. Consider a loan approval model: is it fair if it approves the same percentage of applicants from each group (demographic parity)? Or if it has the same accuracy for each group (equalized odds)? Or if a “70% approval score” means the same thing regardless of group (calibration)? These sound similar but are mathematically different — and in most real-world scenarios, you cannot satisfy all of them simultaneously. This is the central challenge of algorithmic fairness: choosing which definition of fairness to optimize for is a value judgment, not a technical decision.
Three Views of Fairness
// Three definitions of "fair"

Demographic Parity: "Equal outcomes across groups"
  P(approved | Group A) = P(approved | Group B)
  // Same approval rate for everyone

Equalized Odds: "Equal accuracy across groups"
  P(approved | qualified, Group A) = P(approved | qualified, Group B)
  // Same true positive rate for everyone

Calibration: "Scores mean the same thing"
  P(qualified | score=0.7, Group A) = P(qualified | score=0.7, Group B)
  // A 70% score means 70% for everyone

// These CANNOT all be satisfied
// simultaneously (impossibility theorem)
Key insight: There is no single “correct” definition of fairness. The choice depends on the context: what kind of harm are you trying to prevent? Who are the stakeholders? What are the consequences of different types of errors?
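All three conditions are ordinary conditional probabilities, so they can be estimated directly from data. Below is a minimal stdlib-Python sketch on made-up toy records of (group, score, qualified, approved) — every name and number is illustrative, not from any real dataset:

```python
# Toy loan records: (group, score, qualified 0/1, approved 0/1)
records = [
    # Group A: mostly qualified, mostly approved
    ("A", 0.8, 1, 1), ("A", 0.8, 1, 1), ("A", 0.8, 0, 1),
    ("A", 0.3, 1, 0), ("A", 0.3, 0, 0),
    # Group B: fewer qualified, fewer approved
    ("B", 0.8, 1, 1), ("B", 0.3, 1, 0), ("B", 0.3, 0, 0),
    ("B", 0.3, 0, 0), ("B", 0.3, 0, 0),
]

def rate(rows, num, den):
    """Estimate P(num | den) from a list of records."""
    d = [r for r in rows if den(r)]
    return sum(1 for r in d if num(r)) / len(d) if d else None

for g in ("A", "B"):
    rows = [r for r in records if r[0] == g]
    # Demographic parity compares this across groups:
    approval = rate(rows, lambda r: r[3] == 1, lambda r: True)
    # Equalized odds compares this (and the FPR analogue):
    tpr = rate(rows, lambda r: r[3] == 1, lambda r: r[2] == 1)
    # Calibration compares this, per score value:
    calib = rate(rows, lambda r: r[2] == 1, lambda r: r[1] == 0.8)
    print(g, approval, tpr, calib)
```

On this toy data the three numbers differ between groups in three different ways, which is exactly why the three definitions can pull in different directions.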
Demographic Parity
Equal outcomes regardless of group membership
Definition
Demographic parity (also called statistical parity or group fairness) requires that the model’s positive prediction rate is the same across all groups: P(Ŷ=1 | A=a) = P(Ŷ=1 | A=b) for all groups a, b. In plain English: the model should approve (or reject) the same percentage of applicants from each group. Pros: simple to understand and measure, directly addresses representation, aligns with affirmative action goals. Cons: ignores actual qualifications — if Group A has a 70% qualification rate and Group B has a 40% rate, demographic parity requires approving unqualified candidates from Group B or rejecting qualified candidates from Group A. It can reduce overall accuracy and may not be legally defensible in all contexts.
Example
// Demographic parity example: loan approval

Without parity:
  Group A: 100 applicants, 70 approved (70%)
  Group B: 100 applicants, 40 approved (40%)
  // Disparity: 70% vs 40%

With demographic parity:
  Group A: 100 applicants, 55 approved (55%)
  Group B: 100 applicants, 55 approved (55%)
  // Equal rates: 55% each

Problem:
  If Group A qualification rate = 70%
  and Group B qualification rate = 40%
  then parity means:
  - Rejecting 15 qualified from Group A
  - Approving 15 unqualified from Group B
  // Is this "fair"?

Best for:
  Hiring (equal opportunity goals)
  Advertising (equal exposure)
  Resource allocation
Key insight: Demographic parity is the right metric when the goal is equal representation and when you believe the base rate differences between groups are themselves the result of historical discrimination (not genuine differences in qualification).
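The parity check itself is a one-liner over per-group approval rates. Here is a small stdlib-Python sketch using the approval counts from the example above (the `demographic_parity_gap` helper is illustrative, not from any library):

```python
def demographic_parity_gap(preds_by_group):
    """Largest difference in positive-prediction rate across groups.
    preds_by_group maps group name -> list of 0/1 predictions."""
    rates = {g: sum(p) / len(p) for g, p in preds_by_group.items()}
    return rates, max(rates.values()) - min(rates.values())

# Counts from the example: 70/100 approved in Group A, 40/100 in Group B
rates, gap = demographic_parity_gap({
    "A": [1] * 70 + [0] * 30,
    "B": [1] * 40 + [0] * 60,
})
print(rates)          # {'A': 0.7, 'B': 0.4}
print(round(gap, 2))  # 0.3
```

A common working rule treats a gap below some threshold (say 0.1) as acceptable, but the threshold is itself a policy choice, not a technical constant.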
Equalized Odds
Equal error rates across groups
Definition
Equalized odds requires that the model has the same true positive rate (TPR) and false positive rate (FPR) across all groups: P(Ŷ=1 | Y=1, A=a) = P(Ŷ=1 | Y=1, A=b) and P(Ŷ=1 | Y=0, A=a) = P(Ŷ=1 | Y=0, A=b). In plain English: among qualified applicants, the approval rate should be the same across groups. Among unqualified applicants, the rejection rate should also be the same. Equal opportunity is a relaxed version that only requires equal TPR (same approval rate for qualified applicants). Pros: allows different base rates, focuses on accuracy rather than outcomes. Cons: requires ground truth labels, may still produce unequal outcomes if base rates differ.
Example
// Equalized odds example

Group A (base rate 70% qualified):
  Qualified: 70 people
    Approved: 63 (TPR = 90%)
    Rejected: 7
  Unqualified: 30 people
    Approved: 3 (FPR = 10%)
    Rejected: 27

Group B (base rate 40% qualified):
  Qualified: 40 people
    Approved: 36 (TPR = 90%) ✓
    Rejected: 4
  Unqualified: 60 people
    Approved: 6 (FPR = 10%) ✓
    Rejected: 54

// TPR equal (90% = 90%) ✓
// FPR equal (10% = 10%) ✓
// But outcomes differ: 66% vs 42%

Best for:
  Criminal justice (equal error rates)
  Medical diagnosis (equal sensitivity)
  Credit scoring (equal accuracy)
Key insight: Equalized odds is the right metric when you care about equal accuracy rather than equal outcomes. It says: “the model should be equally good at its job for everyone” — but it accepts that different base rates lead to different outcomes.
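The example's arithmetic can be reproduced directly. This stdlib-Python sketch rebuilds both groups from the counts above and confirms that equal TPR/FPR coexists with unequal approval rates:

```python
def group_error_rates(y_true, y_pred):
    """TPR and FPR for one group, from parallel 0/1 lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return tp / pos, fp / neg

# Group A: 70 qualified (63 approved), 30 unqualified (3 approved)
a_true = [1] * 70 + [0] * 30
a_pred = [1] * 63 + [0] * 7 + [1] * 3 + [0] * 27
# Group B: 40 qualified (36 approved), 60 unqualified (6 approved)
b_true = [1] * 40 + [0] * 60
b_pred = [1] * 36 + [0] * 4 + [1] * 6 + [0] * 54

print(group_error_rates(a_true, a_pred))  # (0.9, 0.1)
print(group_error_rates(b_true, b_pred))  # (0.9, 0.1)
# Equal error rates, yet overall approval differs:
print(sum(a_pred) / 100, sum(b_pred) / 100)  # 0.66 0.42
```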
Calibration
Predictions mean the same thing for everyone
Definition
Calibration (also called predictive parity) requires that a predicted probability means the same thing regardless of group: P(Y=1 | S=s, A=a) = P(Y=1 | S=s, A=b) for all scores s. In plain English: if the model gives someone a “70% risk score,” that person should actually have a 70% chance of the outcome, regardless of their group. Pros: intuitive (“scores are honest”), preserves the meaning of risk scores, important for decision-makers who use scores to set thresholds. Cons: can coexist with very different outcomes across groups (if base rates differ, calibrated scores will naturally produce different approval rates). Calibration is the metric most valued by actuaries, insurers, and risk assessors.
Example
// Calibration example

Calibrated model:
  Score = 0.7 for Group A person
    → 70% chance of repaying loan ✓
  Score = 0.7 for Group B person
    → 70% chance of repaying loan ✓
  // Same score = same meaning

Uncalibrated model:
  Score = 0.7 for Group A person
    → 80% chance of repaying ✗
  Score = 0.7 for Group B person
    → 55% chance of repaying ✗
  // Same score ≠ same meaning

Problem with calibration alone:
  If Group A base rate = 70%
  and Group B base rate = 40%
  calibrated scores will naturally be higher for Group A
  → Different approval rates
  // Calibration ≠ equal outcomes

Best for:
  Risk assessment (insurance, bail)
  Medical prognosis
  Any score-based decision system
Key insight: Calibration is about honesty of scores, not equality of outcomes. A calibrated model can still produce very different outcomes for different groups. If your goal is equal outcomes, calibration alone is insufficient.
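In practice, calibration is checked by binning scores and comparing observed outcome rates per group. A stdlib-only sketch (the helper and the toy records are illustrative; real audits use finer bins and confidence intervals):

```python
from collections import defaultdict

def calibration_by_group(records, n_bins=5):
    """records: (group, score in [0,1], outcome 0/1).
    Returns {(group, bin): observed positive rate}. For a calibrated
    model the observed rate in each bin should match the bin's scores,
    for every group."""
    buckets = defaultdict(list)
    for group, score, outcome in records:
        b = min(int(score * n_bins), n_bins - 1)
        buckets[(group, b)].append(outcome)
    return {k: sum(v) / len(v) for k, v in buckets.items()}

# Toy data: score 0.7 -> 70% repay, in BOTH groups (calibrated)
records = (
    [("A", 0.7, 1)] * 7 + [("A", 0.7, 0)] * 3 +
    [("B", 0.7, 1)] * 7 + [("B", 0.7, 0)] * 3
)
print(calibration_by_group(records))  # {('A', 3): 0.7, ('B', 3): 0.7}
```

Both groups show a 0.7 observed rate in the 0.6–0.8 score bin, so the score "means the same thing" for each — even though nothing here constrains the two groups' approval rates to be equal.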
The Impossibility Theorem
You can’t have it all
The Mathematical Constraint
The impossibility theorem (Chouldechova, 2017; Kleinberg, Mullainathan & Raghavan, 2016) proves that when base rates differ between groups, demographic parity, equalized odds, and calibration cannot all be satisfied simultaneously (except in degenerate cases like a perfect predictor or equal base rates). This is not a limitation of current algorithms — it’s a mathematical impossibility. No future algorithm can overcome it. The implication: you must choose which definition of fairness to prioritize. This is a normative (value-based) decision, not a technical one. Different stakeholders may legitimately disagree on which metric matters most.
The Impossibility
// Impossibility theorem

Given:
  Base rate Group A ≠ Base rate Group B
  (e.g., 70% vs 40% qualification rate)

Then you CANNOT simultaneously have:
  ✓ Demographic parity (equal rates)
  ✓ Equalized odds (equal accuracy)
  ✓ Calibration (honest scores)

You must choose:
  Parity + Calibration → unequal accuracy
  Odds + Calibration → unequal outcomes
  Parity + Odds → uncalibrated scores

// This is a MATHEMATICAL fact
// No algorithm can overcome it
// The choice is a VALUE judgment

Exception:
  If base rates are equal, all three
  can be satisfied simultaneously
  // But base rates are rarely equal
Key insight: The impossibility theorem is the most important result in algorithmic fairness. It means there is no “perfectly fair” algorithm — only trade-offs. Acknowledging this honestly is the first step toward responsible AI.
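The core of the conflict between equalized odds and demographic parity is one line of algebra: a group's overall approval rate is TPR·p + FPR·(1−p), where p is that group's base rate. With shared TPR and FPR, different base rates force different approval rates unless TPR = FPR (a model no better than random). A quick numeric check, reusing the 70%/40% base rates from the panels above:

```python
def approval_rate(tpr, fpr, base_rate):
    """Overall positive rate: P(Yhat=1) = TPR*p + FPR*(1-p)."""
    return tpr * base_rate + fpr * (1 - base_rate)

# Equalized odds holds (shared TPR/FPR), but base rates differ:
tpr, fpr = 0.9, 0.1
print(round(approval_rate(tpr, fpr, 0.7), 2))  # 0.66  (Group A)
print(round(approval_rate(tpr, fpr, 0.4), 2))  # 0.42  (Group B)
# Parity fails by exactly (tpr - fpr) * (0.7 - 0.4) = 0.24; the gap
# vanishes only if tpr == fpr (useless model) or base rates are equal.
```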
Individual Fairness
Similar people should be treated similarly
Beyond Group Fairness
All the metrics above are group fairness metrics — they compare outcomes across demographic groups. Individual fairness takes a different approach: similar individuals should receive similar predictions, regardless of group membership. Formally: if individuals x and y are similar (according to a task-relevant distance metric), then the model’s predictions for x and y should also be similar. Counterfactual fairness is a related concept: would the prediction change if the individual belonged to a different group, all else being equal? Pros: doesn’t require defining groups, avoids the “which group?” problem. Cons: requires defining a “similarity metric,” which is itself a value judgment. What makes two people “similar” for a loan? For a job?
Individual vs. Group
// Group vs. individual fairness

Group Fairness:
  "Equal outcomes for groups"
  P(approved | male) = P(approved | female)
  // Compares aggregate statistics

Individual Fairness:
  "Similar people → similar outcomes"
  d(x, y) small → d(f(x), f(y)) small
  // Compares individual predictions

Counterfactual Fairness:
  "Would the prediction change if the
   person were in a different group?"
  f(x | male) ≈ f(x | female)
  // Hypothetical group swap

Challenge:
  What makes two people "similar"?
  For loans: income, credit history?
  For jobs: skills, experience?
  // Defining similarity is a value choice
Key insight: Individual fairness avoids the “which group?” problem but introduces the “what is similar?” problem. In practice, most teams use group fairness metrics because they’re easier to measure and align with legal requirements (disparate impact).
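The "d(x, y) small → d(f(x), f(y)) small" condition is usually formalized as a Lipschitz constraint. A sketch of a per-pair check — the credit model, the distance function, and its weights are all hypothetical, and choosing those weights is exactly the value judgment the text describes:

```python
def lipschitz_fair(model, x, y, distance, L=1.0):
    """Individual-fairness check for one pair of individuals:
    similar inputs should receive similar scores,
    i.e. |f(x) - f(y)| <= L * d(x, y)."""
    return abs(model(x) - model(y)) <= L * distance(x, y)

# Hypothetical credit scorer over (income in $k, late payments)
def score(a):
    return min(1.0, 0.01 * a[0] - 0.1 * a[1])

def dist(a, b):
    # These weights encode what "similar" means -- a value choice
    return abs(a[0] - b[0]) / 10 + abs(a[1] - b[1])

alice, bob = (80, 1), (82, 1)   # near-identical applicants
print(lipschitz_fair(score, alice, bob, dist))  # True
```

A real audit would run this over many sampled pairs and report the fraction of violations, but the per-pair predicate is the whole idea.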
Choosing the Right Metric
A decision framework for practitioners
Decision Framework
Choose your fairness metric based on the context.

Use demographic parity when: the goal is equal representation, base rate differences are due to historical discrimination, or the application is resource allocation (advertising, opportunities).

Use equalized odds when: accuracy matters equally for all groups, the application has high-stakes consequences (criminal justice, medical diagnosis), or you need to minimize both false positives and false negatives across groups.

Use calibration when: the output is a risk score used for decision-making, stakeholders need to trust the meaning of scores, or the application is insurance, lending, or risk assessment.

Use individual fairness when: group definitions are unclear or contested, or you want to avoid stereotyping within groups.
Metric Selection Guide
// Which fairness metric to use?

Demographic Parity:
  Goal: Equal representation
  Use: Hiring, advertising, opportunities
  Question: "Are outcomes equal?"

Equalized Odds:
  Goal: Equal accuracy
  Use: Criminal justice, medical diagnosis
  Question: "Are error rates equal?"

Equal Opportunity (relaxed odds):
  Goal: Equal benefit for qualified
  Use: Loan approval, admissions
  Question: "Do qualified people get equal treatment?"

Calibration:
  Goal: Honest scores
  Use: Risk assessment, insurance
  Question: "Do scores mean the same thing for everyone?"

Individual Fairness:
  Goal: Consistent treatment
  Use: When groups are unclear
  Question: "Are similar people treated similarly?"
Key insight: The choice of fairness metric should involve stakeholders, not just engineers. Ask: who is harmed by false positives? By false negatives? What kind of equality matters most in this context? Document the choice and the reasoning.
Measuring Fairness in Practice
Tools and code for fairness evaluation
Fairness Tools
Several open-source tools make fairness measurement practical: Fairlearn (Microsoft) — Python library for assessing and improving fairness. Provides fairness metrics, visualization dashboards, and mitigation algorithms. AIF360 (IBM) — AI Fairness 360, a comprehensive toolkit with 70+ fairness metrics and 10+ mitigation algorithms. Aequitas (University of Chicago) — bias audit toolkit focused on decision-making systems. What-If Tool (Google) — visual interface for exploring model fairness without code. All of these compute the metrics we’ve discussed (demographic parity, equalized odds, calibration) and provide visualizations to help communicate findings to stakeholders.
Fairlearn Example
# Fairlearn: measure fairness metrics
from fairlearn.metrics import (
    MetricFrame,
    demographic_parity_difference,
    equalized_odds_difference,
)
from sklearn.metrics import accuracy_score

# Compute accuracy by group
mf = MetricFrame(
    metrics=accuracy_score,
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=gender,
)
print(mf.by_group)
# male:   0.95
# female: 0.87

# Demographic parity difference
dp = demographic_parity_difference(
    y_test, y_pred, sensitive_features=gender)
# dp = 0.15 (15% gap)
Key insight: Fairlearn is the most practical starting point for most teams. It integrates with scikit-learn, provides clear visualizations, and includes both measurement and mitigation tools. Start by measuring — you can’t improve what you don’t measure.
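When Fairlearn isn't available, or to sanity-check its output, the two headline gaps can be computed by hand. A stdlib-only sketch on made-up labels: `parity_gap` mirrors the demographic parity difference, and `max_odds_gap` mirrors Fairlearn's `equalized_odds_difference`, which reports the larger of the TPR and FPR gaps between groups:

```python
def parity_gap(y_pred, groups):
    """Largest difference in positive-prediction rate between groups."""
    rates = []
    for g in set(groups):
        sel = [p for p, gg in zip(y_pred, groups) if gg == g]
        rates.append(sum(sel) / len(sel))
    return max(rates) - min(rates)

def max_odds_gap(y_true, y_pred, groups):
    """Larger of the TPR gap and the FPR gap across groups."""
    gaps = []
    for label in (1, 0):          # 1 -> TPR, 0 -> FPR
        rates = []
        for g in set(groups):
            sel = [p for t, p, gg in zip(y_true, y_pred, groups)
                   if gg == g and t == label]
            rates.append(sum(sel) / len(sel))
        gaps.append(max(rates) - min(rates))
    return max(gaps)

# Made-up predictions for two groups of four
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["m", "m", "m", "m", "f", "f", "f", "f"]
print(parity_gap(y_pred, groups))            # 0.5
print(max_odds_gap(y_true, y_pred, groups))  # 0.5
```

Hand-rolled versions like these are fine for spot checks; for reporting and mitigation, the library implementations are better tested and handle edge cases (empty groups, multiple sensitive features) that this sketch ignores.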