LLM Debiasing
LLMs require different debiasing approaches:

- RLHF (Reinforcement Learning from Human Feedback): train the model to prefer outputs that human raters judge as unbiased. This is how ChatGPT, Claude, and Gemini are aligned.
- Constitutional AI (Anthropic): define a set of principles (a "constitution") and train the model to follow them, reducing reliance on human raters.
- Prompt engineering: include debiasing instructions in the system prompt ("Evaluate candidates based only on qualifications, not names or demographics").
- Output filtering: apply guardrails to detect and block biased outputs.
- Fine-tuning on balanced data: fine-tune the model on a carefully curated, balanced dataset.
- Representation engineering: identify and modify the internal representations that encode bias.
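The prompt-engineering option can be sketched as a system message prepended to the request. This is a minimal sketch: `build_messages` is a hypothetical helper, and the message schema simply follows the common chat-API convention.

```python
# Sketch: prepend a debiasing instruction as a system message.
# build_messages is an illustrative helper, not a library call;
# the {"role": ..., "content": ...} schema mirrors common chat APIs.

DEBIAS_SYSTEM_PROMPT = (
    "Evaluate candidates based only on qualifications. "
    "Do not consider names, gender, or demographics."
)

def build_messages(user_request: str) -> list[dict]:
    """Wrap a user request with the debiasing system prompt."""
    return [
        {"role": "system", "content": DEBIAS_SYSTEM_PROMPT},
        {"role": "user", "content": user_request},
    ]

messages = build_messages("Rank these three resumes for the analyst role.")
```

The same instruction works with any chat-style model; only the transport around the message list changes.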
LLM Debiasing Methods
// LLM-specific debiasing
1. RLHF:
Human raters score outputs for bias
Model trained to prefer unbiased outputs
Used by: OpenAI, Google, Anthropic
2. Constitutional AI:
Define principles ("be fair", "don't stereotype")
Model self-critiques against principles
Used by: Anthropic (Claude)
3. Prompt Engineering:
System: "Evaluate based on qualifications only.
Do not consider names, gender, or demographics."
// Cheapest, fastest intervention
4. Output Guardrails:
Detect bias in generated text
Block or rewrite biased outputs
// Post-processing for LLMs
5. Fine-tuning:
Balanced, curated training data
// Expensive but effective
6. Representation Engineering:
Identify and edit the internal
representations that encode bias
// Research-stage technique
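The guardrail step (item 4) can be sketched as a simple post-processor. This is a toy version: a production guardrail would use a trained bias classifier, not the illustrative keyword list assumed here.

```python
import re

# Toy output guardrail: flag demographic references in generated text.
# FLAGGED_TERMS is an illustrative keyword list; real systems use a
# trained classifier rather than pattern matching.
FLAGGED_TERMS = [r"\bhe\b", r"\bshe\b", r"\bwoman\b", r"\bman\b", r"\bgender\b"]

def guardrail(output: str) -> tuple[bool, str]:
    """Return (blocked, text); blocked outputs are replaced with a notice."""
    for pattern in FLAGGED_TERMS:
        if re.search(pattern, output, flags=re.IGNORECASE):
            return True, "[output withheld: possible demographic reference]"
    return False, output

blocked, text = guardrail("She seems too emotional for leadership.")
```

Blocking is the simplest policy; the "rewrite" variant would instead ask the model to regenerate the flagged passage.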
Key insight: For most LLM applications, prompt engineering is the first line of defense against bias. It is essentially free, takes effect immediately, and is surprisingly effective. Combine it with output guardrails for a practical debiasing pipeline.
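That prompt-plus-guardrail pipeline can be sketched end to end. Everything here is illustrative: `call_model` is a stub standing in for a real LLM API call, and the regex detector is a placeholder for a proper bias classifier.

```python
import re

# Sketch of a prompt-engineering + output-guardrail pipeline.
# call_model is a stub for a real LLM call; the regex detector is a
# placeholder for a trained classifier. All names are illustrative.

SYSTEM_PROMPT = (
    "Evaluate candidates based only on qualifications. "
    "Do not consider names, gender, or demographics."
)

def call_model(system: str, user: str) -> str:
    """Stub LLM call; returns a canned response for the demo."""
    return "Candidate B has the strongest track record for the role."

def detect_bias(text: str) -> bool:
    """Toy detector: flags explicit demographic references."""
    return bool(re.search(r"\b(he|she|gender|ethnicity)\b", text, re.IGNORECASE))

def debiased_completion(user_request: str) -> str:
    """Apply the system prompt, then screen the output before returning it."""
    raw = call_model(SYSTEM_PROMPT, user_request)
    if detect_bias(raw):
        return "[response withheld: possible demographic reference]"
    return raw

result = debiased_completion("Rank candidates A and B for the analyst role.")
```

Swapping `call_model` for a real API client and `detect_bias` for a classifier turns this skeleton into the two-layer pipeline described above.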