Ch 2 — Bias in AI Systems

Sources of bias (data, algorithmic, societal), types of bias, and real-world case studies
High Level
Society → Data → Model → Output → Impact → Feedback (loops back to Society)
Historical & Societal Bias
Bias that exists in the world before any algorithm is built
Where Bias Begins
Historical bias arises when the world itself is unfair, and data faithfully reflects that unfairness. If women were historically underrepresented in tech leadership, a dataset of past leaders will be mostly men — and an AI trained on it will learn that “leader = male.” Societal bias is embedded in language, culture, and institutions. Word embeddings trained on internet text learn that “doctor” is closer to “man” and “nurse” is closer to “woman.” These biases exist before any algorithm is built. The AI doesn’t create them — it inherits and amplifies them. This is the hardest type of bias to address because the “ground truth” data is itself biased.
Bias Pipeline
// How bias flows into AI systems
World (historical inequality)
  ↓
Data (reflects that inequality)
  ↓
Model (learns the patterns)
  ↓
Predictions (reproduces inequality)
  ↓
Decisions (affects real people)
  ↓
Feedback (reinforces the cycle)

// Example: hiring
Past hires: 80% male engineers
→ Data: successful = male pattern
→ Model: scores men higher
→ Hiring: more men hired
→ Future data: still 80% male
→ Cycle perpetuates
Key insight: AI doesn’t just reflect existing bias — it can amplify it through feedback loops. A biased hiring model produces biased hiring data, which trains the next model to be even more biased. Breaking this cycle requires intentional intervention.
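The "doctor is closer to man" pattern in word embeddings can be measured directly with cosine similarity, the standard association measure. A minimal sketch, using toy 3-dimensional vectors whose values are illustrative assumptions (real embeddings such as word2vec or GloVe have hundreds of dimensions):

```python
import math

def cosine(u, v):
    # Cosine similarity: 1.0 = same direction, 0.0 = unrelated
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors chosen to mimic the bias pattern described above;
# real values would come from embeddings trained on internet text.
emb = {
    "man":    [0.9, 0.1, 0.3],
    "woman":  [0.1, 0.9, 0.3],
    "doctor": [0.8, 0.3, 0.5],
    "nurse":  [0.2, 0.8, 0.5],
}

# Association gap: positive → the profession leans "male" in this space
doctor_gap = cosine(emb["doctor"], emb["man"]) - cosine(emb["doctor"], emb["woman"])
nurse_gap  = cosine(emb["nurse"],  emb["man"]) - cosine(emb["nurse"],  emb["woman"])
print(f"doctor leans male by {doctor_gap:+.3f}")
print(f"nurse  leans male by {nurse_gap:+.3f}")
```

The same difference-of-similarities idea underlies published embedding-bias tests; here it only demonstrates the mechanism, not real measured values.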
Data Bias
Bias introduced through data collection and preparation
Types of Data Bias
Representation bias: the training data doesn’t represent the population the model will serve. ImageNet’s images were predominantly Western; dermatology datasets skewed heavily toward light skin. Selection bias: the data collection process systematically excludes certain groups. Surveys that only reach internet users miss offline populations. Measurement bias: the way data is collected introduces systematic errors. Arrest data reflects policing patterns, not actual crime rates — neighborhoods with more police have more arrests. Label bias: human annotators bring their own biases to labeling. Studies show annotators rate identical text differently depending on the perceived race of the author. Temporal bias: data from one time period may not apply to another. COVID-19 changed consumer behavior, invalidating pre-pandemic models.
Data Bias Types
// Types of data bias
Representation:
  ImageNet: 45% US images, 1% Africa
  Dermatology: 80%+ light skin
  → Model fails on underrepresented groups

Selection:
  Online surveys miss offline populations
  Hospital data misses uninsured patients
  → Invisible populations

Measurement:
  Arrest data ≠ crime data
  More police → more arrests
  → Policing bias encoded as "crime"

Label:
  Annotator demographics affect labels
  "Toxic" text labels vary by culture
  → Subjective judgments become "truth"

Temporal:
  Pre-COVID models fail post-COVID
  Consumer behavior shifts
  → Stale data, wrong predictions
Key insight: Measurement bias is the most insidious because it looks like objective data. Arrest records seem factual, but they measure policing intensity, not crime. Using arrest data to predict crime creates a self-fulfilling prophecy: more policing → more arrests → more predicted crime → more policing.
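Representation bias can be checked mechanically by comparing each group's share of the dataset to its share of the population the model will serve. A minimal sketch; the population figures and the 0.5 flagging threshold are illustrative assumptions, not measured values:

```python
# Flag groups whose share of the training data falls far below their
# share of the target population. All numbers here are illustrative.
dataset_share = {"US": 0.45, "Europe": 0.35, "Asia": 0.14, "Africa": 0.01}
population_share = {"US": 0.04, "Europe": 0.10, "Asia": 0.60, "Africa": 0.18}

UNDER_REP_RATIO = 0.5  # dataset share < 50% of population share → flag

flags = {}
for group, pop in population_share.items():
    ratio = dataset_share.get(group, 0.0) / pop
    flags[group] = ratio < UNDER_REP_RATIO
    marker = "  ← under-represented" if flags[group] else ""
    print(f"{group:7s} dataset/population ratio = {ratio:.2f}{marker}")
```

A check like this only catches representation bias; measurement and label bias live inside the values themselves and need different audits.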
Algorithmic Bias
Bias introduced by model design and optimization
How Algorithms Create Bias
Even with perfectly balanced data, algorithms can introduce bias: Objective function bias — optimizing for overall accuracy means the model performs best on the majority group and worst on minorities. A 95% accurate model might be 99% accurate for Group A (90% of data) and 60% accurate for Group B (10% of data). Feature selection bias — using zip code as a feature is a proxy for race in many US cities. The model doesn’t need explicit race data to discriminate. Aggregation bias — a single model for all populations may fail for subgroups with different patterns. HbA1c thresholds for diabetes differ across ethnicities, but a single threshold disadvantages some groups. Evaluation bias — testing on a non-representative benchmark gives a false sense of performance.
Algorithmic Bias Examples
// How algorithms create bias
Objective function:
  Optimize: overall accuracy = 95%
  Group A (90% of data): 99% accurate
  Group B (10% of data): 60% accurate
  // "High accuracy" hides disparities

Proxy features:
  Zip code → proxy for race
  Name → proxy for ethnicity
  School → proxy for socioeconomic status
  // No explicit protected attribute needed

Aggregation:
  One model for all populations
  Diabetes: HbA1c thresholds differ by ethnicity
  → single threshold fails
  // One-size-fits-all doesn't fit all

Evaluation:
  Benchmark: 90% Group A, 10% Group B
  Model: 95% on benchmark
  Reality: 60% for Group B
  // Benchmark doesn't represent reality
Key insight: Proxy features are the most common source of algorithmic discrimination. Removing the “race” column doesn’t prevent racial bias — zip code, name, and other features carry the same information. Fairness requires testing outcomes, not just inputs.
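The objective-function arithmetic above is worth working through once. Using the same group sizes and per-group accuracies as the example (90% of data at 99%, 10% at 60%):

```python
# How a "95% accurate" model can hide a 39-point group gap.
# Group sizes and per-group accuracies match the example in the text.
n_a, n_b = 900, 100          # Group A = 90% of data, Group B = 10%
acc_a, acc_b = 0.99, 0.60    # per-group accuracy

# Overall accuracy is the size-weighted average, so the majority
# group dominates it.
overall = (n_a * acc_a + n_b * acc_b) / (n_a + n_b)
print(f"overall accuracy: {overall:.3f}")            # looks great
print(f"Group A: {acc_a:.2f}  Group B: {acc_b:.2f}") # hides the gap
```

Because the average is weighted by group size, a model can raise its headline number by getting better on the majority while getting worse on the minority.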
Feedback Loops
How biased predictions create more biased data
The Amplification Problem
AI systems don’t just reflect bias — they amplify it through feedback loops. Predictive policing: the model predicts high crime in neighborhoods with more historical arrests → police are sent there → more arrests occur → the model sees more “crime” → predicts even more crime. The prediction becomes self-fulfilling. Recommendation systems: users click on content the algorithm shows them → the algorithm learns those preferences → shows more of the same → creates filter bubbles and echo chambers. Credit scoring: people in underserved areas get denied credit → they can’t build credit history → future models see no credit history → deny credit again. Breaking feedback loops requires: monitoring outcomes over time, injecting randomness (exploration), and regularly auditing for disparate impact.
Feedback Loop Examples
// Feedback loops amplify bias
Predictive Policing:
  Historical arrests → "high crime" area
  → More police deployed
  → More arrests
  → Model sees more "crime"
  → Predicts even more crime
  → Self-fulfilling prophecy

Recommendations:
  User clicks shown content
  → Algorithm learns preference
  → Shows more of the same
  → User clicks more
  → Filter bubble / echo chamber

Credit Scoring:
  No credit history → denied credit
  → Can't build credit history
  → Still no credit history
  → Denied again
  → Poverty trap

// Breaking the loop:
// 1. Monitor outcomes over time
// 2. Inject exploration/randomness
// 3. Audit for disparate impact
Key insight: Feedback loops are the mechanism by which AI turns small biases into large ones. A 5% disparity at deployment can become a 50% disparity after a year of feedback. Monitoring outcomes over time — not just at launch — is essential.
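The predictive-policing loop can be simulated in a few lines. In this toy model, two districts have identical true crime rates, but one starts with slightly more recorded arrests; patrols are concentrated where past arrests are higher, and arrests scale with patrols. All numbers are illustrative assumptions:

```python
# Toy feedback-loop simulation: equal underlying crime, unequal
# starting data, winner-take-most patrol allocation.
true_crime = [0.10, 0.10]   # identical true rates in both districts
arrests = [120.0, 100.0]    # small initial recording disparity

for year in range(10):
    # "Model": label the district with more past arrests as the
    # hot spot and send most patrols there.
    hot = 0 if arrests[0] >= arrests[1] else 1
    patrols = [0.8, 0.2] if hot == 0 else [0.2, 0.8]
    # Arrests scale with patrol presence, not with true crime gaps.
    new = [1000 * p * c for p, c in zip(patrols, true_crime)]
    arrests = [a + n for a, n in zip(arrests, new)]

share_0 = arrests[0] / sum(arrests)
print(f"district 0 share of recorded arrests after 10 years: {share_0:.1%}")
```

Even though the two districts are identical underneath, the recorded-arrest share drifts from roughly 55% to over 75%, which is exactly the self-fulfilling pattern the section describes.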
Case Study: Facial Recognition
Gender Shades and the disparate impact of computer vision
The Gender Shades Study
In 2018, Joy Buolamwini and Timnit Gebru published the Gender Shades study, testing commercial facial recognition systems from Microsoft, IBM, and Face++ on a balanced dataset of faces across skin tones and genders. Results: light-skinned males: 0.8% error rate. Dark-skinned females: 34.7% error rate. That’s a 43x difference in error rate. The cause: training datasets were overwhelmingly light-skinned and male. The systems worked well for the population they were trained on and failed catastrophically for everyone else. After the study, Microsoft and IBM improved their systems significantly. The study demonstrated that auditing AI for bias works — companies fix problems when they’re exposed.
Gender Shades Results
// Gender Shades (Buolamwini & Gebru, 2018)
Error rates by demographic:
  Light-skinned male: 0.8%
  Light-skinned female: 7.0%
  Dark-skinned male: 12.0%
  Dark-skinned female: 34.7%
  // 43x worse for dark-skinned women

Root cause:
  Training data: mostly light-skinned
  Benchmark data: mostly light-skinned
  Developers: mostly light-skinned
  // Homogeneity at every level

After the study:
  Microsoft: reduced gap significantly
  IBM: improved dark-skin accuracy
  // Auditing works → companies respond

Lesson:
  Test on diverse, balanced datasets
  Report disaggregated metrics
  Include affected communities
Key insight: Overall accuracy hides disparities. A facial recognition system that is “99% accurate” might be 99.9% accurate for one group and 65% for another. Always report disaggregated metrics — accuracy broken down by demographic group.
Case Study: Hiring AI
Amazon’s biased resume screener and the broader hiring problem
Amazon’s Hiring Tool
In the early 2010s, Amazon built an AI system to automate resume screening. The system was trained on resumes submitted over a 10-year period — during which most successful hires were men (reflecting the tech industry’s gender imbalance). The AI learned to penalize resumes containing the word “women’s” (as in “women’s chess club captain”) and preferred candidates from all-male colleges. Amazon tried to fix the bias but couldn’t guarantee neutral recommendations. They scrapped the tool in 2018. The broader problem: a 2024 University of Washington study found that LLMs (GPT-4, Claude, Gemini) preferred white-associated names 85% of the time when evaluating identical resumes. Today, 83% of companies use AI for resume screening, and 50% use AI for initial rejections — many candidates never see a human reviewer.
Hiring Bias at Scale
// AI hiring bias
Amazon (2018):
  Trained on: 10 years of resumes
  Problem: tech industry = mostly male
  Result: penalized "women's" in resumes
  Outcome: tool scrapped

LLM Resume Study (2024):
  GPT-4, Claude, Gemini tested
  Identical resumes, different names
  White-associated names preferred: 85%
  Black-associated names preferred: 9%

Scale of impact:
  83% of companies use AI screening
  50% use AI for initial rejections
  Millions of candidates affected
  Many never see a human reviewer

// The fix isn't just technical:
// 1. Audit for disparate impact
// 2. Human review for rejections
// 3. Diverse training data
// 4. Regular bias testing
Key insight: Amazon’s case shows that bias can’t always be “fixed” after the fact. When the training data fundamentally encodes historical discrimination, the most responsible choice may be to not deploy the system at all.
Bias in LLMs
How large language models inherit and amplify societal biases
LLM Bias Sources
LLMs are trained on internet text, which reflects every bias in human society. Stereotyping: LLMs associate professions with genders (“nurse” → female, “engineer” → male), races with traits, and religions with sentiments. Representation: English-centric training data means LLMs perform worse in non-English languages and underrepresent non-Western perspectives. Toxicity: internet text contains hate speech, which LLMs can reproduce. Sycophancy: RLHF training can make models agree with users rather than provide accurate information, reinforcing existing beliefs. Cultural bias: models reflect predominantly Western, English-speaking, educated perspectives. The challenge is unique: LLMs are general-purpose, so bias can manifest in unpredictable ways across millions of use cases.
LLM Bias Examples
// LLM bias manifestations
Stereotyping:
  "The nurse checked her patients"
  "The engineer reviewed his code"
  // Gendered default pronouns

Name bias:
  "Jamal's resume" → lower ratings
  "James's resume" → higher ratings
  // Identical content, different names

Cultural:
  Western-centric worldview
  English performance >> other languages
  US-centric legal/social norms

Toxicity:
  Can generate hate speech if prompted
  Guardrails help but aren't perfect
  Jailbreaks bypass safety training

Sycophancy:
  Agrees with user's stated position
  Even when user is factually wrong
  // RLHF reward: user satisfaction
Key insight: LLM bias is harder to audit than traditional ML bias because the output space is infinite. You can’t test every possible prompt. Red teaming, systematic bias benchmarks, and continuous monitoring are essential — but no approach catches everything.
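The name-bias pattern can be probed with a counterfactual template test: hold the text fixed, vary only the name, and compare scores. Since no real model is called here, `score_resume` is a hypothetical stand-in for whatever scoring endpoint is under audit, and its hard-coded return values exist only to make the sketch runnable:

```python
# Counterfactual name-swap probe (sketch).
# `score_resume` is a placeholder for the model under audit, e.g. an
# LLM asked to rate a resume from 0 to 1; the hard-coded scores below
# are illustrative, not real model outputs.
TEMPLATE = "{name}, 5 years of backend experience, led a team of 4."

def score_resume(text):
    # Placeholder: a real audit would call the model being tested.
    return 0.72 if "Jamal" in text else 0.85

names = ["James", "Jamal"]
scores = {n: score_resume(TEMPLATE.format(name=n)) for n in names}
gap = max(scores.values()) - min(scores.values())
print(scores)
print(f"counterfactual gap: {gap:.2f}")  # any consistent nonzero gap is a red flag
```

In practice the template set would cover many roles and many names per demographic group, and the gap would be averaged over them, since single-prompt differences can be noise.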
Detecting Bias
How to find bias before it causes harm
Detection Approaches
Disaggregated evaluation: break down model performance by demographic group. Report accuracy, precision, recall, and false positive/negative rates for each group separately. Disparate impact testing: check if the model’s decisions disproportionately affect one group. The “four-fifths rule” (US EEOC): if the selection rate for a protected group is less than 80% of the rate for the most-selected group, there’s evidence of adverse impact. Counterfactual testing: change the protected attribute (e.g., swap names, genders) and check if the prediction changes. Bias benchmarks: standardized datasets designed to test for specific biases (WinoBias for gender, BBQ for social biases). Red teaming: adversarial testing by diverse teams to find bias in edge cases.
Detection Methods
// Bias detection toolkit
1. Disaggregated metrics:
   accuracy_male = 0.95
   accuracy_female = 0.87
   accuracy_overall = 0.92
   // Overall hides the gap!

2. Four-fifths rule:
   selection_rate_A = 0.60
   selection_rate_B = 0.40
   ratio = 0.40 / 0.60 = 0.67
   0.67 < 0.80 → adverse impact!

3. Counterfactual:
   "John is a great candidate" → 0.85
   "Jamal is a great candidate" → 0.72
   // Same text, different name → bias

4. Benchmarks:
   WinoBias (gender stereotypes)
   BBQ (social biases)
   BOLD (bias in open-ended generation)
Key insight: The most important bias detection technique is the simplest: disaggregated metrics. Never report only overall accuracy. Always break it down by demographic group. If you can’t measure it, you can’t fix it.
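The four-fifths rule is simple enough to implement directly: divide each group's selection rate by the highest group's rate and flag ratios below 0.8. A minimal sketch using the example rates from the panel above:

```python
# Four-fifths (80%) rule, following the EEOC formulation described
# in the text: compare each group's selection rate to the highest.
def four_fifths_ratios(selection_rates):
    """Ratio of each group's selection rate to the highest rate."""
    top = max(selection_rates.values())
    return {g: r / top for g, r in selection_rates.items()}

rates = {"A": 0.60, "B": 0.40}   # example numbers from the panel above
ratios = four_fifths_ratios(rates)
for group, ratio in ratios.items():
    verdict = "adverse impact" if ratio < 0.8 else "ok"
    print(f"group {group}: ratio {ratio:.2f} → {verdict}")
```

Group B's ratio is 0.40 / 0.60 ≈ 0.67, below the 0.80 threshold, so this screen shows evidence of adverse impact. The rule is a screening heuristic, not a legal verdict: a flagged ratio triggers closer statistical and procedural review.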