Ch 7 — Probability Foundations

Deciding whether to carry an umbrella — how AI makes decisions with incomplete information
Should You Carry an Umbrella?
Probability is how you make decisions with incomplete information
The Analogy
You look outside. Dark clouds. The weather app says 70% chance of rain. Do you carry an umbrella? You’re making a decision under uncertainty — you don’t know for sure if it will rain, but you have evidence (clouds, forecast) that updates your belief. Probability is the math of quantifying and reasoning about uncertainty.
Key insight: Every AI prediction is a probability. When GPT says “the next word is likely ‘the’”, it’s outputting P(next_word = “the”) = 0.23. When a medical AI says “85% chance of pneumonia,” it’s a probability. AI doesn’t give certainties — it gives calibrated uncertainties.
The Basics
# Probability: a number between 0 and 1
# 0 = impossible, 1 = certain
# P(rain) = 0.70 (70% chance)
# P(no rain) = 1 - 0.70 = 0.30

# Sample space: all possible outcomes
# Coin: {heads, tails}
# Die: {1, 2, 3, 4, 5, 6}
# LLM: {every word in vocabulary}

# All probabilities must sum to 1
# P(heads) + P(tails) = 0.5 + 0.5 = 1
# Σ P(word_i) = 1 for all vocab words
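These rules can be checked directly in Python. A minimal sketch, using the chapter's own toy distributions represented as dicts:

```python
rain = {"rain": 0.70, "no_rain": 0.30}     # P(rain) and its complement
die = {face: 1/6 for face in range(1, 7)}  # fair die sample space

# Every probability lies in [0, 1], and each distribution sums to 1
for dist in (rain, die):
    assert all(0 <= p <= 1 for p in dist.values())
    assert abs(sum(dist.values()) - 1.0) < 1e-9

print(round(1 - rain["rain"], 2))  # 0.3, P(no rain) via the complement rule
```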
Real World
70% rain → carry umbrella (decision under uncertainty)
In AI
P(“the”) = 0.23 → most likely next word (language model output)
Events & Sample Spaces
The universe of possible outcomes
The Analogy
A sample space is like a menu at a restaurant — all possible dishes you could order. An event is a subset: “I order something vegetarian.” The probability of an event is the fraction of the menu that satisfies your criteria. For a fair die, P(even) = 3/6 = 0.5 because 3 out of 6 faces are even.
Key insight: An LLM’s sample space is its entire vocabulary (50,000+ tokens). At each step, it assigns a probability to every single token. The softmax function (Ch 11) ensures these probabilities sum to 1. Generating text = repeatedly sampling from this distribution.
Worked Example
# Union: P(A or B) = P(A) + P(B) - P(A and B)
# Die: P(even OR >4)
#   P(even) = 3/6, P(>4) = 2/6
#   P(even AND >4) = P({6}) = 1/6
#   P(even OR >4) = 3/6 + 2/6 - 1/6 = 4/6

# Complement: P(not A) = 1 - P(A)
# P(not rain) = 1 - P(rain) = 0.30

# LLM vocabulary example (model and input_ids assumed defined):
import torch.nn.functional as F
logits = model(input_ids)          # raw scores
probs = F.softmax(logits, dim=-1)  # probs.sum() = 1.0 (guaranteed)
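The inclusion-exclusion arithmetic above can be verified by brute-force enumeration of the die's sample space. A small sketch using exact fractions:

```python
from fractions import Fraction

faces = set(range(1, 7))  # sample space of a fair die
even = {f for f in faces if f % 2 == 0}
gt4 = {f for f in faces if f > 4}

# P(event) = fraction of the sample space the event covers
P = lambda event: Fraction(len(event), len(faces))

# Direct count of the union agrees with inclusion-exclusion
assert P(even | gt4) == P(even) + P(gt4) - P(even & gt4)
print(P(even | gt4))  # 2/3, i.e. 4/6
```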
Conditional Probability
How does knowing one thing change the probability of another?
The Analogy
The probability of rain is 30%. But if you see dark clouds, it jumps to 80%. The clouds are evidence that changes your belief. Conditional probability P(A|B) answers: “What’s the probability of A, given that I know B?” Knowing B narrows the sample space and changes the odds.
Key insight: Every layer of a neural network computes a conditional probability. Given the input so far, what’s the probability of each possible output? An LLM computes P(next_word | all_previous_words) at every step. The entire model is a conditional probability machine.
Worked Example
# P(A|B) = P(A and B) / P(B)

# Medical test example:
# P(disease) = 0.01 (1% have it)
# P(positive | disease) = 0.99 (99% sensitivity)
# P(positive | no disease) = 0.05 (5% false positive rate)

# If you test positive, what's P(disease)?
# NOT 99%! We need Bayes' theorem...

# LLM conditional probability:
# P("Paris" | "The capital of France is")
#   = very high (≈ 0.85)
# P("banana" | "The capital of France is")
#   = very low (≈ 0.0001)
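Conditional probability is just counting inside the narrowed sample space. A quick sketch with the fair die: knowing the roll is even doubles the probability that it's a six.

```python
from fractions import Fraction

faces = set(range(1, 7))
even = {2, 4, 6}
six = {6}

# Unconditional: P(six) = 1/6
P_six = Fraction(len(six), len(faces))

# Conditional: knowing "even" shrinks the sample space to {2, 4, 6}
# P(six | even) = |six AND even| / |even|
P_six_given_even = Fraction(len(six & even), len(even))

print(P_six, P_six_given_even)  # 1/6 1/3
```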
Real World
P(rain | dark clouds) = 80% — evidence updates belief
In AI
P(“Paris” | “capital of France is”) ≈ 0.85 — context shapes prediction
Bayes’ Theorem — Updating Your Beliefs
The most important formula in AI reasoning
The Analogy
You think there’s a 30% chance of rain (prior belief). Then you see dark clouds (evidence). Bayes’ theorem tells you how to update your belief to get the posterior: P(rain | clouds). It balances what you believed before with how likely the evidence is under each scenario.
Key insight: The medical test paradox: even with a 99% accurate test, if only 1% of people have the disease, a positive result means only ~17% chance of actually having it. Bayes’ theorem reveals this counterintuitive truth. The base rate (prior) matters enormously.
Worked Example
# Bayes' Theorem:
# P(A|B) = P(B|A) × P(A) / P(B)

# Medical test (continuing the previous section's numbers):
P_disease = 0.01
P_pos_given_disease = 0.99
P_pos_given_healthy = 0.05

# Total probability: P(positive) = P(pos|D)×P(D) + P(pos|H)×P(H)
P_pos = 0.99 * 0.01 + 0.05 * 0.99   # 0.0594

# P(disease | positive)
P_disease_given_pos = (0.99 * 0.01) / 0.0594
# = 0.167 — only 16.7%!
# NOT 99%! The base rate matters.
Formula: P(A|B) = P(B|A) × P(A) / P(B). Prior × Likelihood / Evidence = Posterior. This is the engine behind spam filters, medical AI, and Bayesian neural networks.
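The 16.7% figure can also be checked empirically. A minimal Monte Carlo sketch with the same numbers: simulate many patients, keep only those who test positive, and see what fraction actually have the disease.

```python
import random

random.seed(0)
N = 200_000
positives = sick_and_positive = 0

for _ in range(N):
    sick = random.random() < 0.01                        # 1% base rate
    positive = random.random() < (0.99 if sick else 0.05)
    if positive:
        positives += 1
        sick_and_positive += sick

print(sick_and_positive / positives)  # ≈ 0.167, matching Bayes' theorem
```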
Independence — When Events Don’t Affect Each Other
Coin flips don’t care about each other
The Analogy
Flipping a coin twice: the second flip doesn’t care about the first. They’re independent. But drawing cards without replacement: the second draw IS affected by the first (fewer cards left). Independence means P(A and B) = P(A) × P(B). It’s a powerful simplification — and the key assumption behind Naive Bayes.
Key insight: Naive Bayes assumes all features are independent given the class — that’s the “naive” part. Words in an email are NOT independent (“Nigerian” and “prince” co-occur). But this “wrong” assumption works shockingly well in practice because the errors roughly cancel out.
Worked Example
# Independent: P(A and B) = P(A) × P(B)
# Two coin flips:
# P(HH) = P(H) × P(H) = 0.5 × 0.5 = 0.25

# NOT independent: cards without replacement
# P(Ace₁) = 4/52
# P(Ace₂ | Ace₁) = 3/51 (not 4/52!)

# Conditional independence (Naive Bayes):
# P(w₁, w₂, ..., wₙ | spam)
#   ≈ P(w₁|spam) × P(w₂|spam) × ... × P(wₙ|spam)
# Assumes words are independent given the class
# Wrong but useful!
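The difference the independence assumption makes can be computed exactly. A sketch comparing the true probability of drawing two aces without replacement against what the product rule would wrongly predict:

```python
from fractions import Fraction

# Independent: two fair coin flips, product rule holds
P_H = Fraction(1, 2)
print(P_H * P_H)  # 1/4

# NOT independent: two aces without replacement
P_both_aces = Fraction(4, 52) * Fraction(3, 51)       # correct: 1/221
P_if_independent = Fraction(4, 52) * Fraction(4, 52)  # wrong:   1/169

print(P_both_aces, P_if_independent)  # 1/221 1/169
```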
Real World
Coin flips are independent. Card draws are not.
In AI
Naive Bayes assumes feature independence — wrong but surprisingly effective
Joint & Marginal Distributions
The full picture vs. the summary
The Analogy
A joint distribution is like a spreadsheet showing every combination: P(weather, traffic) for all weather/traffic pairs. The marginal distribution is the row or column total: P(traffic) regardless of weather. Marginalization = summing over the variable you don’t care about, collapsing the table to one dimension.
Key insight: In generative AI, the model learns a joint distribution P(x) over all possible images/text. Generating a sample = drawing from this joint distribution. Conditional generation (like text-to-image) uses P(image | text), which comes from the joint via Bayes’ theorem.
Worked Example
# Joint distribution: P(X, Y)
#        Y=0   Y=1
#  X=0   0.30  0.10  | 0.40
#  X=1   0.20  0.40  | 0.60
#        0.50  0.50  | 1.00

# Marginal: P(X=0) = 0.30 + 0.10 = 0.40
#   (sum over all Y values)

# Conditional from joint:
# P(Y=1|X=1) = P(X=1, Y=1) / P(X=1)
#            = 0.40 / 0.60 = 0.667

# Marginalization in PyTorch:
import torch
joint = torch.tensor([[.3, .1], [.2, .4]])
P_X = joint.sum(dim=1)  # [0.4, 0.6]
P_Y = joint.sum(dim=0)  # [0.5, 0.5]
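The same table can be worked through in plain Python (no PyTorch needed), deriving both the marginal and the conditional from the joint:

```python
# Joint table: rows index X, columns index Y
joint = [[0.30, 0.10],   # X=0
         [0.20, 0.40]]   # X=1

# Marginal of X: sum each row over Y
P_X = [sum(row) for row in joint]  # [0.4, 0.6]

# Conditional from the joint: P(Y=1 | X=1) = P(X=1, Y=1) / P(X=1)
P_Y1_given_X1 = joint[1][1] / P_X[1]
print(round(P_Y1_given_X1, 3))  # 0.667
```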
Naive Bayes Spam Filter
Putting it all together: Bayes + independence = spam detection
The Analogy
Your email inbox is a courtroom. Each email is on trial: spam or not spam? The words in the email are the evidence. The spam filter is the judge who uses Bayes’ theorem: “Given that this email contains ‘free,’ ‘winner,’ and ‘click,’ what’s the probability it’s spam?” Each word independently contributes evidence (Naive Bayes assumption).
Key insight: Naive Bayes was one of the first successful ML algorithms in production. Gmail’s original spam filter used it. Despite being “naive,” it achieves 95%+ accuracy because the independence assumption, while wrong, doesn’t hurt classification much.
Worked Example
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

emails = ["free money click now", "meeting tomorrow 3pm",
          "winner prize claim", "project deadline friday"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(emails)

clf = MultinomialNB()
clf.fit(X, labels)

test = vec.transform(["free prize click"])
clf.predict_proba(test)  # → [[0.05, 0.95]] (95% spam)
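What MultinomialNB computes can be sketched by hand. A minimal from-scratch version on the same four emails, using the chapter's pieces directly: Bayes' theorem, the naive independence assumption (posterior as a product of per-word likelihoods, in log space), and add-one (Laplace) smoothing so unseen words don't zero out the product.

```python
from collections import Counter
from math import log

emails = [("free money click now", 1), ("meeting tomorrow 3pm", 0),
          ("winner prize claim", 1), ("project deadline friday", 0)]

# Count words and documents per class
word_counts = {0: Counter(), 1: Counter()}
class_counts = Counter()
for text, label in emails:
    word_counts[label].update(text.split())
    class_counts[label] += 1

vocab = {w for c in word_counts.values() for w in c}

def log_posterior(text, label):
    # log P(class) + Σ log P(word | class), with add-one smoothing
    total = sum(word_counts[label].values())
    lp = log(class_counts[label] / len(emails))
    for w in text.split():
        lp += log((word_counts[label][w] + 1) / (total + len(vocab)))
    return lp

msg = "free prize click"
print(log_posterior(msg, 1) > log_posterior(msg, 0))  # True → classified spam
```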
Why Every AI Prediction Is a Probability
From spam filters to GPT — uncertainty is the output
The Big Picture
Every AI system outputs probabilities, not certainties. A classifier outputs P(cat | image). An LLM outputs P(next_token | context). A self-driving car outputs P(pedestrian | sensor_data). The entire field of machine learning is about learning probability distributions from data and using them to make decisions under uncertainty.
Why it matters for AI: Calibrated probabilities save lives. A medical AI that says “90% pneumonia” should be right 90% of the time it says that. If it’s only right 60% of the time, it’s overconfident and dangerous. Probability calibration is an active research area in AI safety.
Probability in Modern AI
# Classification: P(class | input)
probs = model(image)
# → [P(cat)=0.85, P(dog)=0.12, P(bird)=0.03]

# Language model: P(token | context)
next_token_probs = lm(context)
# → P("the")=0.23, P("a")=0.15, ...

# Sampling with temperature:
# T→0: always pick highest prob (greedy)
# T=1: sample from distribution as-is
# T>1: flatten probs (more random)
# T<1: sharpen probs (more deterministic)
scaled = logits / temperature
probs = F.softmax(scaled, dim=-1)
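The temperature behavior above can be demonstrated self-contained, without a model. A sketch with hypothetical logits (the words and scores are made up for illustration), implementing the same divide-then-softmax recipe:

```python
import math

def softmax_T(logits, T):
    # Divide logits by T, then softmax (shifted by the max for stability)
    m = max(l / T for l in logits.values())
    exps = {w: math.exp(l / T - m) for w, l in logits.items()}
    Z = sum(exps.values())
    return {w: e / Z for w, e in exps.items()}

logits = {"the": 2.0, "a": 1.0, "cat": 0.0}  # hypothetical raw scores

for T in (0.5, 1.0, 2.0):
    probs = softmax_T(logits, T)
    print(T, {w: round(p, 2) for w, p in probs.items()})
# Lower T sharpens the distribution (more deterministic);
# higher T flattens it (more random)
```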
Real World
Weather forecast: 70% rain — calibrated uncertainty guides decisions
In AI
Every prediction is a probability distribution over possible outcomes