Ch 7 — Probability Foundations

Deciding whether to carry an umbrella — how AI makes decisions with incomplete information
Should You Carry an Umbrella?
Probability is how you make decisions with incomplete information
The Analogy
You look outside. Dark clouds. The weather app says 70% chance of rain. Do you carry an umbrella? You’re making a decision under uncertainty — you don’t know for sure if it will rain, but you have evidence (clouds, forecast) that updates your belief. Probability is the math of quantifying and reasoning about uncertainty.
Key insight: Every AI prediction is a probability. When GPT says “the next word is likely ‘the’”, it’s outputting P(next_word = “the”) = 0.23. When a medical AI says “85% chance of pneumonia,” it’s a probability. AI doesn’t give certainties — it gives calibrated uncertainties.
The Basics
# Probability: a number between 0 and 1
# 0 = impossible, 1 = certain
# P(rain) = 0.70 (70% chance)
# P(no rain) = 1 - 0.70 = 0.30

# Sample space: all possible outcomes
# Coin: {heads, tails}
# Die: {1, 2, 3, 4, 5, 6}
# LLM: {every word in vocabulary}

# All probabilities must sum to 1
# P(heads) + P(tails) = 0.5 + 0.5 = 1
# Σ P(word_i) = 1 for all vocab words
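These rules can be checked directly in Python. A minimal sketch, using the chapter's own toy distributions represented as dicts:

```python
rain = {"rain": 0.70, "no_rain": 0.30}     # P(rain) and its complement
die = {face: 1/6 for face in range(1, 7)}  # fair die sample space

# Every probability lies in [0, 1], and each distribution sums to 1
for dist in (rain, die):
    assert all(0 <= p <= 1 for p in dist.values())
    assert abs(sum(dist.values()) - 1.0) < 1e-9

print(round(1 - rain["rain"], 2))  # 0.3, P(no rain) via the complement rule
```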
Real World
70% rain → carry umbrella (decision under uncertainty)
In AI
P(“the”) = 0.23 → most likely next word (language model output)
Events & Sample Spaces
The universe of possible outcomes
The Analogy
A sample space is like a menu at a restaurant — all possible dishes you could order. An event is a subset: “I order something vegetarian.” The probability of an event is the fraction of the menu that satisfies your criteria. For a fair die, P(even) = 3/6 = 0.5 because 3 out of 6 faces are even.
Key insight: An LLM’s sample space is its entire vocabulary (50,000+ tokens). At each step, it assigns a probability to every single token. The softmax function (Ch 11) ensures these probabilities sum to 1. Generating text = repeatedly sampling from this distribution.
Worked Example
# Union: P(A or B) = P(A) + P(B) - P(A and B)
# Die: P(even OR >4)
#   P(even) = 3/6, P(>4) = 2/6
#   P(even AND >4) = P({6}) = 1/6
#   P(even OR >4) = 3/6 + 2/6 - 1/6 = 4/6

# Complement: P(not A) = 1 - P(A)
# P(not rain) = 1 - P(rain) = 0.30

# LLM vocabulary example (model and input_ids assumed defined):
import torch.nn.functional as F
logits = model(input_ids)          # raw scores
probs = F.softmax(logits, dim=-1)  # probs.sum() = 1.0 (guaranteed)
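The inclusion-exclusion arithmetic above can be verified by brute-force enumeration of the die's sample space. A small sketch using exact fractions:

```python
from fractions import Fraction

faces = set(range(1, 7))  # sample space of a fair die
even = {f for f in faces if f % 2 == 0}
gt4 = {f for f in faces if f > 4}

# P(event) = fraction of the sample space the event covers
P = lambda event: Fraction(len(event), len(faces))

# Direct count of the union agrees with inclusion-exclusion
assert P(even | gt4) == P(even) + P(gt4) - P(even & gt4)
print(P(even | gt4))  # 2/3, i.e. 4/6
```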
Conditional Probability
How does knowing one thing change the probability of another?
The Analogy
The probability of rain is 30%. But if you see dark clouds, it jumps to 80%. The clouds are evidence that changes your belief. Conditional probability P(A|B) answers: “What’s the probability of A, given that I know B?” Knowing B narrows the sample space and changes the odds.
Key insight: Every layer of a neural network computes a conditional probability. Given the input so far, what’s the probability of each possible output? An LLM computes P(next_word | all_previous_words) at every step. The entire model is a conditional probability machine.
Worked Example
# P(A|B) = P(A and B) / P(B)

# Medical test example:
# P(disease) = 0.01 (1% have it)
# P(positive | disease) = 0.99 (99% sensitivity)
# P(positive | no disease) = 0.05 (5% false positive rate)

# If you test positive, what's P(disease)?
# NOT 99%! We need Bayes' theorem...

# LLM conditional probability:
# P("Paris" | "The capital of France is")
#   = very high (≈ 0.85)
# P("banana" | "The capital of France is")
#   = very low (≈ 0.0001)
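Conditional probability is just counting inside the narrowed sample space. A quick sketch with the fair die: knowing the roll is even doubles the probability that it's a six.

```python
from fractions import Fraction

faces = set(range(1, 7))
even = {2, 4, 6}
six = {6}

# Unconditional: P(six) = 1/6
P_six = Fraction(len(six), len(faces))

# Conditional: knowing "even" shrinks the sample space to {2, 4, 6}
# P(six | even) = |six AND even| / |even|
P_six_given_even = Fraction(len(six & even), len(even))

print(P_six, P_six_given_even)  # 1/6 1/3
```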
Real World
P(rain | dark clouds) = 80% — evidence updates belief
In AI
P(“Paris” | “capital of France is”) ≈ 0.85 — context shapes prediction
Bayes’ Theorem — Updating Your Beliefs
The most important formula in AI reasoning
The Analogy
You think there’s a 30% chance of rain (prior belief). Then you see dark clouds (evidence). Bayes’ theorem tells you how to update your belief to get the posterior: P(rain | clouds). It balances what you believed before with how likely the evidence is under each scenario.
Key insight: The medical test paradox: even with a 99% accurate test, if only 1% of people have the disease, a positive result means only ~17% chance of actually having it. Bayes’ theorem reveals this counterintuitive truth. The base rate (prior) matters enormously.
Worked Example
# Bayes' Theorem:
# P(A|B) = P(B|A) × P(A) / P(B)

# Medical test (continuing the previous section's numbers):
P_disease = 0.01
P_pos_given_disease = 0.99
P_pos_given_healthy = 0.05

# Total probability: P(positive) = P(pos|D)×P(D) + P(pos|H)×P(H)
P_pos = 0.99 * 0.01 + 0.05 * 0.99   # 0.0594

# P(disease | positive)
P_disease_given_pos = (0.99 * 0.01) / 0.0594
# = 0.167 — only 16.7%!
# NOT 99%! The base rate matters.
Formula: P(A|B) = P(B|A) × P(A) / P(B). Prior × Likelihood / Evidence = Posterior. This is the engine behind spam filters, medical AI, and Bayesian neural networks.
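The 16.7% figure can also be checked empirically. A minimal Monte Carlo sketch with the same numbers: simulate many patients, keep only those who test positive, and see what fraction actually have the disease.

```python
import random

random.seed(0)
N = 200_000
positives = sick_and_positive = 0

for _ in range(N):
    sick = random.random() < 0.01                        # 1% base rate
    positive = random.random() < (0.99 if sick else 0.05)
    if positive:
        positives += 1
        sick_and_positive += sick

print(sick_and_positive / positives)  # ≈ 0.167, matching Bayes' theorem
```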
Independence — When Events Don’t Affect Each Other
Coin flips don’t care about each other
The Analogy
Flipping a coin twice: the second flip doesn’t care about the first. They’re independent. But drawing cards without replacement: the second draw IS affected by the first (fewer cards left). Independence means P(A and B) = P(A) × P(B). It’s a powerful simplification — and the key assumption behind Naive Bayes.
Key insight: Naive Bayes assumes all features are independent given the class — that’s the “naive” part. Words in an email are NOT independent (“Nigerian” and “prince” co-occur). But this “wrong” assumption works shockingly well in practice because the errors roughly cancel out.
Worked Example
# Independent: P(A and B) = P(A) × P(B)
# Two coin flips:
# P(HH) = P(H) × P(H) = 0.5 × 0.5 = 0.25

# NOT independent: cards without replacement
# P(Ace₁) = 4/52
# P(Ace₂ | Ace₁) = 3/51 (not 4/52!)

# Conditional independence (Naive Bayes):
# P(w₁, w₂, ..., wₙ | spam)
#   ≈ P(w₁|spam) × P(w₂|spam) × ... × P(wₙ|spam)
# Assumes words are independent given the class
# Wrong but useful!
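The difference the independence assumption makes can be computed exactly. A sketch comparing the true probability of drawing two aces without replacement against what the product rule would wrongly predict:

```python
from fractions import Fraction

# Independent: two fair coin flips, product rule holds
P_H = Fraction(1, 2)
print(P_H * P_H)  # 1/4

# NOT independent: two aces without replacement
P_both_aces = Fraction(4, 52) * Fraction(3, 51)       # correct: 1/221
P_if_independent = Fraction(4, 52) * Fraction(4, 52)  # wrong:   1/169

print(P_both_aces, P_if_independent)  # 1/221 1/169
```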
Real World
Coin flips are independent. Card draws are not.
In AI
Naive Bayes assumes feature independence — wrong but surprisingly effective
Joint & Marginal Distributions
The full picture vs. the summary
The Analogy
A joint distribution is like a spreadsheet showing every combination: P(weather, traffic) for all weather/traffic pairs. The marginal distribution is the row or column total: P(traffic) regardless of weather. Marginalization = summing over the variable you don’t care about, collapsing the table to one dimension.
Key insight: In generative AI, the model learns a joint distribution P(x) over all possible images/text. Generating a sample = drawing from this joint distribution. Conditional generation (like text-to-image) uses P(image | text), which comes from the joint via Bayes’ theorem.
Worked Example
# Joint distribution: P(X, Y)
#        Y=0   Y=1
#  X=0   0.30  0.10  | 0.40
#  X=1   0.20  0.40  | 0.60
#        0.50  0.50  | 1.00

# Marginal: P(X=0) = 0.30 + 0.10 = 0.40
#   (sum over all Y values)

# Conditional from joint:
# P(Y=1|X=1) = P(X=1, Y=1) / P(X=1)
#            = 0.40 / 0.60 = 0.667

# Marginalization in PyTorch:
import torch
joint = torch.tensor([[.3, .1], [.2, .4]])
P_X = joint.sum(dim=1)  # [0.4, 0.6]
P_Y = joint.sum(dim=0)  # [0.5, 0.5]
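The same table can be worked through in plain Python (no PyTorch needed), deriving both the marginal and the conditional from the joint:

```python
# Joint table: rows index X, columns index Y
joint = [[0.30, 0.10],   # X=0
         [0.20, 0.40]]   # X=1

# Marginal of X: sum each row over Y
P_X = [sum(row) for row in joint]  # [0.4, 0.6]

# Conditional from the joint: P(Y=1 | X=1) = P(X=1, Y=1) / P(X=1)
P_Y1_given_X1 = joint[1][1] / P_X[1]
print(round(P_Y1_given_X1, 3))  # 0.667
```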
Naive Bayes Spam Filter
Putting it all together: Bayes + independence = spam detection
The Analogy
Your email inbox is a courtroom. Each email is on trial: spam or not spam? The words in the email are the evidence. The spam filter is the judge who uses Bayes’ theorem: “Given that this email contains ‘free,’ ‘winner,’ and ‘click,’ what’s the probability it’s spam?” Each word independently contributes evidence (Naive Bayes assumption).
Key insight: Naive Bayes was one of the first successful ML algorithms in production. Gmail’s original spam filter used it. Despite being “naive,” it achieves 95%+ accuracy because the independence assumption, while wrong, doesn’t hurt classification much.
Worked Example
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

emails = ["free money click now", "meeting tomorrow 3pm",
          "winner prize claim", "project deadline friday"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(emails)

clf = MultinomialNB()
clf.fit(X, labels)

test = vec.transform(["free prize click"])
clf.predict_proba(test)  # → [[0.05, 0.95]] (95% spam)
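What MultinomialNB computes can be sketched by hand. A minimal from-scratch version on the same four emails, using the chapter's pieces directly: Bayes' theorem, the naive independence assumption (posterior as a product of per-word likelihoods, in log space), and add-one (Laplace) smoothing so unseen words don't zero out the product.

```python
from collections import Counter
from math import log

emails = [("free money click now", 1), ("meeting tomorrow 3pm", 0),
          ("winner prize claim", 1), ("project deadline friday", 0)]

# Count words and documents per class
word_counts = {0: Counter(), 1: Counter()}
class_counts = Counter()
for text, label in emails:
    word_counts[label].update(text.split())
    class_counts[label] += 1

vocab = {w for c in word_counts.values() for w in c}

def log_posterior(text, label):
    # log P(class) + Σ log P(word | class), with add-one smoothing
    total = sum(word_counts[label].values())
    lp = log(class_counts[label] / len(emails))
    for w in text.split():
        lp += log((word_counts[label][w] + 1) / (total + len(vocab)))
    return lp

msg = "free prize click"
print(log_posterior(msg, 1) > log_posterior(msg, 0))  # True → classified spam
```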
Why Every AI Prediction Is a Probability
From spam filters to GPT — uncertainty is the output
The Big Picture
Every AI system outputs probabilities, not certainties. A classifier outputs P(cat | image). An LLM outputs P(next_token | context). A self-driving car outputs P(pedestrian | sensor_data). The entire field of machine learning is about learning probability distributions from data and using them to make decisions under uncertainty.
Why it matters for AI: Calibrated probabilities save lives. A medical AI that says “90% pneumonia” should be right 90% of the time it says that. If it’s only right 60% of the time, it’s overconfident and dangerous. Probability calibration is an active research area in AI safety.
Probability in Modern AI
# Classification: P(class | input)
probs = model(image)
# → [P(cat)=0.85, P(dog)=0.12, P(bird)=0.03]

# Language model: P(token | context)
next_token_probs = lm(context)
# → P("the")=0.23, P("a")=0.15, ...

# Sampling with temperature:
# T→0: always pick highest prob (greedy)
# T=1: sample from distribution as-is
# T>1: flatten probs (more random)
# T<1: sharpen probs (more deterministic)
scaled = logits / temperature
probs = F.softmax(scaled, dim=-1)
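The temperature behavior above can be demonstrated self-contained, without a model. A sketch with hypothetical logits (the words and scores are made up for illustration), implementing the same divide-then-softmax recipe:

```python
import math

def softmax_T(logits, T):
    # Divide logits by T, then softmax (shifted by the max for stability)
    m = max(l / T for l in logits.values())
    exps = {w: math.exp(l / T - m) for w, l in logits.items()}
    Z = sum(exps.values())
    return {w: e / Z for w, e in exps.items()}

logits = {"the": 2.0, "a": 1.0, "cat": 0.0}  # hypothetical raw scores

for T in (0.5, 1.0, 2.0):
    probs = softmax_T(logits, T)
    print(T, {w: round(p, 2) for w, p in probs.items()})
# Lower T sharpens the distribution (more deterministic);
# higher T flattens it (more random)
```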
Real World
Weather forecast: 70% rain — calibrated uncertainty guides decisions
In AI
Every prediction is a probability distribution over possible outcomes