Ch 6 — Naive Bayes & Probabilistic Models

Bayes’ theorem meets the “naive” independence assumption — simple, fast, and surprisingly effective
Bayes’ Theorem: Prior, Likelihood, Posterior
Updating beliefs with evidence — the foundation of probabilistic reasoning
The Medical Test Analogy
A disease affects 1 in 1,000 people. A test is 99% accurate (99% true positive rate, 1% false positive rate). You test positive. What’s the probability you actually have the disease?

Most people guess ~99%. The real answer: ~9%. Why? Because the disease is so rare that the 1% false positives from 999 healthy people (≈10 false alarms) vastly outnumber the 1 true positive.

Bayes’ theorem formalizes this:

P(disease | positive) = P(positive | disease) × P(disease) / P(positive)

Prior: P(disease) = 0.001 — your belief before seeing evidence.
Likelihood: P(positive | disease) = 0.99 — how likely the evidence is given the hypothesis.
Evidence: P(positive) = 0.001×0.99 + 0.999×0.01 = 0.01089
Posterior: P(disease | positive) = 0.99 × 0.001 / 0.01089 = 0.091
Bayes’ Theorem Formula
P(class | features) = P(features | class) × P(class)
                      ──────────────────────────────
                               P(features)

Components:
  P(class)            = Prior (base rate)
  P(features | class) = Likelihood
  P(features)         = Evidence (normalizer)
  P(class | features) = Posterior (what we want)

Medical test example:
  Prior:      P(disease) = 0.001
  Likelihood: P(+|disease) = 0.99
  Evidence:   P(+) = 0.001×0.99 + 0.999×0.01 = 0.01089
  Posterior:  P(disease|+) = 0.99×0.001/0.01089 = 0.091  (9.1%!)

# The prior matters enormously. A rare disease
# stays unlikely even with a positive test.
Key insight: Bayes’ theorem is like updating a weather forecast. Your prior is “20% chance of rain” (base rate for the season). You see dark clouds (evidence). The likelihood of dark clouds given rain is high. Bayes’ theorem combines these to give you an updated forecast: “75% chance of rain.” The prior anchors your belief; the evidence shifts it.
The “Naive” Independence Assumption
Why it’s wrong, and why it works anyway
The Problem It Solves
To classify an email with 10,000 words, we need P(word₁, word₂, …, word₁₀₀₀₀ | spam). Estimating this joint probability requires more data than exists in the universe — the number of possible word combinations is astronomical.

The naive assumption: features are conditionally independent given the class. This means:

P(x₁, x₂, …, xₙ | y) = ∏ P(xᵢ | y)

Instead of one impossibly complex joint distribution, we estimate n simple one-dimensional distributions, one per feature. For 10,000 words, that's 10,000 simple probabilities instead of one 10,000-dimensional distribution.

Is it true? Almost never. “Free” and “money” in an email are clearly correlated. But Naive Bayes doesn’t need the assumption to be true — it just needs the resulting decision boundary to be approximately correct. The ranking of classes is often right even when the probabilities are miscalibrated.
Why It Works Despite Being Wrong
Without naive assumption (joint):
  P(x₁, x₂, ..., x₁₀₀₀₀ | spam)
  Parameters needed: 2¹⁰⁰⁰⁰ ≈ ∞

With naive assumption (factored):
  P(x₁|spam) × P(x₂|spam) × ... × P(x₁₀₀₀₀|spam)
  Parameters needed: 10,000 × 2 = 20,000

Classification rule:
  ŷ = argmax_y  P(y) × ∏ P(xᵢ | y)
# We don't need P(features) — it's the same
# for all classes, so it cancels out.

In log space (avoids underflow):
  ŷ = argmax_y  [log P(y) + Σ log P(xᵢ | y)]

# The independence assumption is wrong, but:
# 1. It dramatically reduces parameters
# 2. The class ranking is often correct
# 3. With enough data, it converges fast
Key insight: The naive assumption is like a weather forecaster who treats temperature and humidity as independent. They’re clearly correlated, but the forecaster still predicts rain vs sunshine correctly most of the time. The simplification loses some nuance in the probabilities but preserves the decision — which is what classification cares about.
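The log-space classification rule can be exercised with a toy from-scratch scorer. The per-word likelihoods and priors below are invented for illustration, not estimated from data:

```python
import math

# Toy per-word likelihoods P(word | class), invented for illustration
likelihoods = {
    'spam': {'free': 0.05,  'money': 0.04,  'meeting': 0.001},
    'ham':  {'free': 0.005, 'money': 0.004, 'meeting': 0.03},
}
priors = {'spam': 0.4, 'ham': 0.6}

def score(words, cls):
    # log P(y) + Σ log P(word | y): sums of logs avoid underflow
    return math.log(priors[cls]) + sum(
        math.log(likelihoods[cls][w]) for w in words)

email = ['free', 'money']
best = max(priors, key=lambda c: score(email, c))
print(best)  # → spam
```

Multiplying 10,000 probabilities, each well below 1, underflows to 0.0 in floating point; summing their logs does not, which is why every real implementation works in log space.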
Gaussian, Multinomial, and Bernoulli NB
Three variants for three types of data
Gaussian Naive Bayes
Assumes each feature follows a normal distribution within each class:

P(xᵢ | y) = (1/√(2πσ²)) · exp(−(xᵢ − μ)² / (2σ²))

For each class y and feature i, estimate μ (mean) and σ (standard deviation) from training data. Best for continuous features like height, weight, temperature.
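The density above can be computed by hand. A sketch with made-up height data, showing what GaussianNB estimates for one feature within one class:

```python
import math

# Made-up heights for one class (one continuous feature)
heights_class_a = [170.0, 175.0, 180.0, 165.0, 172.0]

# Estimate the class-conditional mean and variance
mu = sum(heights_class_a) / len(heights_class_a)
var = sum((x - mu) ** 2 for x in heights_class_a) / len(heights_class_a)

def gaussian_pdf(x, mu, var):
    # (1/√(2πσ²)) · exp(−(x − μ)² / (2σ²))
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Likelihood of a new observation under this class's Gaussian
print(round(gaussian_pdf(173.0, mu, var), 4))
```

GaussianNB fits exactly these two numbers, μ and σ², for every (class, feature) pair, then plugs new values into the density at prediction time.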

Multinomial Naive Bayes
Models count data using the multinomial distribution. P(xᵢ | y) is proportional to how often feature i appears in class y. Best for text classification with word counts or TF-IDF vectors. The go-to for spam filtering, sentiment analysis, and document categorization.
Bernoulli Naive Bayes
Models binary features (present/absent). Each feature is a coin flip: P(xᵢ = 1 | y). Best for binary feature vectors like “does the email contain the word ‘free’?” (yes/no). Unlike Multinomial, it explicitly penalizes the absence of features.
Choosing the Right Variant
from sklearn.naive_bayes import (
    GaussianNB, MultinomialNB, BernoulliNB
)

# Gaussian: continuous features
GaussianNB()
#   Iris (sepal/petal measurements)
#   Medical data (blood pressure, BMI)

# Multinomial: word counts / TF-IDF
MultinomialNB(alpha=1.0)
#   Spam detection, news categorization
#   Sentiment analysis

# Bernoulli: binary features
BernoulliNB(alpha=1.0)
#   Short text (does word appear? yes/no)
#   Binary feature vectors

# Decision guide:
#   Continuous data → GaussianNB
#   Word counts     → MultinomialNB
#   Binary flags    → BernoulliNB
#   Not sure?       → Try MultinomialNB first
#                     (most versatile for text)
Key insight: The three variants are like three different measuring tools. Gaussian uses a ruler (continuous measurements). Multinomial uses a tally counter (how many times each word appears). Bernoulli uses a checklist (is each word present or absent?). Pick the tool that matches your data type.
Text Classification: Bag-of-Words & TF-IDF
Turning words into numbers — the classic NLP pipeline
Bag-of-Words
Bag-of-Words (BoW) represents a document as a vector of word counts, ignoring word order. “The cat sat on the mat” becomes {the: 2, cat: 1, sat: 1, on: 1, mat: 1}.

With a vocabulary of V words, each document becomes a V-dimensional vector. Most entries are zero (sparse). A corpus of 10,000 emails with a 50,000-word vocabulary produces a [10,000 × 50,000] matrix.
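The counting step can be reproduced with the standard library alone (a minimal sketch; scikit-learn's CountVectorizer does the same with proper tokenization and a fitted vocabulary):

```python
from collections import Counter

# Bag-of-words by hand: lowercase, split on whitespace, count.
# Word order is discarded; only the counts survive.
doc = "The cat sat on the mat"
bow = Counter(doc.lower().split())
print(bow)  # → Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```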

TF-IDF (Term Frequency × Inverse Document Frequency) improves on raw counts by downweighting common words:

TF-IDF(t, d) = TF(t, d) × log(N / DF(t))

Words like “the” appear everywhere (high DF), so their IDF is low. Words like “viagra” appear in few documents (low DF), so their IDF is high — making them more discriminative for spam detection.
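The IDF arithmetic can be checked directly. A toy corpus of N = 4 documents with hypothetical document frequencies:

```python
import math

# IDF = log(N / DF): ubiquitous words score near zero, rare words score high
N = 4                        # documents in the toy corpus
df = {'the': 4, 'mat': 1}    # hypothetical document frequencies

for word, d in df.items():
    print(word, round(math.log(N / d), 2))
# → the 0.0
# → mat 1.39
```

Multiplying a word's term frequency by its IDF therefore zeroes out words that appear in every document and amplifies words concentrated in a few, which is exactly what a discriminative feature should look like.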
TF-IDF + Naive Bayes Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Raw text → TF-IDF → Naive Bayes
pipe = make_pipeline(
    TfidfVectorizer(
        max_features=10000,    # top 10K words
        stop_words='english',
        ngram_range=(1, 2),    # unigrams + bigrams
    ),
    MultinomialNB(alpha=0.1)
)

# Train on raw text strings
pipe.fit(train_texts, train_labels)
predictions = pipe.predict(test_texts)

# Example: spam classification
# "Congratulations! You won free money!"
# → TF-IDF: {congratulations: 0.4, won: 0.3,
#            free: 0.5, money: 0.5, ...}
# → P(spam) × P(free|spam) × P(money|spam) × ...
# → P(spam|text) = 0.97 → SPAM
Key insight: TF-IDF + Naive Bayes is the “hello world” of NLP. It’s been classifying spam since the late 1990s and still works remarkably well. The combination is fast (trains in seconds on millions of emails), interpretable (you can see which words drive the prediction), and requires no GPU. For many text classification tasks, it’s all you need.
Laplace Smoothing
Fixing the zero-frequency problem — what happens when a word never appeared in training?
The Zero Problem
Suppose the word “cryptocurrency” never appeared in any spam email during training. Then P(“cryptocurrency” | spam) = 0.

Since Naive Bayes multiplies all feature probabilities, one zero kills the entire product: P(spam | email) = 0, regardless of how many other spam-like words appear. A single unseen word vetoes the entire classification.

Laplace smoothing (additive smoothing) fixes this by adding a small count α to every feature:

P(word | class) = (count(word, class) + α) / (total_words_in_class + α × V)

where V is the vocabulary size. With α = 1, every word gets at least 1 “phantom” observation. This ensures no probability is ever exactly zero.

scikit-learn’s alpha parameter controls this. Default is 1.0. Smaller values (0.01–0.1) often work better in practice — tune via cross-validation.
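One way to do that tuning is a grid search over alpha (a sketch; the count matrix below is synthetic and stands in for real word counts):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative count data standing in for word counts
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 20))
y = (X[:, 0] + X[:, 1] > X[:, 2] + X[:, 3]).astype(int)

# Cross-validate over candidate smoothing strengths
grid = GridSearchCV(MultinomialNB(),
                    {'alpha': [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

On real text data the winning alpha is usually at the small end of the grid, consistent with the 0.01–0.1 range mentioned above.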
Smoothing Example
Without smoothing (α=0):
  Spam emails: 1000 total words
  "free" appears 50 times
  "crypto" appears 0 times

  P(free|spam)   = 50/1000 = 0.050
  P(crypto|spam) =  0/1000 = 0.000  ← kills everything!

  P(spam|email) ∝ ... × 0.050 × 0.000 × ... = 0
  (regardless of other words)

With Laplace smoothing (α=1, V=50000):
  P(free|spam)   = (50+1)/(1000+50000) = 51/51000 = 0.00100
  P(crypto|spam) = (0+1)/(1000+50000)  =  1/51000 = 0.00002
                                       ← small but not zero!

# Now the product survives.
# "crypto" contributes a tiny probability
# instead of vetoing the entire classification.
Key insight: Laplace smoothing is like giving every restaurant at least one review. Without it, a new restaurant with zero reviews gets a rating of 0/0 (undefined). With smoothing, it gets 1 phantom review, giving it a small but nonzero rating. This prevents a single missing data point from crashing the entire system.
Generative vs Discriminative Models
Two fundamentally different approaches to classification
Two Philosophies
Generative models (Naive Bayes) learn the full joint distribution P(x, y) = P(x|y) × P(y). They model how the data was generated for each class, then use Bayes’ theorem to classify.

Discriminative models (Logistic Regression, SVM, Trees) learn P(y|x) directly. They only care about the decision boundary between classes, not how the data was generated.

Analogy: To tell cats from dogs, a generative model learns “what does a typical cat look like?” and “what does a typical dog look like?” then compares. A discriminative model learns “what features distinguish cats from dogs?” directly.

Discriminative models are generally more accurate because they focus on what matters (the boundary). But generative models have advantages: they handle missing data naturally, work well with very small training sets, and can generate synthetic data.
Generative (Naive Bayes)
Models: P(x|y) and P(y)
Learns: how data is generated per class
Pros: fast training, handles missing data, works with tiny datasets, can generate samples
Cons: independence assumption, less accurate boundaries
Discriminative (LogReg)
Models: P(y|x) directly
Learns: the decision boundary
Pros: more accurate boundaries, fewer assumptions, better calibrated probabilities
Cons: needs more data, can’t handle missing features easily
The Ng & Jordan Result
Andrew Ng & Michael Jordan (2001) showed:

Small data (n < ~100):
  Naive Bayes often beats Logistic Regression
  # Fewer parameters → less overfitting

Large data (n > ~1000):
  Logistic Regression usually wins
  # More flexible → better boundary

Naive Bayes converges to its (biased)
asymptote in O(log n) samples.
Logistic Regression converges to the
true boundary in O(n) samples.

# NB learns fast but plateaus early.
# LogReg learns slower but reaches higher.
Key insight: Generative vs discriminative is like two ways to identify a painting. The generative approach studies Monet’s style deeply (how he paints water, light, brushstrokes) and Picasso’s style separately, then compares. The discriminative approach just learns “Monet uses soft edges, Picasso uses sharp angles” — the minimum needed to tell them apart.
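This trade-off can be observed directly by training both models at two training-set sizes (a sketch on synthetic data; the exact accuracies depend on the generated task, but the typical pattern is NB competitive at n = 50 and logistic regression ahead at n = 1000):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# One synthetic task, evaluated at two training-set sizes
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, random_state=0)

for n in (50, 1000):  # small vs large training set
    nb = GaussianNB().fit(X_tr[:n], y_tr[:n])
    lr = LogisticRegression(max_iter=1000).fit(X_tr[:n], y_tr[:n])
    print(n, round(nb.score(X_te, y_te), 3),
          round(lr.score(X_te, y_te), 3))
```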
When Naive Bayes Beats Complex Models
Small data, many features, real-time prediction — NB’s sweet spots
NB’s Advantages
1. Very small training sets: With 50–100 labeled examples, NB often outperforms logistic regression, SVMs, and trees. Its simple model has few parameters to estimate, so it doesn’t overfit.

2. Very high-dimensional data: Text with 50,000+ word features. NB handles this naturally because it estimates each feature independently. No matrix inversions, no curse of dimensionality.

3. Real-time prediction: NB prediction is a simple sum of log-probabilities — O(d) per sample. No matrix multiplications, no tree traversals. It can classify millions of documents per second.

4. Incremental learning: NB can update its model with new data without retraining from scratch (partial_fit() in scikit-learn). Useful for streaming data.

5. Multi-class is natural: NB handles K classes with no extra machinery — just compute P(y=k | x) for each class. No one-vs-rest needed.
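The incremental-learning point can be sketched with partial_fit on simulated mini-batches (the count data below is synthetic; note that the full set of classes must be declared up front):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Streaming updates: the model absorbs each batch without retraining
rng = np.random.default_rng(42)
clf = MultinomialNB(alpha=1.0)
classes = np.array([0, 1])

for _ in range(5):  # five simulated mini-batches of word counts
    X_batch = rng.integers(0, 10, size=(100, 50))
    y_batch = rng.integers(0, 2, size=100)
    clf.partial_fit(X_batch, y_batch, classes=classes)

print(clf.predict(rng.integers(0, 10, size=(1, 50))))
```

Each call only updates the per-class count tables, so the cost of absorbing a batch is proportional to the batch size, not to everything seen so far.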
NB’s Limitations
1. Probability calibration: NB probabilities are often poorly calibrated — it tends to push probabilities toward 0 and 1. The rankings are usually correct, but the actual probability values are unreliable. Use CalibratedClassifierCV if you need calibrated probabilities.

2. Feature correlations: Highly correlated features get double-counted. If “free” and “money” always appear together, NB treats them as independent evidence, overestimating their combined effect.

3. Continuous features: GaussianNB assumes normal distributions. If features are multimodal or heavily skewed, the Gaussian assumption fails. Consider discretizing features or using kernel density estimation.
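When calibrated probabilities matter, the CalibratedClassifierCV wrapper mentioned above can be applied directly (a sketch on synthetic data):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Recalibrate NB's overconfident probabilities with Platt (sigmoid) scaling
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
calibrated = CalibratedClassifierCV(GaussianNB(), method='sigmoid', cv=5)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X[:5])
print(proba.round(2))  # each row sums to 1
```

The wrapper leaves NB's class rankings largely intact but maps its raw scores through a learned sigmoid, pulling the extreme near-0 and near-1 values back toward honest probabilities.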
Key insight: Naive Bayes is the “good enough, fast enough” classifier. It won’t win a Kaggle competition, but it’ll give you a working baseline in minutes, handle millions of documents without breaking a sweat, and often be within 1–3% of more complex models. In production, “fast and 95% accurate” often beats “slow and 97% accurate.”
Complete NB Pipeline: 20 Newsgroups
Text classification on 20 news categories — end to end
20 Newsgroups Classification
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

# Load 4 categories for clarity
cats = ['sci.space', 'rec.sport.hockey',
        'comp.graphics', 'talk.politics.guns']
train = fetch_20newsgroups(subset='train', categories=cats)
test = fetch_20newsgroups(subset='test', categories=cats)

pipe = make_pipeline(
    TfidfVectorizer(max_features=20000, stop_words='english'),
    MultinomialNB(alpha=0.1)
)
pipe.fit(train.data, train.target)
y_pred = pipe.predict(test.data)

print(classification_report(
    test.target, y_pred,
    target_names=train.target_names
))
Expected Results
4-class newsgroup classification:

                      precision  recall    f1
  comp.graphics          0.93     0.88    0.91
  rec.sport.hockey       0.97     0.97    0.97
  sci.space              0.95     0.96    0.96
  talk.politics.guns     0.93     0.96    0.94

  accuracy                                0.95

Training time: < 1 second

Top words per class:
  sci.space:     nasa, orbit, shuttle, launch
  rec.hockey:    game, team, hockey, season
  comp.graphics: image, graphics, file, jpeg
  politics.guns: gun, firearms, weapons, amendment

# 95% accuracy in under 1 second.
# No GPU, no deep learning, no embeddings.
# TF-IDF + Naive Bayes: the unsung hero of NLP.
Key insight: TF-IDF + MultinomialNB achieves 95% accuracy on 4-class news classification in under 1 second. It’s the fastest path from raw text to working classifier. Start here, measure the accuracy, and only reach for BERT or transformers if you need that last 3–4%. For many production systems, Naive Bayes is the final model, not just the baseline.