How It Works
Naive Bayes applies Bayes' theorem with a "naive" assumption: all features (words) are conditionally independent given the class. This assumption is obviously wrong — "New" and "York" are highly correlated — but the model works remarkably well in practice.

For text, Multinomial Naive Bayes models word counts: P(class | document) is proportional to P(class) × ∏ P(word | class). Training is just counting: how often does each word appear in documents of each class? Prediction is fast: multiply the probabilities (in practice, sum their logs to avoid numeric underflow).

Naive Bayes excels with small training sets (as few as a few dozen examples), is extremely fast to train and predict, and provides a strong baseline that more complex models must beat. It's still the go-to first model for text classification prototypes.
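The whole pipeline above — count words per class, then score a document by summing log probabilities — fits in a few lines. This is a minimal sketch with a hypothetical four-document corpus; it adds add-one (Laplace) smoothing, an assumption beyond the plain counts described here, so unseen words don't zero out a class:

```python
from collections import Counter
from math import log

# Tiny illustrative corpus (hypothetical examples)
train = [
    ("spam", "free money now"),
    ("spam", "win money free"),
    ("ham",  "meeting at noon"),
    ("ham",  "lunch at noon"),
]

# Training is just counting: class priors and per-class word counts.
class_docs = Counter(label for label, _ in train)
word_counts = {"spam": Counter(), "ham": Counter()}
for label, doc in train:
    word_counts[label].update(doc.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(doc):
    scores = {}
    for label in class_docs:
        # log P(class) + sum of log P(word | class), with add-one smoothing
        # (an assumption here) so unseen words don't make a class impossible.
        total = sum(word_counts[label].values())
        score = log(class_docs[label] / len(train))
        for w in doc.split():
            score += log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("free money"))    # -> spam
print(predict("noon meeting"))  # -> ham
```

Working in log space replaces the product of many small probabilities with a sum, which is why prediction stays fast and numerically stable.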
Naive Bayes for Text
Bayes' theorem plus the independence assumption:
P(spam | "free money now") =
P("free"|spam) × P("money"|spam) ×
P("now"|spam) × P(spam)
÷ P("free money now")
Training = counting:
P("free" | spam) = count("free" in spam) / total words in spam
P("free" | ham) = count("free" in ham) / total words in ham
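Concretely, the estimate is just a ratio of counts. A quick sketch with two hypothetical spam documents:

```python
# Hypothetical spam training documents
spam_docs = ["free money now", "win money free"]
spam_words = " ".join(spam_docs).split()

# P("free" | spam) = count("free" in spam) / total words in spam
p_free_spam = spam_words.count("free") / len(spam_words)
print(p_free_spam)  # 2 occurrences / 6 words ≈ 0.333
```

Note that a word never seen in a class gets probability zero under this raw estimate, which is why implementations typically add smoothing.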
Strengths:
Works with tiny datasets (50+ examples)
Trains in milliseconds
Interpretable: which words drive predictions
Hard to overfit
Weaknesses:
Independence assumption is wrong
Can't capture word interactions
Sensitive to word frequency imbalance
Key insight: Naive Bayes is the baseline that refuses to die. Its independence assumption is wrong, but the ranking of class probabilities is often correct. Always start with Naive Bayes — if a complex model can't beat it, the problem is your data, not your model.