Why Not MSE?
For linear regression, MSE creates a smooth bowl with one minimum. But pushed through the sigmoid, MSE becomes non-convex — flat plateaus and regions of wrong-way curvature where gradient descent can stall or get stuck.
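A quick numeric sketch of this claim, under a toy setup (one sample with x = 1, y = 1, no bias, so p = σ(w)): the second finite difference of a convex function is positive everywhere, but for MSE-through-sigmoid it goes negative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def curvature(f, w, h=1e-3):
    """Second finite difference: positive everywhere for a convex function."""
    return (f(w + h) - 2.0 * f(w) + f(w - h)) / h**2

# Toy setup: one sample, x = 1, y = 1, no bias term, so p = sigmoid(w).
mse = lambda w: (sigmoid(w) - 1.0) ** 2   # squared error through the sigmoid
bce = lambda w: -math.log(sigmoid(w))     # log loss for y = 1

print(curvature(mse, -3.0))  # negative: MSE curves the wrong way here
print(curvature(bce, -3.0))  # positive: log loss stays convex
```

At w = −3 the model is confidently wrong, and MSE's curvature is negative there — gradient descent sees a flattening, non-convex surface — while the log loss curvature stays positive at every w.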
Binary cross-entropy (log loss) fixes this. For a single sample:
L = −[y·log(p) + (1−y)·log(1−p)]
where y ∈ {0, 1} is the true label and p = σ(wᵀx + b) is the predicted probability.
When y = 1: L = −log(p). If p = 0.99, loss = 0.01 (great). If p = 0.01, loss = 4.6 (terrible). The log makes the penalty grow without bound as a confident prediction approaches the wrong answer.
When y = 0: L = −log(1−p). Same logic, mirrored.
Over all n samples: L = −(1/n) ∑ [yᵢ·log(pᵢ) + (1−yᵢ)·log(1−pᵢ)]
This loss is convex with the sigmoid, guaranteeing a single global minimum.
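The averaged formula translates directly into code. A minimal sketch (the `bce_loss` name and the eps-clipping guard are my additions — clipping avoids log(0) when a prediction saturates):

```python
import math

def bce_loss(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy: -(1/n) * sum of y*log(p) + (1-y)*log(1-p)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so log never sees 0 or 1 exactly
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

print(round(bce_loss([1, 0], [0.99, 0.01]), 4))  # ≈ 0.0101, confident and right
print(round(bce_loss([1, 0], [0.01, 0.99]), 4))  # ≈ 4.6052, confident and wrong
```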
Log Loss Behavior
True label y=1 (actual positive):
Predict p=0.99 → loss = -log(0.99) = 0.01 ✓
Predict p=0.80 → loss = -log(0.80) = 0.22
Predict p=0.50 → loss = -log(0.50) = 0.69
Predict p=0.10 → loss = -log(0.10) = 2.30 ✗
Predict p=0.01 → loss = -log(0.01) = 4.61 ✗✗
True label y=0 (actual negative):
Predict p=0.01 → loss = -log(0.99) = 0.01 ✓
Predict p=0.50 → loss = -log(0.50) = 0.69
Predict p=0.99 → loss = -log(0.01) = 4.61 ✗✗
The asymmetry is the point: being 99% confident and wrong costs about 460x more than being 99% confident and right. This forces the model to be well-calibrated.
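That multiplier can be checked directly from the table's two endpoint values (it comes out just under 460):

```python
import math

right = -math.log(0.99)  # 99% confident and correct
wrong = -math.log(0.01)  # 99% confident and wrong (y = 1, p = 0.01)
print(wrong / right)     # roughly 460x
```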
Key insight: Log loss is like a lie detector for confidence. If you say “I’m 99% sure this is spam” and it’s not, you get hammered. If you say “I’m 51% sure,” the penalty is mild. This forces the model to only be confident when it has strong evidence — producing well-calibrated probabilities, not just correct labels.