Ch 5 — Support Vector Machines

Finding the widest street between classes — margins, kernels, and the dual problem
Maximum Margin Classifier
The widest street analogy — find the boundary with the most breathing room
The Widest Street
Logistic regression finds a boundary between classes. SVMs find the best boundary — the one with the maximum margin.

Imagine two groups of houses on opposite sides of a street. Many streets could separate them, but the SVM picks the widest street possible. The wider the street, the more confident we are that new houses will be correctly classified.

Mathematically, the margin is the distance from the decision boundary to the nearest data point on either side. The SVM maximizes this distance:

maximize 2/||w|| subject to yᵢ(wᵀxᵢ + b) ≥ 1 for all i

The constraint says every point must be on the correct side of the margin. The objective says make the margin as wide as possible. This is a constrained optimization problem — specifically, a convex quadratic program with a unique global solution.
Margin Geometry
Decision boundary:  wᵀx + b = 0
Positive margin:    wᵀx + b = +1
Negative margin:    wᵀx + b = −1
Margin width:       2 / ||w||

Optimization:
  maximize 2/||w||  ≡  minimize ½||w||²  (easier to optimize)
  subject to: yᵢ(wᵀxᵢ + b) ≥ 1 ∀i

Example (2D):
  w = [2, 1], b = −3
  ||w|| = √(4+1) ≈ 2.24
  Margin width = 2/2.24 ≈ 0.89

  vs. w = [1, 0.5], b = −1.5 (same direction)
  ||w|| ≈ 1.12
  Margin width = 2/1.12 ≈ 1.79  ← wider! ✓
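The margin widths in the example above can be checked with a few lines of NumPy (a quick sketch; the two weight vectors are the illustrative values from the example, not a fitted model):

```python
import numpy as np

def margin_width(w):
    """Margin width of a linear SVM boundary with weights w: 2 / ||w||."""
    return 2.0 / np.linalg.norm(w)

w1 = np.array([2.0, 1.0])   # ||w|| = sqrt(5) ~ 2.24
w2 = np.array([1.0, 0.5])   # same direction, half the magnitude

print(f"w1 margin: {margin_width(w1):.2f}")  # ~0.89
print(f"w2 margin: {margin_width(w2):.2f}")  # ~1.79 (wider)
```

Scaling w down (while rescaling b to keep the same boundary direction) widens the margin, which is exactly why the optimizer minimizes ½||w||².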
Key insight: The SVM is like a city planner designing the widest possible boulevard between two neighborhoods. A narrow alley (small margin) means any slight shift puts you in the wrong neighborhood. A wide boulevard (large margin) gives plenty of buffer. The SVM finds the widest boulevard that still separates everyone correctly.
Support Vectors: The Critical Points
Only the points on the margin matter — everything else is irrelevant
What Are Support Vectors?
Support vectors are the data points that sit exactly on the margin boundary (where yᵢ(wᵀxᵢ + b) = 1). They are the closest points to the decision boundary from each class.

The remarkable property: the decision boundary depends only on the support vectors. You could remove every other data point and the boundary wouldn’t change. In a dataset of 10,000 points, there might be only 50–200 support vectors.

This makes SVMs memory-efficient at prediction time — you only need to store the support vectors. It also means SVMs are robust to outliers that are far from the boundary (they don’t affect the solution).

However, SVMs are sensitive to points near the boundary. Moving a single support vector can completely change the decision boundary. This is the opposite of logistic regression, where every point contributes to the solution.
Support Vectors in Practice
from sklearn.svm import SVC

svm = SVC(kernel='linear', C=1.0)
svm.fit(X_train, y_train)

# How many support vectors?
print(f"Total samples: {len(X_train)}")
print(f"Support vectors: {len(svm.support_vectors_)}")
print(f"Per class: {svm.n_support_}")

# Typical output:
# Total samples: 455
# Support vectors: 87 (19% of data)
# Per class: [42, 45]

# The boundary is defined by just 87 points.
# Remove the other 368 → same boundary.
# Move one support vector → boundary shifts.

# Weights and intercept:
svm.coef_       # w vector (linear kernel only)
svm.intercept_  # b scalar
Key insight: Support vectors are like the load-bearing walls of a building. Remove any non-load-bearing wall and the building stands. Remove a load-bearing wall and it collapses. The SVM identifies which data points are structurally critical to the decision boundary and ignores the rest.
The Dual Problem & Lagrange Multipliers
Reformulating the optimization to unlock the kernel trick
Why the Dual?
The primal problem minimizes ½||w||² subject to constraints. Using Lagrange multipliers αᵢ ≥ 0 (one per constraint), we form the Lagrangian:

L = ½||w||² − ∑ αᵢ[yᵢ(wᵀxᵢ + b) − 1]

Taking derivatives and substituting gives the dual problem:

maximize ∑αᵢ − ½∑∑ αᵢαⱼ yᵢyⱼ (xᵢᵀxⱼ)
subject to αᵢ ≥ 0 and ∑αᵢyᵢ = 0

The dual depends on data only through dot products xᵢᵀxⱼ. This is the key insight that enables the kernel trick.

At the solution, αᵢ > 0 only for support vectors. All other αᵢ = 0. The decision function becomes:

f(x) = sign(∑ αᵢ yᵢ (xᵢᵀx) + b)

— a weighted sum over only the support vectors.
The Math at a Glance
Primal:
  min ½||w||²  s.t.  yᵢ(wᵀxᵢ + b) ≥ 1

Lagrangian:
  L = ½||w||² − Σ αᵢ[yᵢ(wᵀxᵢ + b) − 1]

KKT conditions give:
  w = Σ αᵢyᵢxᵢ   (w is a linear combo of data)
  Σ αᵢyᵢ = 0

Dual:
  max Σαᵢ − ½ΣΣ αᵢαⱼyᵢyⱼ(xᵢᵀxⱼ)
  s.t. αᵢ ≥ 0, Σαᵢyᵢ = 0

Prediction:
  f(x) = sign(Σ αᵢyᵢ(xᵢᵀx) + b)
                    ↑
  Only dot products matter!
  Replace with K(xᵢ, x) → kernel trick
Key insight: The dual reformulation reveals that the SVM only cares about dot products between data points, never the raw coordinates. This is like a judge who evaluates contestants only by how similar they are to each other, not by their individual attributes. Replacing dot products with kernel functions lets us operate in infinite-dimensional spaces without ever computing coordinates there.
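The identity w = Σ αᵢyᵢxᵢ can be verified numerically: a fitted scikit-learn SVC exposes the products αᵢyᵢ for its support vectors as dual_coef_. A sketch on a synthetic, linearly separable dataset (the blobs data is illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters → a linearly separable binary problem
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

svm = SVC(kernel='linear', C=1.0)
svm.fit(X, y)

# dual_coef_ holds alpha_i * y_i for each support vector, so summing
# alpha_i y_i x_i over the support vectors reconstructs the primal w.
w_dual = svm.dual_coef_ @ svm.support_vectors_

print(np.allclose(w_dual, svm.coef_))  # True: dual and primal agree
```

Note that only the support vectors enter the sum: every point with αᵢ = 0 is absent from dual_coef_ and contributes nothing to w.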
Soft Margin: The C Parameter
Real data isn’t perfectly separable — allow some mistakes
Why Soft Margins?
The hard-margin SVM requires perfect separation: every point on the correct side. But real data often has overlapping classes or outliers that make perfect separation impossible (or undesirable).

The soft-margin SVM introduces slack variables ξᵢ that allow points to violate the margin or even be on the wrong side:

minimize ½||w||² + C · ∑ξᵢ
subject to yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0

The C parameter controls the trade-off:

Large C (e.g., 1000): “Misclassifications are very expensive.” The SVM tries hard to classify every training point correctly, even if the margin is narrow. Risk: overfitting.

Small C (e.g., 0.01): “A wide margin is more important than getting every point right.” The SVM tolerates more misclassifications for a wider, more generalizable margin. Risk: underfitting.
C Parameter Effects
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

for C in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    svm = SVC(kernel='rbf', C=C)
    scores = cross_val_score(svm, X, y, cv=5)
    print(f"C={C:8.3f}  acc={scores.mean():.3f}")

# Typical results:
# C=   0.001  acc=0.640  (too much slack, underfit)
# C=   0.010  acc=0.820
# C=   0.100  acc=0.910
# C=   1.000  acc=0.950  ← sweet spot
# C=  10.000  acc=0.945
# C= 100.000  acc=0.935  (too rigid, overfit)
# C=1000.000  acc=0.920

# Slack variable interpretation:
#   ξᵢ = 0:      correctly classified, outside margin
#   0 < ξᵢ < 1:  inside margin but correct side
#   ξᵢ = 1:      exactly on the boundary
#   ξᵢ > 1:      misclassified (wrong side)
Key insight: C is like a strictness dial for a teacher. C = 1000 is a teacher who fails you for one wrong answer (narrow margin, overfits to noise). C = 0.01 is a teacher who passes everyone (wide margin, ignores real patterns). The best C is the teacher who’s strict enough to maintain standards but flexible enough to forgive honest mistakes.
The Kernel Trick
Mapping to higher dimensions without computing them — the magic of SVMs
The Problem and the Trick
Some data isn’t linearly separable in its original space. Imagine two concentric circles: inner circle = class A, outer ring = class B. No straight line can separate them.

Solution: Map the data to a higher-dimensional space where a linear separator exists. For the circles, adding a feature z = x² + y² lifts the inner circle above the outer ring, making them linearly separable in 3D.

The kernel trick exploits the fact that the SVM dual only needs dot products xᵢᵀxⱼ. Instead of explicitly computing the high-dimensional mapping φ(x), we define a kernel function K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ) that computes the dot product in the high-dimensional space directly from the original coordinates.

The RBF kernel implicitly maps to an infinite-dimensional space, yet computing K(xᵢ, xⱼ) costs only O(d) — the same as a regular dot product. This is why it’s called a “trick.”
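The “dot product in feature space without visiting it” claim can be checked directly for the degree-2 polynomial kernel K(x, z) = (xᵀz)², whose explicit map in 2D is φ(x) = (x₁², √2·x₁x₂, x₂²). A minimal sketch with hand-picked vectors:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2D input."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: computed entirely in the original 2D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))   # dot product in the 3D feature space
implicit = poly_kernel(x, z)        # O(d) computation in 2D

print(np.isclose(explicit, implicit))  # True
```

Here both routes give 16: the kernel never materializes the 3D coordinates, yet returns exactly the feature-space dot product. The RBF kernel plays the same game with an infinite-dimensional φ.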
Common Kernels
Linear: K(x, z) = xᵀz No mapping. Same as logistic regression boundary. Use when: data is linearly separable, d >> n. Polynomial: K(x, z) = (γ·xᵀz + r)^d Maps to all monomials up to degree d. degree=2: captures x₁², x₁x₂, x₂² Use when: polynomial relationships exist. RBF (Gaussian): K(x, z) = exp(-γ||x - z||²) Maps to infinite-dimensional space. γ controls influence radius of each point. Default and most versatile kernel. Sigmoid: K(x, z) = tanh(γ·xᵀz + r) Similar to neural network activation. Rarely used in practice. # scikit-learn default: kernel='rbf' # RBF works well for most nonlinear problems.
Key insight: The kernel trick is like having X-ray vision. In 2D, two groups might be hopelessly intertwined. But the kernel lets you “see” them in a higher dimension where they’re clearly separated — without actually computing those higher dimensions. It’s the mathematical equivalent of tilting a picture to see a hidden 3D image.
RBF Kernel: Gamma and C Tuning
The two knobs that control the RBF SVM’s behavior
What Gamma Controls
The RBF kernel K(x, z) = exp(−γ||x − z||²) has a parameter γ that controls the influence radius of each support vector:

Small γ (e.g., 0.001): Each support vector influences a wide area. The decision boundary is smooth and simple. Risk: underfitting (too smooth to capture patterns).

Large γ (e.g., 100): Each support vector influences only its immediate neighbors. The boundary becomes jagged, wrapping tightly around each point. Risk: overfitting (memorizing individual points).

scikit-learn default: gamma='scale' = 1 / (n_features × X.var()), which adapts to the data scale.

C and γ interact: Large C + large γ = severe overfitting. Small C + small γ = severe underfitting. Use GridSearchCV to find the optimal combination.
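The “influence radius” effect can be seen by evaluating the RBF kernel at increasing distances for a small and a large γ (a quick sketch with illustrative γ values):

```python
import numpy as np

def rbf(dist, gamma):
    """RBF kernel value as a function of the distance ||x - z||."""
    return np.exp(-gamma * dist**2)

for dist in [0.5, 1.0, 2.0]:
    print(f"d={dist}: gamma=0.01 -> {rbf(dist, 0.01):.3f}, "
          f"gamma=10 -> {rbf(dist, 10):.5f}")

# Small gamma: similarity decays slowly → each support vector
# influences a wide area (smooth boundary).
# Large gamma: similarity collapses within a short distance →
# each support vector only matters locally (jagged boundary).
```

At γ = 0.01 a point two units away still looks very similar (K ≈ 0.96); at γ = 10 even a point one unit away is effectively a stranger (K ≈ 5 × 10⁻⁵).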
Grid Search for C and Gamma
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(), SVC())

param_grid = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': [0.001, 0.01, 0.1, 1],
    'svc__kernel': ['rbf'],
}

grid = GridSearchCV(pipe, param_grid, cv=5,
                    scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

print(f"Best: {grid.best_params_}")
print(f"Best CV acc: {grid.best_score_:.3f}")
print(f"Test acc: {grid.score(X_test, y_test):.3f}")

# Typical best: C=10, gamma=0.01
# Always scale features first! SVMs are
# sensitive to feature magnitudes.
Key insight: Gamma is like the zoom level on a microscope. Low gamma sees the big picture (smooth boundary). High gamma zooms into individual cells (jagged boundary). C is how much you penalize errors. The best SVM comes from finding the right zoom level and the right strictness — always via cross-validated grid search.
SVM for Regression (SVR)
Fitting a tube around the data instead of a margin between classes
From Classification to Regression
Support Vector Regression (SVR) flips the SVM idea: instead of maximizing the margin between classes, it fits a tube of width ε around the data and tries to fit as many points inside the tube as possible.

Points inside the tube contribute zero loss. Points outside the tube are penalized linearly by their distance from the tube edge. This is the ε-insensitive loss:

L = max(0, |y − f(x)| − ε)

The ε parameter controls tube width. Larger ε = more points inside the tube = simpler model. Smaller ε = tighter fit = more support vectors.
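The ε-insensitive loss is easy to sketch in NumPy (illustrative values; ε = 0.1 is just an example tube width):

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """L = max(0, |y - f(x)| - eps): zero inside the tube, linear outside."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

y_true = np.array([1.0, 1.0, 1.0])
y_pred = np.array([1.05, 1.10, 1.50])  # errors: 0.05, 0.10, 0.50

# First two predictions fall inside the tube (zero loss);
# the third pays only for the distance beyond the tube edge (0.4).
print(eps_insensitive_loss(y_true, y_pred))
```

Compare with squared error, which would penalize all three points: the flat zero region is what makes the SVR solution sparse, since in-tube points get αᵢ = 0.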

SVR inherits all the SVM benefits: kernel trick for nonlinear regression, sparse solution (only support vectors matter), and regularization via C.
SVR in scikit-learn
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

svr = make_pipeline(
    StandardScaler(),
    SVR(kernel='rbf', C=100, epsilon=0.1)
)
svr.fit(X_train, y_train)

y_pred = svr.predict(X_test)
print(f"R²: {r2_score(y_test, y_pred):.3f}")

# Parameters:
#   C:       regularization (same as SVC)
#   epsilon: tube width (ε-insensitive zone)
#   kernel:  rbf, linear, poly (same as SVC)

# SVR is powerful for small-medium datasets
# with nonlinear relationships. For large
# datasets (>10K), use tree-based regressors
# — SVR's O(n²) training is too slow.
Key insight: SVR is like drawing a tube around a roller coaster track. The tube width ε defines how much wobble you tolerate. Points inside the tube are “close enough” and contribute zero error. Only the points poking outside the tube (support vectors) shape the model. This makes SVR robust to small noise while fitting the overall trend.
Complete SVM Pipeline
Kernel comparison on real data with grid search tuning
Kernel Comparison: Digits Dataset
from sklearn.datasets import load_digits
from sklearn.model_selection import (
    train_test_split, cross_val_score
)
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

# 1,797 handwritten digits, 64 features (8x8 px)
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

kernels = {
    "Linear":     SVC(kernel='linear', C=1),
    "Poly (d=3)": SVC(kernel='poly', degree=3, C=1),
    "RBF":        SVC(kernel='rbf', C=10, gamma=0.001),
}

for name, svm in kernels.items():
    pipe = make_pipeline(StandardScaler(), svm)
    cv = cross_val_score(pipe, X_tr, y_tr, cv=5)
    pipe.fit(X_tr, y_tr)   # also fits the SVC inside the pipeline
    test_acc = pipe.score(X_te, y_te)
    print(f"{name:12s} CV={cv.mean():.3f} "
          f"Test={test_acc:.3f} "
          f"SVs={svm.n_support_.sum()}")
Expected Results
Digits (1,797 samples, 64 features, 10 classes):

  Linear      CV=0.977  Test=0.978  SVs=280
  Poly (d=3)  CV=0.990  Test=0.989  SVs=350
  RBF         CV=0.991  Test=0.992  SVs=420

Key takeaways:
• RBF wins: 99.2% on handwritten digits
• Linear is surprisingly good (97.8%)
• More SVs = more complex boundary
• Always scale features (StandardScaler)
• Grid search C and gamma for best results

When to use SVMs:
✓ Small-medium data (<10K samples)
✓ High-dimensional features (text, images)
✓ Clear margin of separation exists
✗ Large data (>50K) — too slow (O(n²)–O(n³))
✗ Need probability estimates (SVM gives scores)
✗ Need interpretability (kernels are opaque)
Key insight: SVMs achieve 99.2% accuracy on digit recognition with just 1,797 training samples — no deep learning needed. They excel when the number of features is large relative to samples (the “d >> n” regime). But for datasets over ~10K samples, SVMs’ O(n²) training time makes them impractical. Use tree-based methods or neural networks for large-scale problems.