Ch 5 — Support Vector Machines

Finding the widest street between classes — margins, kernels, and the dual problem
Maximum Margin Classifier
The widest street analogy — find the boundary with the most breathing room
The Widest Street
Logistic regression finds a boundary between classes. SVMs find the best boundary — the one with the maximum margin.

Imagine two groups of houses on opposite sides of a street. Many streets could separate them, but the SVM picks the widest street possible. The wider the street, the more confident we are that new houses will be correctly classified.

Mathematically, the margin is the distance from the decision boundary to the nearest data point on either side. The SVM maximizes this distance:

maximize 2/||w|| subject to yᵢ(wᵀxᵢ + b) ≥ 1 for all i

The constraint says every point must be on the correct side of the margin. The objective says make the margin as wide as possible. This is a constrained optimization problem — specifically, a convex quadratic program with a unique global solution.
Margin Geometry
Decision boundary:  wᵀx + b = 0
Positive margin:    wᵀx + b = +1
Negative margin:    wᵀx + b = −1
Margin width:       2 / ||w||

Optimization:
  maximize 2/||w||  ≡  minimize ½||w||²  (easier to optimize)
  subject to: yᵢ(wᵀxᵢ + b) ≥ 1 ∀i

Example (2D):
  w = [2, 1], b = −3
  ||w|| = √(4+1) ≈ 2.24
  Margin width = 2/2.24 ≈ 0.89

  vs. w = [1, 0.5], b = −1.5 (same direction)
  ||w|| ≈ 1.12
  Margin width = 2/1.12 ≈ 1.79  ← wider! ✓
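The margin widths in the example above can be checked with a few lines of NumPy (a quick sketch; the two weight vectors are the illustrative values from the example, not a fitted model):

```python
import numpy as np

def margin_width(w):
    """Margin width of a linear SVM boundary with weights w: 2 / ||w||."""
    return 2.0 / np.linalg.norm(w)

w1 = np.array([2.0, 1.0])   # ||w|| = sqrt(5) ~ 2.24
w2 = np.array([1.0, 0.5])   # same direction, half the magnitude

print(f"w1 margin: {margin_width(w1):.2f}")  # ~0.89
print(f"w2 margin: {margin_width(w2):.2f}")  # ~1.79 (wider)
```

Scaling w down (while rescaling b to keep the same boundary direction) widens the margin, which is exactly why the optimizer minimizes ½||w||².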
Key insight: The SVM is like a city planner designing the widest possible boulevard between two neighborhoods. A narrow alley (small margin) means any slight shift puts you in the wrong neighborhood. A wide boulevard (large margin) gives plenty of buffer. The SVM finds the widest boulevard that still separates everyone correctly.
Support Vectors: The Critical Points
Only the points on the margin matter — everything else is irrelevant
What Are Support Vectors?
Support vectors are the data points that sit exactly on the margin boundary (where yᵢ(wᵀxᵢ + b) = 1). They are the closest points to the decision boundary from each class.

The remarkable property: the decision boundary depends only on the support vectors. You could remove every other data point and the boundary wouldn’t change. In a dataset of 10,000 points, there might be only 50–200 support vectors.

This makes SVMs memory-efficient at prediction time — you only need to store the support vectors. It also means SVMs are robust to outliers that are far from the boundary (they don’t affect the solution).

However, SVMs are sensitive to points near the boundary. Moving a single support vector can completely change the decision boundary. This is the opposite of logistic regression, where every point contributes to the solution.
Support Vectors in Practice
from sklearn.svm import SVC

svm = SVC(kernel='linear', C=1.0)
svm.fit(X_train, y_train)

# How many support vectors?
print(f"Total samples: {len(X_train)}")
print(f"Support vectors: {len(svm.support_vectors_)}")
print(f"Per class: {svm.n_support_}")

# Typical output:
# Total samples: 455
# Support vectors: 87 (19% of data)
# Per class: [42, 45]

# The boundary is defined by just 87 points.
# Remove the other 368 → same boundary.
# Move one support vector → boundary shifts.

# Weights and intercept:
svm.coef_       # w vector (linear kernel only)
svm.intercept_  # b scalar
Key insight: Support vectors are like the load-bearing walls of a building. Remove any non-load-bearing wall and the building stands. Remove a load-bearing wall and it collapses. The SVM identifies which data points are structurally critical to the decision boundary and ignores the rest.
The Dual Problem & Lagrange Multipliers
Reformulating the optimization to unlock the kernel trick
Why the Dual?
The primal problem minimizes ½||w||² subject to constraints. Using Lagrange multipliers αᵢ ≥ 0 (one per constraint), we form the Lagrangian:

L = ½||w||² − ∑ αᵢ[yᵢ(wᵀxᵢ + b) − 1]

Taking derivatives and substituting gives the dual problem:

maximize ∑αᵢ − ½∑∑ αᵢαⱼ yᵢyⱼ (xᵢᵀxⱼ)
subject to αᵢ ≥ 0 and ∑αᵢyᵢ = 0

The dual depends on data only through dot products xᵢᵀxⱼ. This is the key insight that enables the kernel trick.

At the solution, αᵢ > 0 only for support vectors. All other αᵢ = 0. The decision function becomes:

f(x) = sign(∑ αᵢ yᵢ (xᵢᵀx) + b)

— a weighted sum over only the support vectors.
The Math at a Glance
Primal:
  min ½||w||²  s.t.  yᵢ(wᵀxᵢ + b) ≥ 1

Lagrangian:
  L = ½||w||² − Σ αᵢ[yᵢ(wᵀxᵢ + b) − 1]

KKT conditions give:
  w = Σ αᵢyᵢxᵢ   (w is a linear combo of data)
  Σ αᵢyᵢ = 0

Dual:
  max Σαᵢ − ½ΣΣ αᵢαⱼyᵢyⱼ(xᵢᵀxⱼ)
  s.t. αᵢ ≥ 0, Σαᵢyᵢ = 0

Prediction:
  f(x) = sign(Σ αᵢyᵢ(xᵢᵀx) + b)
                    ↑
  Only dot products matter!
  Replace with K(xᵢ, x) → kernel trick
Key insight: The dual reformulation reveals that the SVM only cares about dot products between data points, never the raw coordinates. This is like a judge who evaluates contestants only by how similar they are to each other, not by their individual attributes. Replacing dot products with kernel functions lets us operate in infinite-dimensional spaces without ever computing coordinates there.
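The identity w = Σ αᵢyᵢxᵢ can be verified numerically: a fitted scikit-learn SVC exposes the products αᵢyᵢ for its support vectors as dual_coef_. A sketch on a synthetic, linearly separable dataset (the blobs data is illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters → a linearly separable binary problem
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

svm = SVC(kernel='linear', C=1.0)
svm.fit(X, y)

# dual_coef_ holds alpha_i * y_i for each support vector, so summing
# alpha_i y_i x_i over the support vectors reconstructs the primal w.
w_dual = svm.dual_coef_ @ svm.support_vectors_

print(np.allclose(w_dual, svm.coef_))  # True: dual and primal agree
```

Note that only the support vectors enter the sum: every point with αᵢ = 0 is absent from dual_coef_ and contributes nothing to w.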
Soft Margin: The C Parameter
Real data isn’t perfectly separable — allow some mistakes
Why Soft Margins?
The hard-margin SVM requires perfect separation: every point on the correct side. But real data often has overlapping classes or outliers that make perfect separation impossible (or undesirable).

The soft-margin SVM introduces slack variables ξᵢ that allow points to violate the margin or even be on the wrong side:

minimize ½||w||² + C · ∑ξᵢ
subject to yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0

The C parameter controls the trade-off:

Large C (e.g., 1000): “Misclassifications are very expensive.” The SVM tries hard to classify every training point correctly, even if the margin is narrow. Risk: overfitting.

Small C (e.g., 0.01): “A wide margin is more important than getting every point right.” The SVM tolerates more misclassifications for a wider, more generalizable margin. Risk: underfitting.
C Parameter Effects
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

for C in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    svm = SVC(kernel='rbf', C=C)
    scores = cross_val_score(svm, X, y, cv=5)
    print(f"C={C:8.3f}  acc={scores.mean():.3f}")

# Typical results:
# C=   0.001  acc=0.640  (too much slack, underfit)
# C=   0.010  acc=0.820
# C=   0.100  acc=0.910
# C=   1.000  acc=0.950  ← sweet spot
# C=  10.000  acc=0.945
# C= 100.000  acc=0.935  (too rigid, overfit)
# C=1000.000  acc=0.920

# Slack variable interpretation:
#   ξᵢ = 0:      correctly classified, outside margin
#   0 < ξᵢ < 1:  inside margin but correct side
#   ξᵢ = 1:      exactly on the boundary
#   ξᵢ > 1:      misclassified (wrong side)
Key insight: C is like a strictness dial for a teacher. C = 1000 is a teacher who fails you for one wrong answer (narrow margin, overfits to noise). C = 0.01 is a teacher who passes everyone (wide margin, ignores real patterns). The best C is the teacher who’s strict enough to maintain standards but flexible enough to forgive honest mistakes.
The Kernel Trick
Mapping to higher dimensions without computing them — the magic of SVMs
The Problem and the Trick
Some data isn’t linearly separable in its original space. Imagine two concentric circles: inner circle = class A, outer ring = class B. No straight line can separate them.

Solution: Map the data to a higher-dimensional space where a linear separator exists. For the circles, adding a feature z = x² + y² lifts the inner circle above the outer ring, making them linearly separable in 3D.

The kernel trick exploits the fact that the SVM dual only needs dot products xᵢᵀxⱼ. Instead of explicitly computing the high-dimensional mapping φ(x), we define a kernel function K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ) that computes the dot product in the high-dimensional space directly from the original coordinates.

The RBF kernel implicitly maps to an infinite-dimensional space, yet computing K(xᵢ, xⱼ) costs only O(d) — the same as a regular dot product. This is why it’s called a “trick.”
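The “dot product in feature space without visiting it” claim can be checked directly for the degree-2 polynomial kernel K(x, z) = (xᵀz)², whose explicit map in 2D is φ(x) = (x₁², √2·x₁x₂, x₂²). A minimal sketch with hand-picked vectors:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2D input."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: computed entirely in the original 2D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))   # dot product in the 3D feature space
implicit = poly_kernel(x, z)        # O(d) computation in 2D

print(np.isclose(explicit, implicit))  # True
```

Here both routes give 16: the kernel never materializes the 3D coordinates, yet returns exactly the feature-space dot product. The RBF kernel plays the same game with an infinite-dimensional φ.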
Common Kernels
Linear: K(x, z) = xᵀz No mapping. Same as logistic regression boundary. Use when: data is linearly separable, d >> n. Polynomial: K(x, z) = (γ·xᵀz + r)^d Maps to all monomials up to degree d. degree=2: captures x₁², x₁x₂, x₂² Use when: polynomial relationships exist. RBF (Gaussian): K(x, z) = exp(-γ||x - z||²) Maps to infinite-dimensional space. γ controls influence radius of each point. Default and most versatile kernel. Sigmoid: K(x, z) = tanh(γ·xᵀz + r) Similar to neural network activation. Rarely used in practice. # scikit-learn default: kernel='rbf' # RBF works well for most nonlinear problems.
Key insight: The kernel trick is like having X-ray vision. In 2D, two groups might be hopelessly intertwined. But the kernel lets you “see” them in a higher dimension where they’re clearly separated — without actually computing those higher dimensions. It’s the mathematical equivalent of tilting a picture to see a hidden 3D image.
RBF Kernel: Gamma and C Tuning
The two knobs that control the RBF SVM’s behavior
What Gamma Controls
The RBF kernel K(x, z) = exp(−γ||x − z||²) has a parameter γ that controls the influence radius of each support vector:

Small γ (e.g., 0.001): Each support vector influences a wide area. The decision boundary is smooth and simple. Risk: underfitting (too smooth to capture patterns).

Large γ (e.g., 100): Each support vector influences only its immediate neighbors. The boundary becomes jagged, wrapping tightly around each point. Risk: overfitting (memorizing individual points).

scikit-learn default: gamma='scale' = 1 / (n_features × X.var()), which adapts to the data scale.

C and γ interact: Large C + large γ = severe overfitting. Small C + small γ = severe underfitting. Use GridSearchCV to find the optimal combination.
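The “influence radius” effect can be seen by evaluating the RBF kernel at increasing distances for a small and a large γ (a quick sketch with illustrative γ values):

```python
import numpy as np

def rbf(dist, gamma):
    """RBF kernel value as a function of the distance ||x - z||."""
    return np.exp(-gamma * dist**2)

for dist in [0.5, 1.0, 2.0]:
    print(f"d={dist}: gamma=0.01 -> {rbf(dist, 0.01):.3f}, "
          f"gamma=10 -> {rbf(dist, 10):.5f}")

# Small gamma: similarity decays slowly → each support vector
# influences a wide area (smooth boundary).
# Large gamma: similarity collapses within a short distance →
# each support vector only matters locally (jagged boundary).
```

At γ = 0.01 a point two units away still looks very similar (K ≈ 0.96); at γ = 10 even a point one unit away is effectively a stranger (K ≈ 5 × 10⁻⁵).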
Grid Search for C and Gamma
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(), SVC())

param_grid = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': [0.001, 0.01, 0.1, 1],
    'svc__kernel': ['rbf'],
}

grid = GridSearchCV(pipe, param_grid, cv=5,
                    scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

print(f"Best: {grid.best_params_}")
print(f"Best CV acc: {grid.best_score_:.3f}")
print(f"Test acc: {grid.score(X_test, y_test):.3f}")

# Typical best: C=10, gamma=0.01
# Always scale features first! SVMs are
# sensitive to feature magnitudes.
Key insight: Gamma is like the zoom level on a microscope. Low gamma sees the big picture (smooth boundary). High gamma zooms into individual cells (jagged boundary). C is how much you penalize errors. The best SVM comes from finding the right zoom level and the right strictness — always via cross-validated grid search.
SVM for Regression (SVR)
Fitting a tube around the data instead of a margin between classes
From Classification to Regression
Support Vector Regression (SVR) flips the SVM idea: instead of maximizing the margin between classes, it fits a tube of width ε around the data and tries to fit as many points inside the tube as possible.

Points inside the tube contribute zero loss. Points outside the tube are penalized linearly by their distance from the tube edge. This is the ε-insensitive loss:

L = max(0, |y − f(x)| − ε)

The ε parameter controls tube width. Larger ε = more points inside the tube = simpler model. Smaller ε = tighter fit = more support vectors.
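The ε-insensitive loss is easy to sketch in NumPy (illustrative values; ε = 0.1 is just an example tube width):

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """L = max(0, |y - f(x)| - eps): zero inside the tube, linear outside."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

y_true = np.array([1.0, 1.0, 1.0])
y_pred = np.array([1.05, 1.10, 1.50])  # errors: 0.05, 0.10, 0.50

# First two predictions fall inside the tube (zero loss);
# the third pays only for the distance beyond the tube edge (0.4).
print(eps_insensitive_loss(y_true, y_pred))
```

Compare with squared error, which would penalize all three points: the flat zero region is what makes the SVR solution sparse, since in-tube points get αᵢ = 0.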

SVR inherits all the SVM benefits: kernel trick for nonlinear regression, sparse solution (only support vectors matter), and regularization via C.
SVR in scikit-learn
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

svr = make_pipeline(
    StandardScaler(),
    SVR(kernel='rbf', C=100, epsilon=0.1)
)
svr.fit(X_train, y_train)

y_pred = svr.predict(X_test)
print(f"R²: {r2_score(y_test, y_pred):.3f}")

# Parameters:
#   C:       regularization (same as SVC)
#   epsilon: tube width (ε-insensitive zone)
#   kernel:  rbf, linear, poly (same as SVC)

# SVR is powerful for small-medium datasets
# with nonlinear relationships. For large
# datasets (>10K), use tree-based regressors
# — SVR's O(n²) training is too slow.
Key insight: SVR is like drawing a tube around a roller coaster track. The tube width ε defines how much wobble you tolerate. Points inside the tube are “close enough” and contribute zero error. Only the points poking outside the tube (support vectors) shape the model. This makes SVR robust to small noise while fitting the overall trend.
Complete SVM Pipeline
Kernel comparison on real data with grid search tuning
Kernel Comparison: Digits Dataset
from sklearn.datasets import load_digits
from sklearn.model_selection import (
    train_test_split, cross_val_score
)
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

# 1,797 handwritten digits, 64 features (8x8 px)
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

kernels = {
    "Linear":     SVC(kernel='linear', C=1),
    "Poly (d=3)": SVC(kernel='poly', degree=3, C=1),
    "RBF":        SVC(kernel='rbf', C=10, gamma=0.001),
}

for name, svm in kernels.items():
    pipe = make_pipeline(StandardScaler(), svm)
    cv = cross_val_score(pipe, X_tr, y_tr, cv=5)
    pipe.fit(X_tr, y_tr)   # also fits the SVC inside the pipeline
    test_acc = pipe.score(X_te, y_te)
    print(f"{name:12s} CV={cv.mean():.3f} "
          f"Test={test_acc:.3f} "
          f"SVs={svm.n_support_.sum()}")
Expected Results
Digits (1,797 samples, 64 features, 10 classes):

  Linear      CV=0.977  Test=0.978  SVs=280
  Poly (d=3)  CV=0.990  Test=0.989  SVs=350
  RBF         CV=0.991  Test=0.992  SVs=420

Key takeaways:
• RBF wins: 99.2% on handwritten digits
• Linear is surprisingly good (97.8%)
• More SVs = more complex boundary
• Always scale features (StandardScaler)
• Grid search C and gamma for best results

When to use SVMs:
✓ Small-medium data (<10K samples)
✓ High-dimensional features (text, images)
✓ Clear margin of separation exists
✗ Large data (>50K) — too slow (O(n²)–O(n³))
✗ Need probability estimates (SVM gives scores)
✗ Need interpretability (kernels are opaque)
Key insight: SVMs achieve 99.2% accuracy on digit recognition with just 1,797 training samples — no deep learning needed. They excel when the number of features is large relative to samples (the “d >> n” regime). But for datasets over ~10K samples, SVMs’ O(n²) training time makes them impractical. Use tree-based methods or neural networks for large-scale problems.