Ch 1 — From Neurons to Networks

Biological inspiration, perceptrons, activation functions, and the universal approximation theorem
High level: Biology → Perceptron → Activation → MLP → Approximation → Deep Nets
The Biological Neuron
Where the inspiration began
How Real Neurons Work
The human brain contains roughly 86 billion neurons, each connected to thousands of others via synapses. A neuron receives electrical signals through its dendrites, sums them in the cell body, and if the total exceeds a threshold, fires an output signal along its axon. In 1943, Warren McCulloch and Walter Pitts created the first mathematical model of this process — a simple binary unit that outputs 1 if the weighted sum of inputs exceeds a threshold, 0 otherwise.
The McCulloch-Pitts Model
// McCulloch-Pitts neuron (1943)
// inputs:  x₁, x₂, ..., xₙ   (binary: 0 or 1)
// weights: w₁, w₂, ..., wₙ
// threshold: θ
output = 1 if Σ(wᵢ · xᵢ) ≥ θ
         0 otherwise
Key insight: McCulloch and Pitts showed that networks of these simple binary units could compute any logical function (AND, OR, NOT), establishing that neural computation is theoretically equivalent to a Turing machine.
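The logic-gate claim is easy to verify directly. A minimal sketch in pure Python (the specific weights and thresholds below are illustrative choices, not from the original paper):

```python
def mp_neuron(inputs, weights, theta):
    """McCulloch-Pitts unit: fire (1) iff the weighted sum reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= theta else 0

# AND: both inputs must be on, so unit weights with threshold 2
AND = lambda a, b: mp_neuron([a, b], [1, 1], theta=2)
# OR: a single active input suffices, so threshold 1
OR = lambda a, b: mp_neuron([a, b], [1, 1], theta=1)
# NOT: an inhibitory (negative) weight; fires only when the input is off
NOT = lambda a: mp_neuron([a], [-1], theta=0)
```

Any Boolean circuit can then be built by wiring these gates together, which is the substance of the 1943 result.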
The Perceptron
The first machine that could learn
Rosenblatt's Breakthrough
In 1958, Frank Rosenblatt at Cornell Aeronautical Laboratory introduced the perceptron — the first algorithm that could learn from data. Unlike the McCulloch-Pitts model with fixed weights, the perceptron adjusts its weights based on prediction errors. Rosenblatt proved the Perceptron Convergence Theorem: if the data is linearly separable, the algorithm is guaranteed to find a separating hyperplane in finite steps. He built the Mark I Perceptron, a physical machine with 400 photocells, first demonstrated in 1960.
The Learning Rule
// Perceptron learning rule
for each training example (x, y):
    ŷ = sign(w · x + b)      // predict
    if ŷ ≠ y:                // wrong?
        w = w + η · y · x    // update weights
        b = b + η · y        // update bias
// η = learning rate (step size)
// Converges if data is linearly separable
Why it matters: The perceptron introduced the idea that machines can learn by adjusting parameters from examples — the core principle behind all modern deep learning.
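The learning rule fits in a few lines of Python. A minimal sketch, training on AND (linearly separable, so the Convergence Theorem guarantees a clean pass in finitely many epochs); labels are ±1 as the sign-based rule requires, and the learning rate and epoch budget are illustrative:

```python
def train_perceptron(data, eta=0.1, epochs=100):
    """Perceptron learning rule on 2-D inputs; labels y are +/-1.
    Returns (weights, bias, converged)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in data:
            y_hat = 1 if w[0] * x1 + w[1] * x2 + b >= 0 else -1
            if y_hat != y:              # wrong? apply the update rule
                w[0] += eta * y * x1
                w[1] += eta * y * x2
                b += eta * y
                mistakes += 1
        if mistakes == 0:               # one full clean pass: converged
            return w, b, True
    return w, b, False

# AND is linearly separable, so convergence is guaranteed
AND_DATA = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b, converged = train_perceptron(AND_DATA)
```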
The XOR Problem & AI Winter
One limitation that froze a field for a decade
Minsky & Papert's Critique
In 1969, Marvin Minsky and Seymour Papert published “Perceptrons”, proving that a single-layer perceptron cannot learn XOR — a function that returns 1 when inputs differ. XOR is not linearly separable: no single straight line can divide the outputs. This mathematical proof was devastating. Funding for neural network research dried up, triggering the first AI Winter that lasted through most of the 1970s.
Critical nuance: Minsky and Papert acknowledged that multi-layer networks could solve XOR, but argued there was no known way to train them. It took until 1986 for backpropagation to provide the answer.
Why XOR Breaks a Single Layer
// XOR truth table
// Input A | Input B | Output
//    0    |    0    |   0
//    0    |    1    |   1
//    1    |    0    |   1
//    1    |    1    |   0
// No single line w₁x₁ + w₂x₂ + b = 0
// can separate the 1s from the 0s.
// You need at least TWO lines → TWO layers
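The failure can be watched directly: run the perceptron update rule on the XOR truth table and it never completes a clean pass, because a mistake-free epoch would imply a separating line that does not exist. A minimal self-contained sketch (±1 labels, illustrative learning rate):

```python
def train_perceptron(data, eta=0.1, epochs=500):
    """Perceptron learning rule on 2-D inputs; labels y are +/-1.
    Returns True iff some epoch finishes with zero mistakes."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in data:
            y_hat = 1 if w[0] * x1 + w[1] * x2 + b >= 0 else -1
            if y_hat != y:
                w[0] += eta * y * x1
                w[1] += eta * y * x2
                b += eta * y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

XOR_DATA = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
print(train_perceptron(XOR_DATA))   # False: no epoch is ever mistake-free
```

No matter how many epochs you allow, the weights just cycle; this is exactly what Minsky and Papert's proof predicts.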
Activation Functions
The non-linearity that gives networks their power
Why Non-Linearity Is Essential
Without activation functions, stacking layers is pointless — a chain of linear transformations is just another linear transformation. Activation functions introduce non-linearity, allowing networks to learn curved decision boundaries and complex patterns. The choice of activation function profoundly affects training speed and model capability.
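The collapse of stacked linear layers is visible even in one dimension: composing two affine maps yields another affine map. A minimal sketch (pure Python; the coefficients are arbitrary illustrative values):

```python
# Two "layers" with no activation between them: both are affine maps
f = lambda x: 2 * x + 1          # layer 1
g = lambda x: 3 * x - 2          # layer 2

# Their composition is affine again: g(f(x)) = 3*(2x + 1) - 2 = 6x + 1
h = lambda x: 6 * x + 1

for x in [-2.0, 0.0, 3.5]:
    assert g(f(x)) == h(x)       # stacking bought no extra expressive power
```

The same algebra holds for matrices: W₂(W₁x + b₁) + b₂ is a single linear layer with weights W₂W₁. Only a non-linearity between the layers breaks this collapse.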
The Classic Three
// Sigmoid: squashes to (0, 1)
σ(z) = 1 / (1 + e⁻ᶻ)

// Tanh: squashes to (-1, 1)
tanh(z) = (eᶻ - e⁻ᶻ) / (eᶻ + e⁻ᶻ)

// ReLU: simple and fast
ReLU(z) = max(0, z)
The ReLU Revolution
Sigmoid and tanh dominated early deep learning but suffer from the vanishing gradient problem: for large or small inputs, their gradients approach zero, making deep networks nearly impossible to train. In 2010, Nair and Hinton showed that Rectified Linear Units (ReLU) dramatically improved training. ReLU’s gradient is either 0 or 1 — no saturation, no vanishing. AlexNet (2012) used ReLU and won ImageNet by a huge margin, cementing it as the default activation.
Key insight: ReLU is computationally trivial (just a threshold), yet it solved one of the deepest problems in neural network training. Sometimes the simplest ideas have the biggest impact.
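The saturation argument is easy to quantify: the sigmoid's derivative is σ'(z) = σ(z)(1 − σ(z)), which peaks at 0.25 and collapses toward zero for large |z|, while ReLU's derivative stays at exactly 1 for any positive input. A minimal sketch in pure Python:

```python
import math

sigmoid = lambda z: 1 / (1 + math.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1 - sigmoid(z))   # max value 0.25, at z = 0
d_relu = lambda z: 1.0 if z > 0 else 0.0

# Gradient at a large pre-activation: the sigmoid's has all but vanished
print(d_sigmoid(10))   # roughly 4.5e-05
print(d_relu(10))      # 1.0
```

Multiply a handful of such tiny sigmoid gradients together across layers (as backpropagation does) and the signal reaching early layers is effectively zero; with ReLU the product stays 1 along active paths.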
Multi-Layer Perceptrons
Stacking layers to solve non-linear problems
Architecture of an MLP
A Multi-Layer Perceptron (MLP) stacks neurons into layers: an input layer that receives features, one or more hidden layers that learn intermediate representations, and an output layer that produces predictions. Each neuron in a layer connects to every neuron in the next layer (fully connected / dense). The hidden layers with non-linear activations are what let MLPs solve XOR and far more complex problems.
Forward Pass
// 2-layer MLP solving XOR
// Layer 1 (hidden, 2 neurons):
h₁ = ReLU(w₁₁·x₁ + w₁₂·x₂ + b₁)
h₂ = ReLU(w₂₁·x₁ + w₂₂·x₂ + b₂)
// Layer 2 (output, 1 neuron):
ŷ = σ(v₁·h₁ + v₂·h₂ + c)
// The hidden layer creates a NEW feature
// space where XOR becomes separable
Key insight: Each hidden layer learns a new representation of the data. The first layer might detect edges, the second shapes, the third objects. This hierarchical feature learning is the essence of “deep” learning.
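Concrete weights make the XOR solution tangible. One well-known hand-picked choice (illustrative; many weight settings with the same geometry work, and a linear readout is used here instead of a sigmoid so the outputs come out as exact 0s and 1s):

```python
relu = lambda z: max(0.0, z)

def xor_mlp(x1, x2):
    # Hidden layer: h1 counts active inputs; h2 fires only when BOTH are on
    h1 = relu(x1 + x2)          # weights (1, 1), bias 0
    h2 = relu(x1 + x2 - 1)      # weights (1, 1), bias -1
    # Readout h1 - 2*h2 yields 0, 1, 1, 0 on the four XOR inputs
    return h1 - 2 * h2

for (a, b), want in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    assert xor_mlp(a, b) == want
```

The hidden units map the four input points into a space where (1,1) is no longer "between" the two positive examples, and a single line (the readout) finishes the job.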
Universal Approximation Theorem
The theoretical guarantee behind neural networks
What the Theorem Says
In 1989, George Cybenko proved that a feedforward network with a single hidden layer containing enough neurons with sigmoid activations can approximate any continuous function on a compact subset of Rⁿ to any desired accuracy. That same year, Hornik, Stinchcombe, and White generalized this: the result holds for virtually any non-constant, bounded, continuous activation function. In 1991, Hornik showed it is the architecture itself — not the specific activation — that provides universal approximation.
The Catch
The theorem guarantees existence but not efficiency. A single hidden layer might need an astronomically large number of neurons. In practice, deeper networks (more layers, fewer neurons per layer) are far more parameter-efficient than wide, shallow ones. This is why we build “deep” networks rather than enormous single-layer ones.
Rule of thumb: Width gives you approximation power. Depth gives you efficiency. Modern deep learning bets on depth: GPT-4 reportedly has on the order of 120 transformer layers, not one giant hidden layer.
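A flavor of Cybenko's construction: the difference of two shifted, steep sigmoids is approximately the indicator function of an interval, and a weighted sum of such "bumps" can approximate any continuous function on a compact set. A minimal sketch (pure Python; the interval endpoints and steepness k are illustrative):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def bump(x, a=0.3, b=0.7, k=100):
    """Approximately 1 on [a, b] and 0 outside, built from two sigmoid units."""
    return sigmoid(k * (x - a)) - sigmoid(k * (x - b))

print(round(bump(0.5), 3))   # inside the interval:  close to 1.0
print(round(bump(0.9), 3))   # outside the interval: close to 0.0
```

Each bump costs two hidden neurons; approximating a wiggly function to fine resolution can require enormously many of them, which is exactly the efficiency catch described above.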
Anatomy of a Deep Network
Parameters, hyperparameters, and the forward pass
Parameters vs. Hyperparameters
Parameters are learned from data: weights and biases. A network with layer sizes [784, 256, 128, 10] (three weight matrices) has 784×256 + 256 + 256×128 + 128 + 128×10 + 10 = 235,146 parameters. Hyperparameters are set by the engineer: number of layers, neurons per layer, learning rate, activation function, batch size. The art of deep learning is choosing the right hyperparameters.
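The arithmetic can be checked mechanically: each weight layer contributes in_features × out_features weights plus out_features biases. A small helper (pure Python):

```python
def count_params(sizes):
    """Total weights + biases for a fully connected net with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

print(count_params([784, 256, 128, 10]))   # 235146
```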
PyTorch Example
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.layers(x)
Key insight: Every deep learning model, from a 3-layer MLP to GPT-4, follows the same pattern: stack layers, apply non-linearities, learn weights from data. The difference is scale and architecture.
The Road Ahead
From perceptrons to the deep learning explosion
Key Milestones
1943: McCulloch-Pitts neuron model. 1958: Rosenblatt’s perceptron. 1969: Minsky & Papert’s XOR critique triggers AI Winter. 1986: Rumelhart, Hinton & Williams popularize backpropagation. 1989: Cybenko proves universal approximation. 1998: LeCun’s LeNet for digit recognition. 2010: Nair & Hinton introduce ReLU. 2012: AlexNet wins ImageNet, launching the deep learning era. Everything since builds on these foundations.
The connection: Every concept in this chapter — weighted sums, non-linear activations, stacked layers, learning from data — is present in today’s largest models. GPT-4 and Gemini are descendants of the perceptron, scaled by 12 orders of magnitude.
What's Next
We now know what neural networks are and why they work (universal approximation). The next chapter tackles how they learn: loss functions, backpropagation, and the computational graph that makes gradient-based training possible. This is the engine that turns a randomly initialized network into a useful model.
1969 — AI Winter
Single-layer perceptrons can’t learn XOR. No known way to train multi-layer networks. Funding collapses.
2012 — Deep Learning Era
Backpropagation + ReLU + GPUs + big data = AlexNet. Deep networks dominate vision, language, and beyond.