Ch 5 — Perceptrons & Neurons

From biological neurons to artificial ones — the building blocks of every neural network
High Level
Bio Neuron → Perceptron → Activation → XOR Crisis → MLP → Deep Nets
Biological Inspiration
How real neurons inspired artificial ones
The Biological Neuron
Your brain contains roughly 86 billion neurons, each connected to thousands of others via synapses. A neuron receives electrical signals through dendrites, processes them in the cell body, and if the combined signal exceeds a threshold, fires an output signal along its axon to other neurons.
The Key Insight
McCulloch and Pitts (1943) realized this could be modeled mathematically: inputs × weights → sum → threshold → output. This simple abstraction became the foundation of all neural networks. The brain computes through massive parallelism of simple units — artificial neural networks follow the same principle.
# Biological neuron → artificial neuron
Dendrites   → Inputs (x₁, x₂, ... xₙ)
Synapses    → Weights (w₁, w₂, ... wₙ)
Cell body   → Weighted sum + bias
Threshold   → Activation function
Axon output → Output (y)

# The artificial neuron computes:
z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
y = activation(z)
Important caveat: Artificial neurons are a loose analogy to biological neurons. Real neurons use complex electrochemical signaling, have timing-dependent plasticity, and operate in ways we still don’t fully understand. The analogy is useful for intuition but shouldn’t be taken literally.
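The computation above is small enough to write out directly. Here is a minimal sketch of an artificial neuron in plain Python — the weights and inputs are arbitrary illustrative values, not learned ones:

```python
def neuron(x, w, b, activation):
    """Weighted sum + bias, passed through an activation function."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

# Step function: the original perceptron's activation
def step(z):
    return 1 if z >= 0 else 0

# Example: two inputs, illustrative weights
print(neuron([1, 0], [0.5, 0.5], -0.25, step))  # z = 0.25 >= 0, so 1
print(neuron([0, 0], [0.5, 0.5], -0.25, step))  # z = -0.25 < 0, so 0
```

Everything that follows in this chapter — perceptrons, MLPs, deep networks — is built from this one function applied at scale.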
The Perceptron
Frank Rosenblatt’s 1958 learning machine
How It Works
The perceptron is the simplest possible neural network — a single artificial neuron. It takes multiple inputs, multiplies each by a learned weight, adds a bias, and passes the result through a step function. If the sum exceeds the threshold, output 1; otherwise, output 0.
# Perceptron computation
Inputs:  x = [x₁, x₂, ..., xₙ]
Weights: w = [w₁, w₂, ..., wₙ]
Bias:    b

# Step 1: Weighted sum
z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b

# Step 2: Step function
y = 1 if z ≥ 0
y = 0 if z < 0
The Learning Rule
Rosenblatt’s key innovation: the perceptron learns its own weights from labeled examples. For each misclassified example, nudge the weights in the direction that reduces the error. Repeat until all training examples are classified correctly.
# Perceptron learning rule
For each training example (x, target):
    prediction = perceptron(x)
    error = target - prediction
    if error ≠ 0:
        w = w + η × error × x
        b = b + η × error

# η = learning rate (small, e.g. 0.01)
# Convergence theorem: if data is linearly
# separable, this WILL find a solution
1958 hype: The New York Times reported the perceptron as a machine that “will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” The reality was far more modest — but the learning principle was genuinely revolutionary.
What a Perceptron Can Do
Linear decision boundaries and logical gates
Linear Separability
A perceptron draws a straight line (or hyperplane in higher dimensions) to separate two classes. Any problem where the classes can be divided by a straight line is “linearly separable” — and a perceptron can solve it perfectly.
# AND gate — perceptron CAN learn this
x₁ x₂ | AND
0  0  |  0
0  1  |  0
1  0  |  0
1  1  |  1   ← only both true

# OR gate — perceptron CAN learn this
x₁ x₂ | OR
0  0  |  0
0  1  |  1
1  0  |  1
1  1  |  1   ← either true
Geometric View
The weights define a decision boundary — a line that separates the input space into two regions. Points on one side are classified as 1, points on the other as 0. The weight vector is perpendicular to this boundary. The bias shifts the boundary away from the origin.
# AND gate solution
w₁ = 1.0, w₂ = 1.0, b = -1.5
z = 1.0×x₁ + 1.0×x₂ - 1.5

(0,0): z = -1.5 < 0 → 0
(0,1): z = -0.5 < 0 → 0
(1,0): z = -0.5 < 0 → 0
(1,1): z = +0.5 ≥ 0 → 1
The perceptron convergence theorem (Rosenblatt, 1962) guarantees that if the data is linearly separable, the learning algorithm will find a solution in a finite number of steps. This was one of the first mathematical guarantees in machine learning.
The XOR Crisis
The fatal limitation that nearly killed neural networks
The Problem
In 1969, Minsky and Papert published Perceptrons, proving that a single perceptron cannot learn XOR (exclusive OR). XOR outputs 1 when inputs differ and 0 when they’re the same. No single straight line can separate the 1s from the 0s.
# XOR gate — NOT linearly separable
x₁ x₂ | XOR
0  0  |  0   ← same = 0
0  1  |  1   ← different = 1
1  0  |  1   ← different = 1
1  1  |  0   ← same = 0

No single line can separate the 1s from the 0s!
# The 1s are at opposite corners of a square
# You need at least TWO lines (or a curve)
The Impact
Minsky and Papert’s proof was mathematically correct but their implied conclusion — that multi-layer networks were also limited — was wrong. However, the book devastated neural network research funding. The first AI Winter (1970s) followed, as researchers and funders abandoned connectionism.
Single Perceptron
Can only learn linearly separable functions. AND, OR, NOT — yes. XOR, circles, spirals — no. One straight line is all it has.
Multi-Layer Network
Can learn any function (universal approximation). XOR, complex boundaries, images, language. Multiple layers = multiple decision boundaries combined.
The irony: Minsky and Papert knew multi-layer networks could solve XOR. The problem was that nobody knew how to train them. The backpropagation algorithm (Ch 6) wouldn’t become practical until 1986 — 17 years later.
Activation Functions
The non-linearity that gives neural networks their power
Why Non-Linearity?
Without activation functions, stacking layers is pointless — multiple linear transformations collapse into a single linear transformation. Activation functions introduce non-linearity, allowing networks to learn curved decision boundaries and complex patterns.
# Key activation functions

Step     y = 1 if z ≥ 0, else 0
         Binary, not differentiable
         Used in: original perceptron

Sigmoid  y = 1 / (1 + e⁻ᶻ)
         Range: (0, 1), smooth S-curve
         Used in: output for binary classification

Tanh     y = (eᶻ - e⁻ᶻ) / (eᶻ + e⁻ᶻ)
         Range: (-1, 1), zero-centered
         Used in: RNN hidden layers

ReLU     y = max(0, z)
         Range: [0, ∞), dead simple
         Used in: default for hidden layers

Softmax  yᵢ = eᶻᵢ / ∑ⱼ eᶻⱼ
         Outputs sum to 1 (probabilities)
         Used in: multi-class output
ReLU: The Modern Default
ReLU (Rectified Linear Unit) is the most widely used activation function in deep learning. It’s computationally cheap (just a max operation), doesn’t saturate for positive values, and creates sparse activations (many neurons output exactly 0).
The Vanishing Gradient Problem
Sigmoid and tanh squash outputs into small ranges. For very large or very small inputs, the gradient approaches zero. During backpropagation, these tiny gradients multiply across layers, making deep networks nearly impossible to train. ReLU solved this — its gradient is 1 for positive inputs, enabling training of much deeper networks.
ReLU variants: Leaky ReLU (small slope for negatives), ELU (smooth for negatives), GELU (used in transformers — smooth approximation of ReLU). Swish (x × sigmoid(x)) is used in some modern architectures. But plain ReLU remains the default starting point.
The Multilayer Perceptron (MLP)
Stacking layers to solve any problem
Architecture
An MLP stacks multiple layers of neurons: an input layer (receives features), one or more hidden layers (learn representations), and an output layer (produces predictions). Each neuron in one layer connects to every neuron in the next — hence “fully connected” or “dense” layers.
# MLP for XOR (2 inputs, 1 output)
Input layer:  2 neurons (x₁, x₂)
Hidden layer: 2 neurons (h₁, h₂)
Output layer: 1 neuron  (y)

# Forward pass:
h₁ = ReLU(w₁₁x₁ + w₁₂x₂ + b₁)
h₂ = ReLU(w₂₁x₁ + w₂₂x₂ + b₂)
y = sigmoid(v₁h₁ + v₂h₂ + b₃)

# Hidden layer transforms the space so
# XOR becomes linearly separable!
Why Hidden Layers Work
Each hidden layer transforms the input space into a new representation. The first hidden layer might learn simple features (edges, basic patterns). Deeper layers combine these into complex features. The output layer then draws a simple boundary in this transformed space.
XOR Solved
The hidden layer warps the 2D input space so that the XOR points become linearly separable. Think of it as folding a piece of paper so that previously separated points now line up on the same side.
Universal Approximation Theorem (Cybenko, 1989; Hornik, 1991): A neural network with a single hidden layer containing enough neurons can approximate any continuous function to arbitrary accuracy. This is an existence proof — it guarantees a solution exists but doesn’t say how to find it or how many neurons you need.
From MLP to Deep Networks
Why depth matters more than width
Depth vs Width
The universal approximation theorem says a wide single-layer network can approximate anything. But in practice, deeper networks are exponentially more efficient. A function that requires millions of neurons in one layer might need only hundreds across several layers. Depth enables hierarchical feature learning.
# Typical modern architectures

Shallow MLP (1-2 hidden layers)
    Tabular data, simple classification
    Parameters: thousands

Deep MLP (3-8 hidden layers)
    Complex tabular, feature extraction
    Parameters: millions

CNN (10-150+ layers)
    Images, spatial data
    Parameters: millions to billions

Transformer (12-96+ layers)
    Text, multimodal
    Parameters: billions to trillions
Hierarchical Features
Layer 1: Learns edges, simple patterns
Layer 2: Combines edges into textures, shapes
Layer 3: Combines shapes into parts (eyes, wheels)
Layer 4+: Combines parts into objects (faces, cars)

Each layer builds on the previous one, creating increasingly abstract representations. This is why deep networks excel at complex tasks like image recognition and language understanding.
The depth revolution: AlexNet (2012) had 8 layers. VGG (2014) had 19. ResNet (2015) had 152. GPT-4 has 120+ transformer layers. Depth is the key enabler of modern AI — but training deep networks requires techniques covered in Ch 6 (backpropagation, batch normalization, residual connections).
Neurons in Modern AI
From Rosenblatt’s perceptron to GPT’s billions of parameters
The Same Core Idea
Every modern neural network — CNNs, transformers, diffusion models — is built from the same fundamental unit: weighted sum + bias + activation. The perceptron’s core computation hasn’t changed since 1958. What changed is scale, architecture, and training methods.
# Scale of modern networks

Perceptron (1958)   Neurons: 1          Parameters: ~10
LeNet-5 (1998)      Neurons: ~60K       Parameters: 60K
AlexNet (2012)      Neurons: ~650K      Parameters: 60M
GPT-3 (2020)        Neurons: ~billions  Parameters: 175B
GPT-4 (2023)        Parameters: ~1.8T (estimated)
Key Takeaways
1. A neuron computes: weighted sum + bias + activation

2. A single perceptron can only learn linearly separable functions

3. The XOR problem proved single-layer perceptrons are limited

4. Hidden layers transform input space, enabling non-linear boundaries

5. ReLU activation solved the vanishing gradient problem

6. The universal approximation theorem guarantees expressiveness

7. Depth enables hierarchical feature learning — the foundation of modern deep learning
Coming up: Ch 6 covers how these networks learn — backpropagation, gradient descent in practice, batch normalization, and the training tricks that make deep learning work. The neuron is the atom; training is the chemistry.