Ch 1 — From Neurons to Networks

Biological inspiration, perceptrons, activation functions, and the universal approximation theorem
High level: Biology → Perceptron → Activation → MLP → Approximation → Deep Nets
The Biological Neuron
Where the inspiration began
How Real Neurons Work
The human brain contains roughly 86 billion neurons, each connected to thousands of others via synapses. A neuron receives electrical signals through its dendrites, sums them in the cell body, and if the total exceeds a threshold, fires an output signal along its axon. In 1943, Warren McCulloch and Walter Pitts created the first mathematical model of this process — a simple binary unit that outputs 1 if the weighted sum of inputs exceeds a threshold, 0 otherwise.
The McCulloch-Pitts Model
// McCulloch-Pitts neuron (1943)
// inputs:  x₁, x₂, ..., xₙ   (binary: 0 or 1)
// weights: w₁, w₂, ..., wₙ
// threshold: θ
output = 1 if Σ(wᵢ · xᵢ) ≥ θ
         0 otherwise
Key insight: McCulloch and Pitts showed that networks of these simple binary units could compute any logical function (AND, OR, NOT), establishing that neural computation is theoretically equivalent to a Turing machine.
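The logic-gate claim is easy to verify directly. A minimal sketch in pure Python (the specific weights and thresholds below are illustrative choices, not from the original paper):

```python
def mp_neuron(inputs, weights, theta):
    """McCulloch-Pitts unit: fire (1) iff the weighted sum reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= theta else 0

# AND: both inputs must be on, so unit weights with threshold 2
AND = lambda a, b: mp_neuron([a, b], [1, 1], theta=2)
# OR: a single active input suffices, so threshold 1
OR = lambda a, b: mp_neuron([a, b], [1, 1], theta=1)
# NOT: an inhibitory (negative) weight; fires only when the input is off
NOT = lambda a: mp_neuron([a], [-1], theta=0)
```

Any Boolean circuit can then be built by wiring these gates together, which is the substance of the 1943 result.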
The Perceptron
The first machine that could learn
Rosenblatt's Breakthrough
In 1958, Frank Rosenblatt at Cornell Aeronautical Laboratory introduced the perceptron — the first algorithm that could learn from data. Unlike the McCulloch-Pitts model with fixed weights, the perceptron adjusts its weights based on prediction errors. Rosenblatt proved the Perceptron Convergence Theorem: if the data is linearly separable, the algorithm is guaranteed to find a separating hyperplane in finite steps. He built the Mark I Perceptron, a physical machine with 400 photocells, first demonstrated in 1960.
The Learning Rule
// Perceptron learning rule
for each training example (x, y):
    ŷ = sign(w · x + b)      // predict
    if ŷ ≠ y:                // wrong?
        w = w + η · y · x    // update weights
        b = b + η · y        // update bias
// η = learning rate (step size)
// Converges if data is linearly separable
Why it matters: The perceptron introduced the idea that machines can learn by adjusting parameters from examples — the core principle behind all modern deep learning.
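The learning rule fits in a few lines of Python. A minimal sketch, training on AND (linearly separable, so the Convergence Theorem guarantees a clean pass in finitely many epochs); labels are ±1 as the sign-based rule requires, and the learning rate and epoch budget are illustrative:

```python
def train_perceptron(data, eta=0.1, epochs=100):
    """Perceptron learning rule on 2-D inputs; labels y are +/-1.
    Returns (weights, bias, converged)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in data:
            y_hat = 1 if w[0] * x1 + w[1] * x2 + b >= 0 else -1
            if y_hat != y:              # wrong? apply the update rule
                w[0] += eta * y * x1
                w[1] += eta * y * x2
                b += eta * y
                mistakes += 1
        if mistakes == 0:               # one full clean pass: converged
            return w, b, True
    return w, b, False

# AND is linearly separable, so convergence is guaranteed
AND_DATA = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b, converged = train_perceptron(AND_DATA)
```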
The XOR Problem & AI Winter
One limitation that froze a field for a decade
Minsky & Papert's Critique
In 1969, Marvin Minsky and Seymour Papert published “Perceptrons”, proving that a single-layer perceptron cannot learn XOR — a function that returns 1 when inputs differ. XOR is not linearly separable: no single straight line can divide the outputs. This mathematical proof was devastating. Funding for neural network research dried up, triggering the first AI Winter that lasted through most of the 1970s.
Critical nuance: Minsky and Papert acknowledged that multi-layer networks could solve XOR, but argued there was no known way to train them. It took until 1986 for backpropagation to provide the answer.
Why XOR Breaks a Single Layer
// XOR truth table
// Input A | Input B | Output
//    0    |    0    |   0
//    0    |    1    |   1
//    1    |    0    |   1
//    1    |    1    |   0
// No single line w₁x₁ + w₂x₂ + b = 0
// can separate the 1s from the 0s.
// You need at least TWO lines → TWO layers
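The failure can be watched directly: run the perceptron update rule on the XOR truth table and it never completes a clean pass, because a mistake-free epoch would imply a separating line that does not exist. A minimal self-contained sketch (±1 labels, illustrative learning rate):

```python
def train_perceptron(data, eta=0.1, epochs=500):
    """Perceptron learning rule on 2-D inputs; labels y are +/-1.
    Returns True iff some epoch finishes with zero mistakes."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in data:
            y_hat = 1 if w[0] * x1 + w[1] * x2 + b >= 0 else -1
            if y_hat != y:
                w[0] += eta * y * x1
                w[1] += eta * y * x2
                b += eta * y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

XOR_DATA = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
print(train_perceptron(XOR_DATA))   # False: no epoch is ever mistake-free
```

No matter how many epochs you allow, the weights just cycle; this is exactly what Minsky and Papert's proof predicts.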
Activation Functions
The non-linearity that gives networks their power
Why Non-Linearity Is Essential
Without activation functions, stacking layers is pointless — a chain of linear transformations is just another linear transformation. Activation functions introduce non-linearity, allowing networks to learn curved decision boundaries and complex patterns. The choice of activation function profoundly affects training speed and model capability.
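The collapse of stacked linear layers is visible even in one dimension: composing two affine maps yields another affine map. A minimal sketch (pure Python; the coefficients are arbitrary illustrative values):

```python
# Two "layers" with no activation between them: both are affine maps
f = lambda x: 2 * x + 1          # layer 1
g = lambda x: 3 * x - 2          # layer 2

# Their composition is affine again: g(f(x)) = 3*(2x + 1) - 2 = 6x + 1
h = lambda x: 6 * x + 1

for x in [-2.0, 0.0, 3.5]:
    assert g(f(x)) == h(x)       # stacking bought no extra expressive power
```

The same algebra holds for matrices: W₂(W₁x + b₁) + b₂ is a single linear layer with weights W₂W₁. Only a non-linearity between the layers breaks this collapse.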
The Classic Three
// Sigmoid: squashes to (0, 1)
σ(z) = 1 / (1 + e⁻ᶻ)

// Tanh: squashes to (-1, 1)
tanh(z) = (eᶻ - e⁻ᶻ) / (eᶻ + e⁻ᶻ)

// ReLU: simple and fast
ReLU(z) = max(0, z)
The ReLU Revolution
Sigmoid and tanh dominated early deep learning but suffer from the vanishing gradient problem: for large or small inputs, their gradients approach zero, making deep networks nearly impossible to train. In 2010, Nair and Hinton showed that Rectified Linear Units (ReLU) dramatically improved training. ReLU’s gradient is either 0 or 1 — no saturation, no vanishing. AlexNet (2012) used ReLU and won ImageNet by a huge margin, cementing it as the default activation.
Key insight: ReLU is computationally trivial (just a threshold), yet it solved one of the deepest problems in neural network training. Sometimes the simplest ideas have the biggest impact.
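The saturation argument is easy to quantify: the sigmoid's derivative is σ'(z) = σ(z)(1 − σ(z)), which peaks at 0.25 and collapses toward zero for large |z|, while ReLU's derivative stays at exactly 1 for any positive input. A minimal sketch in pure Python:

```python
import math

sigmoid = lambda z: 1 / (1 + math.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1 - sigmoid(z))   # max value 0.25, at z = 0
d_relu = lambda z: 1.0 if z > 0 else 0.0

# Gradient at a large pre-activation: the sigmoid's has all but vanished
print(d_sigmoid(10))   # roughly 4.5e-05
print(d_relu(10))      # 1.0
```

Multiply a handful of such tiny sigmoid gradients together across layers (as backpropagation does) and the signal reaching early layers is effectively zero; with ReLU the product stays 1 along active paths.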
Multi-Layer Perceptrons
Stacking layers to solve non-linear problems
Architecture of an MLP
A Multi-Layer Perceptron (MLP) stacks neurons into layers: an input layer that receives features, one or more hidden layers that learn intermediate representations, and an output layer that produces predictions. Each neuron in a layer connects to every neuron in the next layer (fully connected / dense). The hidden layers with non-linear activations are what let MLPs solve XOR and far more complex problems.
Forward Pass
// 2-layer MLP solving XOR
// Layer 1 (hidden, 2 neurons):
h₁ = ReLU(w₁₁·x₁ + w₁₂·x₂ + b₁)
h₂ = ReLU(w₂₁·x₁ + w₂₂·x₂ + b₂)
// Layer 2 (output, 1 neuron):
ŷ = σ(v₁·h₁ + v₂·h₂ + c)
// The hidden layer creates a NEW feature
// space where XOR becomes separable
Key insight: Each hidden layer learns a new representation of the data. The first layer might detect edges, the second shapes, the third objects. This hierarchical feature learning is the essence of “deep” learning.
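Concrete weights make the XOR solution tangible. One well-known hand-picked choice (illustrative; many weight settings with the same geometry work, and a linear readout is used here instead of a sigmoid so the outputs come out as exact 0s and 1s):

```python
relu = lambda z: max(0.0, z)

def xor_mlp(x1, x2):
    # Hidden layer: h1 counts active inputs; h2 fires only when BOTH are on
    h1 = relu(x1 + x2)          # weights (1, 1), bias 0
    h2 = relu(x1 + x2 - 1)      # weights (1, 1), bias -1
    # Readout h1 - 2*h2 yields 0, 1, 1, 0 on the four XOR inputs
    return h1 - 2 * h2

for (a, b), want in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    assert xor_mlp(a, b) == want
```

The hidden units map the four input points into a space where (1,1) is no longer "between" the two positive examples, and a single line (the readout) finishes the job.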
Universal Approximation Theorem
The theoretical guarantee behind neural networks
What the Theorem Says
In 1989, George Cybenko proved that a feedforward network with a single hidden layer containing enough neurons with sigmoid activations can approximate any continuous function on a compact subset of Rⁿ to any desired accuracy. That same year, Hornik, Stinchcombe, and White generalized this: the result holds for virtually any non-constant, bounded, continuous activation function. In 1991, Hornik showed it is the architecture itself — not the specific activation — that provides universal approximation.
The Catch
The theorem guarantees existence but not efficiency. A single hidden layer might need an astronomically large number of neurons. In practice, deeper networks (more layers, fewer neurons per layer) are far more parameter-efficient than wide, shallow ones. This is why we build “deep” networks rather than enormous single-layer ones.
Rule of thumb: Width gives you approximation power. Depth gives you efficiency. Modern deep learning bets on depth: GPT-4 reportedly has on the order of 120 transformer layers, not one giant hidden layer.
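A flavor of Cybenko's construction: the difference of two shifted, steep sigmoids is approximately the indicator function of an interval, and a weighted sum of such "bumps" can approximate any continuous function on a compact set. A minimal sketch (pure Python; the interval endpoints and steepness k are illustrative):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def bump(x, a=0.3, b=0.7, k=100):
    """Approximately 1 on [a, b] and 0 outside, built from two sigmoid units."""
    return sigmoid(k * (x - a)) - sigmoid(k * (x - b))

print(round(bump(0.5), 3))   # inside the interval:  close to 1.0
print(round(bump(0.9), 3))   # outside the interval: close to 0.0
```

Each bump costs two hidden neurons; approximating a wiggly function to fine resolution can require enormously many of them, which is exactly the efficiency catch described above.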
Anatomy of a Deep Network
Parameters, hyperparameters, and the forward pass
Parameters vs. Hyperparameters
Parameters are learned from data: weights and biases. A network with layer sizes [784, 256, 128, 10] (three weight matrices) has 784×256 + 256 + 256×128 + 128 + 128×10 + 10 = 235,146 parameters. Hyperparameters are set by the engineer: number of layers, neurons per layer, learning rate, activation function, batch size. The art of deep learning is choosing the right hyperparameters.
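The arithmetic can be checked mechanically: each weight layer contributes in_features × out_features weights plus out_features biases. A small helper (pure Python):

```python
def count_params(sizes):
    """Total weights + biases for a fully connected net with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

print(count_params([784, 256, 128, 10]))   # 235146
```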
PyTorch Example
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.layers(x)
Key insight: Every deep learning model, from a 3-layer MLP to GPT-4, follows the same pattern: stack layers, apply non-linearities, learn weights from data. The difference is scale and architecture.
The Road Ahead
From perceptrons to the deep learning explosion
Key Milestones
1943: McCulloch-Pitts neuron model. 1958: Rosenblatt’s perceptron. 1969: Minsky & Papert’s XOR critique triggers AI Winter. 1986: Rumelhart, Hinton & Williams popularize backpropagation. 1989: Cybenko proves universal approximation. 1998: LeCun’s LeNet for digit recognition. 2010: Nair & Hinton introduce ReLU. 2012: AlexNet wins ImageNet, launching the deep learning era. Everything since builds on these foundations.
The connection: Every concept in this chapter — weighted sums, non-linear activations, stacked layers, learning from data — is present in today’s largest models. GPT-4 and Gemini are descendants of the perceptron, scaled by 12 orders of magnitude.
What's Next
We now know what neural networks are and why they work (universal approximation). The next chapter tackles how they learn: loss functions, backpropagation, and the computational graph that makes gradient-based training possible. This is the engine that turns a randomly initialized network into a useful model.
1969 — AI Winter
Single-layer perceptrons can’t learn XOR. No known way to train multi-layer networks. Funding collapses.
2012 — Deep Learning Era
Backpropagation + ReLU + GPUs + big data = AlexNet. Deep networks dominate vision, language, and beyond.