What is Backpropagation in Neural Networks?
Backpropagation (backward propagation of errors) is the foundational algorithm for training artificial neural networks. It efficiently computes gradients of a network’s loss function with respect to its parameters (weights and biases) so that optimization algorithms (typically gradient descent and its variants) can update the parameters to reduce error. Backpropagation is not an optimizer itself — it’s a technique for computing derivatives using the chain rule across the computational graph of the network.
This article is a comprehensive deep dive: history, mathematical foundations, detailed derivations, pseudocode and working code, practical issues, advanced variants, limitations, and future directions.
Table of contents
- Introduction and intuition
- Historical background
- Mathematical foundations
- Scalar chain rule review
- Derivation for a single hidden layer
- Vectorized form for deep networks
- Backpropagation algorithm (step-by-step)
- Pseudocode
- Computational graph perspective
- Example: numeric walk-through (small network)
- Implementations
- NumPy vectorized example
- PyTorch example (autodiff)
- Gradient checking
- Practical considerations and training tricks
- Initialization
- Activation choices
- Learning rates and optimizers
- Mini-batches
- Regularization: weight decay, dropout
- BatchNorm, layer norm, residual connections
- Gradient clipping and normalization
- Advanced topics
- Backpropagation through time (BPTT) for RNNs
- Truncated BPTT
- Second-order methods and Hessian information
- Automatic differentiation vs symbolic derivatives
- Memory vs compute trade-offs (checkpointing)
- Biologically plausible alternatives
- Failure modes and limitations
- Vanishing and exploding gradients
- Saddle points and plateaus
- Non-differentiable operations and estimators
- Current state of the art & research directions
- Real-world applications and examples
- Summary
- Further reading and references
Introduction and intuition
At a high level, training a neural network involves:
- Forward pass: compute the network output and the loss function value given inputs and current parameters.
- Backward pass (backpropagation): compute gradients of the loss wrt each parameter.
- Update parameters using an optimization algorithm (e.g., gradient descent, Adam).
Backpropagation answers the second step by using the chain rule to propagate derivative information from the outputs back to the inputs of each layer. Instead of computing each derivative from scratch, it reuses intermediate gradients computed at downstream nodes — making the computation efficient (O(number of edges) complexity).
Intuitively: if the output error depends on an intermediate quantity, and that intermediate depends on a parameter, the chain rule tells us how the parameter affects the error via the intermediate.
Historical background
- The chain rule and gradient-based methods predate neural networks.
- The modern backpropagation algorithm for neural networks became widely recognized in the 1980s after the seminal 1986 paper by Rumelhart, Hinton, and Williams, "Learning representations by back-propagating errors".
- Earlier antecedents include work on automatic differentiation and multilayer perceptrons, but the 1986 paper popularized the method in neural network research.
- Backpropagation is a form of reverse-mode automatic differentiation applied to neural networks and computational graphs.
Mathematical foundations
Scalar chain rule review
If y = f(u) and u = g(x), then dy/dx = (dy/du)*(du/dx). For compositions of many functions, chain rule applies repeatedly.
Derivation for a single hidden layer
Consider a simple feedforward network:
- Input x ∈ R^n
- Hidden layer: z = W1 x + b1, a = φ(z) (φ applied elementwise)
- Output layer: ŷ = W2 a + b2
- Loss: L(ŷ, y) (y is ground truth)
Goal: compute gradients ∂L/∂W2, ∂L/∂b2, ∂L/∂W1, ∂L/∂b1.
Stepwise:
- δ_out = ∂L/∂ŷ (depends on loss and output).
- ∂L/∂W2 = δout a^T (if ŷ = W2 a + b2 and δout has same shape as ŷ).
- ∂L/∂b2 = δ_out (sum over batch if batched).
- Propagate back to a: ∂L/∂a = W2^T δ_out.
- For hidden pre-activation z: ∂L/∂z = (∂L/∂a) ∘ φ'(z) where ∘ denotes elementwise product.
- ∂L/∂W1 = (∂L/∂z) x^T.
- ∂L/∂b1 = ∂L/∂z.
This exemplifies how gradients are computed layer-by-layer from output to input.
Vectorized form for deep networks
For a deep network of L layers, denote:
- at layer ℓ: z^ℓ = W^ℓ a^{ℓ-1} + b^ℓ, a^ℓ = φ^ℓ(z^ℓ)
- a^0 = x, a^L = ŷ
Backprop recursion:
- δ^L = ∂L/∂z^L = ∂L/∂a^L ∘ φ'^L(z^L)
- For ℓ = L-1,...,1:
δ^ℓ = (W^{ℓ+1})^T δ^{ℓ+1} ∘ φ'^ℓ(z^ℓ)
- Gradients:
∂L/∂W^ℓ = δ^ℓ (a^{ℓ-1})^T ∂L/∂b^ℓ = sumoverbatch(δ^ℓ) (or just δ^ℓ for single sample)
This is the vectorized, matrix-based backprop used in practice.
Backpropagation algorithm (step-by-step)
High-level training loop:
- For each mini-batch:
a. Forward pass: compute activations a^ℓ and pre-activations z^ℓ for ℓ = 1..L. b. Compute loss L and initial gradient ∂L/∂a^L. c. Backward pass: compute δ^ℓ using recursion above. d. Compute parameter gradients ∂L/∂W^ℓ and ∂L/∂b^ℓ. e. Update parameters with chosen optimizer.
Pseudocode:
```
Forward
a[0] = x for l in 1..L: z[l] = W[l] @ a[l-1] + b[l] a[l] = phi[l](z[l])
Compute loss
loss = Loss(a[L], y)
Backward
delta[L] = dLossda(a[L], y) phiprime[L](z[L]) for l in L..1: dW[l] = delta[l] @ a[l-1].T db[l] = sum(delta[l] over batch) if l > 1: delta[l-1] = (W[l].T @ delta[l]) phi_prime[l-1](z[l-1])
Update parameters, e.g.
for l in 1..L: W[l] -= learningrate dW[l] b[l] -= learningrate db[l] ```
Complexity: For a dense feedforward network, computing gradients has similar complexity to a forward pass (approximately 2×–3× the cost of forward pass if implemented carefully).
Computational graph perspective
Backpropagation arises naturally when you view the model as a computational graph where nodes compute functions of their inputs. Reverse-mode automatic differentiation (AD) traverses the graph from outputs backward, accumulating gradients for each node using local derivatives. Reverse-mode AD is efficient when the function maps many inputs (parameters) to few outputs (loss), which is the usual case.
Key ideas:
- Store intermediate forward values (z^ℓ, a^ℓ) because backward uses them.
- Compute local gradients at each node and combine via chain rule.
- Use dynamic programming to avoid repeated computation.
Example: numeric walk-through (small network)
Consider single input x=1.0, hidden size 2, one output. Let: W1 = [[0.1, 0.2]]^T? Better give explicit:
x = [1.0] (scalar) W1 = [[0.5], [-0.3]] (2x1) b1 = [0.0, 0.0] φ = tanh W2 = [0.4, -0.2] (1x2) b2 = 0.0 Loss = 0.5*(ŷ - y)^2 with y = 0.0
Forward: z1 = W1x + b1 = [0.5, -0.3] a1 = tanh(z1) ≈ [0.4621, -0.2913] ŷ = W2·a1 + b2 = 0.40.4621 + (-0.2)(-0.2913) ≈ 0.1848 + 0.0583 = 0.2431 Loss = 0.5(0.2431)^2 ≈ 0.0296
Backward: dLoss/dŷ = ŷ - y = 0.2431 For output layer (no activation): δ2 = 0.2431 Grad W2 = δ2 a1 = [0.24310.4621, 0.2431(-0.2913)] ≈ [0.1123, -0.0708] Backprop to hidden pre-activations: dL/da1 = W2^T δ2 = [0.4, -0.2]0.2431 = [0.0972, -0.0486] δ1 = dL/da1 ∘ φ'(z1), where φ'(z) = 1 - tanh^2(z) φ'(z1) ≈ [1 - 0.4621^2, 1 - (-0.2913)^2] = [0.7863, 0.9152] δ1 ≈ [0.09720.7863, -0.04860.9152] = [0.0764, -0.0445] Grad W1 = δ1 x = δ1 (since x=1.0) → [0.0764, -0.0445] These gradients are then used to update weights.
This numeric example illustrates how error distributes to parameters.
Implementations
NumPy vectorized example (single hidden layer)
```python import numpy as np
def sigmoid(x): return 1 / (1 + np.exp(-x)) def sigmoid_prime(x): s = sigmoid(x) return s * (1 - s)
Simple network: inputdim -> hiddendim -> output_dim
def forward(x, W1, b1, W2, b2): z1 = W1.dot(x) + b1 a1 = sigmoid(z1) z2 = W2.dot(a1) + b2 yhat = z2 # linear output cache = (x, z1, a1, z2) return yhat, cache
def backward(yhat, y, cache, W2): x, z1, a1, z2 = cache dz2 ...