What is backpropagation in neural networks?

Backpropagation in Neural Networks: Concise Summary

Backpropagation (backward propagation of errors) is the standard algorithm for computing gradients of a neural network’s loss with respect to its parameters. It uses the chain rule across the network’s computational graph (reverse-mode automatic differentiation) to propagate derivative information from outputs back to inputs so optimizers (SGD, Adam, etc.) can update weights and biases. Backpropagation itself is a gradient-computation technique, not an optimizer.

Key ideas and intuition

  • Training loop: forward pass (compute outputs and loss) → backward pass (compute gradients via chain rule) → parameter update.
  • Instead of rederiving each derivative, backprop reuses intermediate gradients (dynamic programming), yielding efficient computation with cost roughly proportional to the forward pass.
  • View it as traversing a computational graph in reverse, accumulating local gradients at each node and combining them.

Mathematical foundation (essentials)

  • Scalar chain rule: if y = f(u) and u = g(x), then dy/dx = (dy/du)(du/dx).
  • Single hidden layer: for x → z = W1 x + b1 → a = φ(z) → ŷ = W2 a + b2, with loss L(ŷ, y):
      δ_out = ∂L/∂ŷ
      ∂L/∂W2 = δ_out a^T, ∂L/∂b2 = δ_out
      ∂L/∂a = W2^T δ_out, ∂L/∂z = (∂L/∂a) ∘ φ'(z)
      ∂L/∂W1 = (∂L/∂z) x^T, ∂L/∂b1 = ∂L/∂z
  • Deep networks (vectorized): define δ^L = ∂L/∂z^L = (∂L/∂a^L) ∘ φ'^L(z^L) and recurse δ^ℓ = ((W^{ℓ+1})^T δ^{ℓ+1}) ∘ φ'^ℓ(z^ℓ). Gradients: ∂L/∂W^ℓ = δ^ℓ (a^{ℓ-1})^T, ∂L/∂b^ℓ = sum_batch(δ^ℓ).

Algorithm (high level)

  • Forward: compute and store pre-activations z^ℓ and activations a^ℓ for ℓ = 1..L.
  • Compute the loss and the initial gradient ∂L/∂a^L.
  • Backward: compute δ^L, then iterate ℓ = L..1, computing parameter gradients and δ^{ℓ-1} as needed.
  • Update parameters using the chosen optimizer and learning rate.
  • Complexity: gradient computation is comparable to a few forward passes (2–3× cost typically).

Computational graph perspective

  • Reverse-mode automatic differentiation traverses the graph backward, using stored forward intermediates.
  • Efficient when mapping many parameters to a few outputs (typical neural nets).
  • Memory/computation trade-offs: storing activations costs memory; checkpointing can recompute to save memory at extra compute cost.

Numeric example (illustrative)

Small network with one scalar input, two hidden units (tanh), one linear output, MSE loss. The forward pass yields a small output and loss; the backward pass computes the output-layer gradient δ_out, propagates it to the hidden layer using W2^T and φ'(z), and produces gradients for W2 and W1. This demonstrates how error is distributed to parameters (gradients ready for updates).

Implementations

  • Manual: NumPy vectorized forward/backward for small networks, useful for learning and debugging.
  • Frameworks: PyTorch/TensorFlow/JAX use reverse-mode autodiff (loss.backward()) to compute gradients automatically.
  • Gradient checking: finite-difference approximations validate analytic backprop (use a small eps and compare relative differences).

Practical considerations & training tricks

  • Initialization: Xavier/Glorot for tanh/sigmoid, He for ReLU to avoid vanishing/exploding signals.
  • Activations: prefer ReLU/LeakyReLU/GELU/Swish over saturating sigmoid/tanh in deep nets.
  • Optimizers & LR: learning rate scheduling, warmup, and optimizers (SGD+momentum, Adam, AdamW, etc.) are critical.
  • Mini-batching improves stability and throughput; regularization (L2, dropout, early stopping) combats overfitting.
  • Normalization (BatchNorm/LayerNorm) and residual connections (ResNets) improve gradient flow for deep models.
  • Gradient clipping, mixed precision, and checkpointing address exploding gradients, memory, and performance concerns.

Advanced topics

  • BPTT (backpropagation through time) for RNNs and truncated BPTT for long sequences.
  • Second-order and curvature-aware methods (Newton, quasi-Newton, K-FAC) use Hessian information but are heavier computationally.
  • Automatic differentiation vs symbolic differentiation: AD is used in practice; symbolic differentiation tends to blow up.
  • Biologically plausible alternatives (feedback alignment, local rules) are active research directions.

Failure modes & limitations

  • Vanishing/exploding gradients (worse with depth or long time horizons in RNNs).
  • Saddle points and plateaus in high-dimensional nonconvex loss landscapes.
  • Non-differentiable operations (discrete sampling) need estimators or reparameterization tricks.
  • Memory demands: storing activations for the backward pass can be expensive for very large models.

Current research directions

  • Scaling training of massive models (transformers) with mixed precision, parallelism, and memory optimizations.
  • Optimizer improvements and understanding generalization (Adam variants, Lion, Adafactor).
  • Curvature-aware methods, improved normalization schemes, memory/computation reduction (checkpointing, pruning, quantization).
  • Work on alternatives to backprop and better handling of non-differentiable components.

Applications

Backprop-trained networks power computer vision, NLP (transformers), speech/audio, reinforcement learning, recommender systems, generative models (GANs/VAEs), scientific modeling, and many more production systems.

Summary

Backpropagation efficiently computes gradients via the chain rule and reverse-mode AD and is central to modern deep learning. Effective training requires careful initialization, architecture/activation choices, normalization, optimizer and hyperparameter tuning, and techniques to mitigate vanishing/exploding gradients. Despite limitations and debate about biological plausibility, backprop remains the dominant, actively refined method for training neural networks.

Selected references

  • Rumelhart, Hinton & Williams (1986). "Learning representations by back-propagating errors".
  • Goodfellow, Bengio & Courville (2016). Deep Learning (chapter on optimization/AD).
  • LeCun et al. (1998). "Efficient BackProp".
  • Pascanu, Mikolov & Bengio (2013). "On the difficulty of training recurrent neural networks".

What is Backpropagation in Neural Networks?

Backpropagation (backward propagation of errors) is the foundational algorithm for training artificial neural networks. It efficiently computes gradients of a network’s loss function with respect to its parameters (weights and biases) so that optimization algorithms (typically gradient descent and its variants) can update the parameters to reduce error. Backpropagation is not an optimizer itself — it’s a technique for computing derivatives using the chain rule across the computational graph of the network.

This article is a comprehensive deep dive: history, mathematical foundations, detailed derivations, pseudocode and working code, practical issues, advanced variants, limitations, and future directions.

Table of contents

  • Introduction and intuition
  • Historical background
  • Mathematical foundations
      • Scalar chain rule review
      • Derivation for a single hidden layer
      • Vectorized form for deep networks
  • Backpropagation algorithm (step-by-step)
      • Pseudocode
  • Computational graph perspective
  • Example: numeric walk-through (small network)
  • Implementations
      • NumPy vectorized example
      • PyTorch example (autodiff)
      • Gradient checking
  • Practical considerations and training tricks
      • Initialization
      • Activation choices
      • Learning rates and optimizers
      • Mini-batches
      • Regularization: weight decay, dropout
      • BatchNorm, layer norm, residual connections
      • Gradient clipping and normalization
  • Advanced topics
      • Backpropagation through time (BPTT) for RNNs
      • Truncated BPTT
      • Second-order methods and Hessian information
      • Automatic differentiation vs symbolic derivatives
      • Memory vs compute trade-offs (checkpointing)
      • Biologically plausible alternatives
  • Failure modes and limitations
      • Vanishing and exploding gradients
      • Saddle points and plateaus
      • Non-differentiable operations and estimators
  • Current state of the art & research directions
  • Real-world applications and examples
  • Summary
  • Further reading and references

Introduction and intuition

At a high level, training a neural network involves:

  1. Forward pass: compute the network output and the loss function value given inputs and current parameters.
  2. Backward pass (backpropagation): compute gradients of the loss with respect to each parameter.
  3. Update parameters using an optimization algorithm (e.g., gradient descent, Adam).

Backpropagation carries out the second step: it uses the chain rule to propagate derivative information from the outputs back to the inputs of each layer. Instead of computing each derivative from scratch, it reuses intermediate gradients already computed at downstream nodes, so the total cost is linear in the number of edges of the computational graph.

Intuitively: if the output error depends on an intermediate quantity, and that intermediate depends on a parameter, the chain rule tells us how the parameter affects the error via the intermediate.
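The three steps can be sketched on the smallest possible case. This is a minimal illustration, assuming a one-parameter model ŷ = w·x with a squared-error loss; the data point, initial weight, and learning rate are made-up values chosen only for the example.

```python
# Illustrative one-parameter model: y_hat = w * x, loss = 0.5 * (y_hat - y)^2.
# The data point, initial weight, and learning rate are assumed values.
x, y = 2.0, 6.0          # single training example (the ideal weight is 3.0)
w = 0.0                  # initial parameter
learning_rate = 0.1

losses = []
for step in range(50):
    # 1. Forward pass: compute the output and the loss.
    y_hat = w * x
    loss = 0.5 * (y_hat - y) ** 2
    losses.append(loss)

    # 2. Backward pass: chain rule, dL/dw = (dL/dy_hat) * (dy_hat/dw).
    dloss_dyhat = y_hat - y
    dyhat_dw = x
    grad_w = dloss_dyhat * dyhat_dw

    # 3. Update: plain gradient descent.
    w -= learning_rate * grad_w

print(w, losses[0], losses[-1])
```

The weight affects the loss only through the intermediate ŷ, so its gradient is the product of the two local derivatives, exactly the situation described in the paragraph above.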


Historical background

  • The chain rule and gradient-based methods predate neural networks.
  • The modern backpropagation algorithm for neural networks became widely recognized in the 1980s after the seminal 1986 paper by Rumelhart, Hinton, and Williams, "Learning representations by back-propagating errors".
  • Earlier antecedents include Linnainmaa's 1970 description of reverse-mode automatic differentiation and Werbos's 1974 thesis proposing its use for training multilayer networks, but the 1986 paper popularized the method in neural network research.
  • Backpropagation is a form of reverse-mode automatic differentiation applied to neural networks and computational graphs.

Mathematical foundations

Scalar chain rule review

If y = f(u) and u = g(x), then dy/dx = (dy/du)·(du/dx). For a composition of many functions, the chain rule is applied repeatedly, contributing one factor per function.
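A quick way to see the rule at work is to compare it with a numerical derivative. The sketch below assumes the illustrative choice f(u) = sin(u) and g(x) = x², which does not come from the text above.

```python
import math

def g(x):
    return x ** 2              # inner function u = g(x) (illustrative choice)

def f(u):
    return math.sin(u)         # outer function y = f(u) (illustrative choice)

def dy_dx_chain(x):
    u = g(x)
    dy_du = math.cos(u)        # derivative of sin(u) with respect to u
    du_dx = 2 * x              # derivative of x^2 with respect to x
    return dy_du * du_dx

def dy_dx_numeric(x, eps=1e-6):
    # Central finite difference on the composed function f(g(x)).
    return (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

x0 = 1.3
analytic = dy_dx_chain(x0)
numeric = dy_dx_numeric(x0)
print(analytic, numeric)
```

The two printed values should agree to several decimal places; the same comparison, applied per parameter, is the basis of gradient checking.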

Derivation for a single hidden layer

Consider a simple feedforward network:

  • Input x ∈ R^n
  • Hidden layer: z = W1 x + b1, a = φ(z) (φ applied elementwise)
  • Output layer: ŷ = W2 a + b2
  • Loss: L(ŷ, y) (y is ground truth)

Goal: compute gradients ∂L/∂W2, ∂L/∂b2, ∂L/∂W1, ∂L/∂b1.

Stepwise:

  1. δ_out = ∂L/∂ŷ (depends on loss and output).
  2. ∂L/∂W2 = δ_out a^T (an outer product, given ŷ = W2 a + b2; δ_out has the same shape as ŷ).
  3. ∂L/∂b2 = δ_out (sum over batch if batched).
  4. Propagate back to a: ∂L/∂a = W2^T δ_out.
  5. For hidden pre-activation z: ∂L/∂z = (∂L/∂a) ∘ φ'(z) where ∘ denotes elementwise product.
  6. ∂L/∂W1 = (∂L/∂z) x^T.
  7. ∂L/∂b1 = ∂L/∂z.

This exemplifies how gradients are computed layer-by-layer from output to input.
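The seven steps map almost line for line onto NumPy. The sketch below is one possible rendering for a single sample, assuming φ = tanh and a 0.5·squared-error loss (both are assumptions made for the example; the derivation above leaves φ and L generic). The layer sizes and random values are arbitrary, and a finite-difference check on one weight is included as a sanity test.

```python
import numpy as np

def loss_fn(y_hat, y):
    # Assumed loss for illustration: 0.5 * squared error.
    return 0.5 * np.sum((y_hat - y) ** 2)

def forward(x, W1, b1, W2, b2):
    z = W1 @ x + b1
    a = np.tanh(z)                            # assumed activation phi = tanh
    y_hat = W2 @ a + b2
    return z, a, y_hat

def backward(x, y, W2, z, a, y_hat):
    delta_out = y_hat - y                     # step 1: dL/dy_hat for this loss
    dW2 = np.outer(delta_out, a)              # step 2
    db2 = delta_out                           # step 3
    dL_da = W2.T @ delta_out                  # step 4
    dL_dz = dL_da * (1.0 - np.tanh(z) ** 2)   # step 5: phi'(z) = 1 - tanh^2(z)
    dW1 = np.outer(dL_dz, x)                  # step 6
    db1 = dL_dz                               # step 7
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
n, h, m = 3, 4, 2                             # illustrative sizes
x, y = rng.normal(size=n), rng.normal(size=m)
W1, b1 = rng.normal(size=(h, n)), rng.normal(size=h)
W2, b2 = rng.normal(size=(m, h)), rng.normal(size=m)

z, a, y_hat = forward(x, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(x, y, W2, z, a, y_hat)

# Finite-difference check of one entry of W1.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_fn(forward(x, W1_plus, b1, W2, b2)[2], y)
           - loss_fn(forward(x, W1_minus, b1, W2, b2)[2], y)) / (2 * eps)
print(dW1[0, 0], numeric)
```

Each gradient has the same shape as the parameter it belongs to, which is a cheap first check when writing backprop by hand.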

Vectorized form for deep networks

For a deep network of L layers, denote:

  • at layer ℓ: z^ℓ = W^ℓ a^{ℓ-1} + b^ℓ, a^ℓ = φ^ℓ(z^ℓ)
  • a^0 = x, a^L = ŷ

Backprop recursion:

  • δ^L = ∂L/∂z^L = (∂L/∂a^L) ∘ φ'^L(z^L)
  • For ℓ = L-1, ..., 1: δ^ℓ = ((W^{ℓ+1})^T δ^{ℓ+1}) ∘ φ'^ℓ(z^ℓ)
  • Gradients: ∂L/∂W^ℓ = δ^ℓ (a^{ℓ-1})^T and ∂L/∂b^ℓ = sum_over_batch(δ^ℓ) (or just δ^ℓ for a single sample)

This is the vectorized, matrix-based backprop used in practice.
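The recursion is the chain rule applied to z^{ℓ+1} = W^{ℓ+1} φ^ℓ(z^ℓ) + b^{ℓ+1}. As a sketch, written componentwise for a single sample with the same symbols as above (the indices j and k are introduced here only for the derivation):

```latex
\begin{aligned}
\delta^{\ell}_j
  &= \frac{\partial L}{\partial z^{\ell}_j}
   = \sum_k \frac{\partial L}{\partial z^{\ell+1}_k}\,
            \frac{\partial z^{\ell+1}_k}{\partial z^{\ell}_j}
   = \sum_k \delta^{\ell+1}_k \, W^{\ell+1}_{kj}\, {\phi^{\ell}}'(z^{\ell}_j) \\
  &= \Big[ (W^{\ell+1})^{T} \delta^{\ell+1} \Big]_j \, {\phi^{\ell}}'(z^{\ell}_j),
\\[4pt]
\frac{\partial L}{\partial W^{\ell}_{jk}}
  &= \frac{\partial L}{\partial z^{\ell}_j}\,
     \frac{\partial z^{\ell}_j}{\partial W^{\ell}_{jk}}
   = \delta^{\ell}_j \, a^{\ell-1}_k ,
\qquad
\frac{\partial L}{\partial b^{\ell}_j} = \delta^{\ell}_j .
\end{aligned}
```

Collecting the components over j and k gives the matrix forms δ^ℓ = ((W^{ℓ+1})^T δ^{ℓ+1}) ∘ φ'^ℓ(z^ℓ) and ∂L/∂W^ℓ = δ^ℓ (a^{ℓ-1})^T.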


Backpropagation algorithm (step-by-step)

High-level training loop:

  1. For each mini-batch:
     a. Forward pass: compute activations a^ℓ and pre-activations z^ℓ for ℓ = 1..L.
     b. Compute the loss L and the initial gradient ∂L/∂a^L.
     c. Backward pass: compute δ^ℓ using the recursion above.
     d. Compute parameter gradients ∂L/∂W^ℓ and ∂L/∂b^ℓ.
     e. Update parameters with the chosen optimizer.

Pseudocode:

```
# Forward
a[0] = x
for l in 1..L:
    z[l] = W[l] @ a[l-1] + b[l]
    a[l] = phi[l](z[l])

# Compute loss
loss = Loss(a[L], y)

# Backward ("*" is the elementwise product)
delta[L] = dLoss_da(a[L], y) * phi_prime[L](z[L])
for l in L..1:
    dW[l] = delta[l] @ a[l-1].T
    db[l] = sum(delta[l] over batch)
    if l > 1:
        delta[l-1] = (W[l].T @ delta[l]) * phi_prime[l-1](z[l-1])

# Update parameters, e.g.
for l in 1..L:
    W[l] -= learning_rate * dW[l]
    b[l] -= learning_rate * db[l]
```
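For readers who want to run the loop, the pseudocode can be turned into NumPy roughly as follows. This is a sketch under stated assumptions: samples are stored as columns, the loss is the batch mean of 0.5·squared error, the hidden activation is tanh with a linear output layer, and `train_step`, the layer sizes, the toy target, and the learning rate are all hypothetical choices made for the example.

```python
import numpy as np

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2

def identity(z):
    return z

def identity_prime(z):
    return np.ones_like(z)

def train_step(params, x, y, phis, phi_primes, learning_rate):
    """One forward/backward/update step. x: (n_in, batch), y: (n_out, batch)."""
    L = len(params)
    # Forward: store pre-activations z[l] and activations a[l].
    a, z = [x], [None]
    for l in range(1, L + 1):
        W, b = params[l - 1]
        z.append(W @ a[l - 1] + b)
        a.append(phis[l - 1](z[l]))

    batch = x.shape[1]
    # Assumed loss: batch mean of 0.5 * squared error.
    loss = 0.5 * np.sum((a[L] - y) ** 2) / batch

    # Backward: delta[L], then the recursion down to layer 1.
    delta = (a[L] - y) / batch * phi_primes[L - 1](z[L])
    grads = [None] * L
    for l in range(L, 0, -1):
        W, _ = params[l - 1]
        grads[l - 1] = (delta @ a[l - 1].T, delta.sum(axis=1, keepdims=True))
        if l > 1:
            delta = (W.T @ delta) * phi_primes[l - 2](z[l - 1])

    # Update: plain gradient descent, returning new parameter arrays.
    new_params = [(W - learning_rate * dW, b - learning_rate * db)
                  for (W, b), (dW, db) in zip(params, grads)]
    return new_params, loss

rng = np.random.default_rng(0)
sizes = [2, 8, 1]                                  # illustrative layer widths
params = [(0.5 * rng.normal(size=(m, n)), np.zeros((m, 1)))
          for n, m in zip(sizes[:-1], sizes[1:])]
phis, phi_primes = [np.tanh, identity], [tanh_prime, identity_prime]

x = rng.normal(size=(2, 32))
y = x[:1] * x[1:]                                  # toy target: product of the two inputs
losses = []
for _ in range(200):
    params, loss = train_step(params, x, y, phis, phi_primes, learning_rate=0.05)
    losses.append(loss)
print(losses[0], losses[-1])
```

Because Python lists are 0-indexed, layer ℓ of the pseudocode lives at `params[l - 1]`, while `a` and `z` keep the 1-based layer index by storing the input at position 0.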

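Complexity: For a dense feedforward network, the backward pass has the same asymptotic complexity as the forward pass. Each layer needs one matrix product in the forward pass and up to two in the backward pass (one for ∂L/∂W^ℓ and one to propagate δ to the layer below), so the backward pass costs roughly twice the forward pass, and a full forward-plus-backward step roughly three times.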
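The cost comparison can be made concrete by counting multiply-adds in the matrix products. The helper below is a rough sketch, not a profiler: `dense_matmul_counts` is a hypothetical function, it ignores elementwise work such as activations and bias sums, and the layer widths are arbitrary.

```python
def dense_matmul_counts(layer_sizes, batch):
    """Count multiply-adds in the matrix products of one forward and one backward pass.

    layer_sizes = [n_0, n_1, ..., n_L]; elementwise work is ignored.
    """
    forward = backward = 0
    for l in range(1, len(layer_sizes)):
        n_in, n_out = layer_sizes[l - 1], layer_sizes[l]
        forward += n_out * n_in * batch         # z = W a
        backward += n_out * n_in * batch        # dW = delta a^T
        if l > 1:
            backward += n_out * n_in * batch    # propagate delta: W^T delta
    return forward, backward

fwd, bwd = dense_matmul_counts([64, 512, 512, 512, 10], batch=64)
print(bwd / fwd)
```

Under this counting the ratio lies between 1 and 2. It approaches 2 when the first layer is a small share of the total, because the first layer is the only one that does not need to propagate δ further down.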


Computational graph perspective

Backpropagation arises naturally when you view the model as a computational graph where nodes compute functions of their inputs. Reverse-mode automatic differentiation (AD) traverses the graph from outputs backward, accumulating gradients for each node using local derivatives. Reverse-mode AD is efficient when the function maps many inputs (parameters) to few outputs (loss), which is the usual case.

Key ideas:

  • Store intermediate forward values (z^ℓ, a^ℓ) because backward uses them.
  • Compute local gradients at each node and combine via chain rule.
  • Use dynamic programming to avoid repeated computation.
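These ideas fit in a few dozen lines for scalar graphs. The `Value` class below is an illustrative toy, not the API of any framework: each node stores its forward value, its parent nodes, and the local derivative with respect to each parent, and `backward` walks the graph in reverse topological order.

```python
import math

class Value:
    """Scalar node in a computational graph (illustrative, not a library API)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data                 # stored forward value
        self.grad = 0.0                  # accumulated d(output)/d(self)
        self.parents = parents           # input nodes
        self.local_grads = local_grads   # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Topological order guarantees a node's gradient is complete
        # before it is pushed to its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad   # chain rule, accumulated

w, x, b = Value(0.5), Value(1.0), Value(0.1)
out = (w * x + b).tanh()
out.backward()
print(w.grad, x.grad, b.grad)
```

Gradients are accumulated with `+=` so that a node feeding several downstream nodes receives the sum of their contributions, and each node is visited once, which is the dynamic-programming reuse mentioned above.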

Example: numeric walk-through (small network)

Consider a single input x = 1.0, a hidden layer of size 2, and one output, with these explicit values:

  • x = 1.0 (scalar)
  • W1 = [[0.5], [-0.3]] (2x1), b1 = [0.0, 0.0]
  • φ = tanh
  • W2 = [0.4, -0.2] (1x2), b2 = 0.0
  • Loss = 0.5·(ŷ - y)^2 with y = 0.0

Values below are rounded to four decimals for display; each step is computed from unrounded intermediates.

Forward:

  • z1 = W1·x + b1 = [0.5, -0.3]
  • a1 = tanh(z1) ≈ [0.4621, -0.2913]
  • ŷ = W2·a1 + b2 = 0.4·0.4621 + (-0.2)·(-0.2913) ≈ 0.1848 + 0.0583 = 0.2431
  • Loss = 0.5·(0.2431)^2 ≈ 0.0296

Backward:

  • dLoss/dŷ = ŷ - y = 0.2431
  • Output layer (no activation): δ2 = 0.2431
  • Grad W2 = δ2·a1 = [0.2431·0.4621, 0.2431·(-0.2913)] ≈ [0.1123, -0.0708]
  • Backprop to the hidden activations: dL/da1 = W2^T δ2 = [0.4, -0.2]·0.2431 ≈ [0.0972, -0.0486]
  • With φ'(z) = 1 - tanh^2(z): φ'(z1) = [1 - 0.4621^2, 1 - (-0.2913)^2] ≈ [0.7864, 0.9151]
  • δ1 = dL/da1 ∘ φ'(z1) ≈ [0.0972·0.7864, -0.0486·0.9151] ≈ [0.0765, -0.0445]
  • Grad W1 = δ1·x = δ1 (since x = 1.0) → [0.0765, -0.0445]

These gradients are then used to update the weights.

This numeric example illustrates how error distributes to parameters.
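The walk-through can be reproduced with a few lines of plain Python, using the same parameter values. The variable names are chosen for this sketch; because the script keeps full precision at every step, the last printed digit can differ slightly from hand calculations that round intermediate values.

```python
import math

x, y = 1.0, 0.0
W1, b1 = [0.5, -0.3], [0.0, 0.0]     # hidden layer (two tanh units)
W2, b2 = [0.4, -0.2], 0.0            # linear output layer

# Forward pass
z1 = [w * x + b for w, b in zip(W1, b1)]
a1 = [math.tanh(z) for z in z1]
y_hat = sum(w * a for w, a in zip(W2, a1)) + b2
loss = 0.5 * (y_hat - y) ** 2

# Backward pass
delta2 = y_hat - y                                   # dLoss/dy_hat
grad_W2 = [delta2 * a for a in a1]
dL_da1 = [w * delta2 for w in W2]
delta1 = [g * (1.0 - a * a) for g, a in zip(dL_da1, a1)]   # tanh'(z) = 1 - tanh(z)^2
grad_W1 = [d * x for d in delta1]

for name, value in [("y_hat", y_hat), ("loss", loss),
                    ("grad_W2", grad_W2), ("grad_W1", grad_W1)]:
    print(name, value)
```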


Implementations

NumPy vectorized example (single hidden layer)

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1 - s)

# Simple network: input_dim -> hidden_dim -> output_dim

def forward(x, W1, b1, W2, b2):
    z1 = W1.dot(x) + b1
    a1 = sigmoid(z1)
    z2 = W2.dot(a1) + b2
    yhat = z2  # linear output
    cache = (x, z1, a1, z2)
    return yhat, cache

def backward(yhat, y, cache, W2):
    x, z1, a1, z2 = cache
    # dz2 ...
```
