What is backpropagation in neural networks?

May 9, 2026··

13 min read

What is Backpropagation in Neural Networks?

Backpropagation (backward propagation of errors) is the foundational algorithm for training artificial neural networks. It efficiently computes gradients of a network’s loss function with respect to its parameters (weights and biases) so that optimization algorithms (typically gradient descent and its variants) can update the parameters to reduce error. Backpropagation is not an optimizer itself — it’s a technique for computing derivatives using the chain rule across the computational graph of the network.

This article is a comprehensive deep dive: history, mathematical foundations, detailed derivations, pseudocode and working code, practical issues, advanced variants, limitations, and future directions.

Table of contents

Introduction and intuition
Historical background
Mathematical foundations
- Scalar chain rule review
- Derivation for a single hidden layer
- Vectorized form for deep networks
Backpropagation algorithm (step-by-step)
- Pseudocode
- Computational graph perspective
Example: numeric walk-through (small network)
Implementations
- NumPy vectorized example
- PyTorch example (autodiff)
- Gradient checking
Practical considerations and training tricks
- Initialization
- Activation choices
- Learning rates and optimizers
- Mini-batches
- Regularization: weight decay, dropout
- BatchNorm, layer norm, residual connections
- Gradient clipping and normalization
Advanced topics
- Backpropagation through time (BPTT) for RNNs
- Truncated BPTT
- Second-order methods and Hessian information
- Automatic differentiation vs symbolic derivatives
- Memory vs compute trade-offs (checkpointing)
- Biologically plausible alternatives
Failure modes and limitations
- Vanishing and exploding gradients
- Saddle points and plateaus
- Non-differentiable operations and estimators
Current state of the art & research directions
Real-world applications and examples
Summary
Further reading and references

Introduction and intuition

At a high level, training a neural network involves:

Forward pass: compute the network output and the loss function value given inputs and current parameters.
Backward pass (backpropagation): compute gradients of the loss wrt each parameter.
Update parameters using an optimization algorithm (e.g., gradient descent, Adam).

Backpropagation answers the second step by using the chain rule to propagate derivative information from the outputs back to the inputs of each layer. Instead of computing each derivative from scratch, it reuses intermediate gradients computed at downstream nodes — making the computation efficient (O(number of edges) complexity).

Intuitively: if the output error depends on an intermediate quantity, and that intermediate depends on a parameter, the chain rule tells us how the parameter affects the error via the intermediate.

Historical background

The chain rule and gradient-based methods predate neural networks.
The modern backpropagation algorithm for neural networks became widely recognized in the 1980s after the seminal 1986 paper by Rumelhart, Hinton, and Williams, "Learning representations by back-propagating errors".
Earlier antecedents include work on automatic differentiation and multilayer perceptrons, but the 1986 paper popularized the method in neural network research.
Backpropagation is a form of reverse-mode automatic differentiation applied to neural networks and computational graphs.

Mathematical foundations

Scalar chain rule review

If y = f(u) and u = g(x), then dy/dx = (dy/du)*(du/dx). For compositions of many functions, chain rule applies repeatedly.

Derivation for a single hidden layer

Consider a simple feedforward network:

Input x ∈ R^n
Hidden layer: z = W1 x + b1, a = φ(z) (φ applied elementwise)
Output layer: ŷ = W2 a + b2
Loss: L(ŷ, y) (y is ground truth)

Goal: compute gradients ∂L/∂W2, ∂L/∂b2, ∂L/∂W1, ∂L/∂b1.

Stepwise:

δ_out = ∂L/∂ŷ (depends on loss and output).
∂L/∂W2 = δ_out a^T (if ŷ = W2 a + b2 and δ_out has same shape as ŷ).
∂L/∂b2 = δ_out (sum over batch if batched).
Propagate back to a: ∂L/∂a = W2^T δ_out.
For hidden pre-activation z: ∂L/∂z = (∂L/∂a) ∘ φ'(z) where ∘ denotes elementwise product.
∂L/∂W1 = (∂L/∂z) x^T.
∂L/∂b1 = ∂L/∂z.

This exemplifies how gradients are computed layer-by-layer from output to input.

Vectorized form for deep networks

For a deep network of L layers, denote:

at layer ℓ: z^ℓ = W^ℓ a^{ℓ-1} + b^ℓ, a^ℓ = φ^ℓ(z^ℓ)
a^0 = x, a^L = ŷ

Backprop recursion:

δ^L = ∂L/∂z^L = ∂L/∂a^L ∘ φ'^L(z^L)
For ℓ = L-1,...,1: δ^ℓ = (W^{ℓ+1})^T δ^{ℓ+1} ∘ φ'^ℓ(z^ℓ)
Gradients: ∂L/∂W^ℓ = δ^ℓ (a^{ℓ-1})^T ∂L/∂b^ℓ = sum_over_batch(δ^ℓ) (or just δ^ℓ for single sample)

This is the vectorized, matrix-based backprop used in practice.

Backpropagation algorithm (step-by-step)

High-level training loop:

For each mini-batch: a. Forward pass: compute activations a^ℓ and pre-activations z^ℓ for ℓ = 1..L. b. Compute loss L and initial gradient ∂L/∂a^L. c. Backward pass: compute δ^ℓ using recursion above. d. Compute parameter gradients ∂L/∂W^ℓ and ∂L/∂b^ℓ. e. Update parameters with chosen optimizer.

Pseudocode:

Plain Text

# Forward
a[0] = x
for l in 1..L:
    z[l] = W[l] @ a[l-1] + b[l]
    a[l] = phi[l](z[l])

# Compute loss
loss = Loss(a[L], y)

# Backward
delta[L] = dLoss_da(a[L], y) * phi_prime[L](z[L])
for l in L..1:
    dW[l] = delta[l] @ a[l-1].T
    db[l] = sum(delta[l] over batch)
    if l > 1:
        delta[l-1] = (W[l].T @ delta[l]) * phi_prime[l-1](z[l-1])

# Update parameters, e.g.
for l in 1..L:
    W[l] -= learning_rate * dW[l]
    b[l] -= learning_rate * db[l]

Complexity: For a dense feedforward network, computing gradients has similar complexity to a forward pass (approximately 2×–3× the cost of forward pass if implemented carefully).

Computational graph perspective

Backpropagation arises naturally when you view the model as a computational graph where nodes compute functions of their inputs. Reverse-mode automatic differentiation (AD) traverses the graph from outputs backward, accumulating gradients for each node using local derivatives. Reverse-mode AD is efficient when the function maps many inputs (parameters) to few outputs (loss), which is the usual case.

Key ideas:

Store intermediate forward values (z^ℓ, a^ℓ) because backward uses them.
Compute local gradients at each node and combine via chain rule.
Use dynamic programming to avoid repeated computation.

Example: numeric walk-through (small network)

Consider single input x=1.0, hidden size 2, one output. Let: W1 = [[0.1, 0.2]]^T? Better give explicit:

x = [1.0] (scalar) W1 = [[0.5], [-0.3]] (2x1) b1 = [0.0, 0.0] φ = tanh W2 = [0.4, -0.2] (1x2) b2 = 0.0 Loss = 0.5*(ŷ - y)^2 with y = 0.0

Forward: z1 = W1x + b1 = [0.5, -0.3] a1 = tanh(z1) ≈ [0.4621, -0.2913] ŷ = W2·a1 + b2 = 0.40.4621 + (-0.2)(-0.2913) ≈ 0.1848 + 0.0583 = 0.2431 Loss = 0.5(0.2431)^2 ≈ 0.0296

Backward: dLoss/dŷ = ŷ - y = 0.2431 For output layer (no activation): δ2 = 0.2431 Grad W2 = δ2 * a1 = [0.24310.4621, 0.2431(-0.2913)] ≈ [0.1123, -0.0708] Backprop to hidden pre-activations: dL/da1 = W2^T * δ2 = [0.4, -0.2]0.2431 = [0.0972, -0.0486] δ1 = dL/da1 ∘ φ'(z1), where φ'(z) = 1 - tanh^2(z) φ'(z1) ≈ [1 - 0.4621^2, 1 - (-0.2913)^2] = [0.7863, 0.9152] δ1 ≈ [0.09720.7863, -0.0486*0.9152] = [0.0764, -0.0445] Grad W1 = δ1 * x = δ1 (since x=1.0) → [0.0764, -0.0445] These gradients are then used to update weights.

This numeric example illustrates how error distributes to parameters.

Implementations

NumPy vectorized example (single hidden layer)

Python

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1 - s)

# Simple network: input_dim -> hidden_dim -> output_dim
def forward(x, W1, b1, W2, b2):
    z1 = W1.dot(x) + b1
    a1 = sigmoid(z1)
    z2 = W2.dot(a1) + b2
    yhat = z2  # linear output
    cache = (x, z1, a1, z2)
    return yhat, cache

def backward(yhat, y, cache, W2):
    x, z1, a1, z2 = cache
    dz2 = yhat - y  # dL/dz2 for MSE 0.5*(yhat-y)^2
    dW2 = dz2[:, None] * a1[None, :]  # outer product
    db2 = dz2
    da1 = W2.T.dot(dz2)
    dz1 = da1 * sigmoid_prime(z1)
    dW1 = dz1[:, None] * x[None, :]
    db1 = dz1
    return dW1, db1, dW2, db2

# Random init
np.random.seed(0)
x = np.array([0.5, -0.2])         # input vector (dim 2)
y = np.array([0.3])               # target (dim 1)
W1 = np.random.randn(3, 2) * 0.1  # hidden_dim=3, input_dim=2
b1 = np.zeros(3)
W2 = np.random.randn(1, 3) * 0.1
b2 = np.zeros(1)

yhat, cache = forward(x, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(yhat, y, cache, W2)
print("yhat:", yhat, "dW1:", dW1, "dW2:", dW2)

This demonstrates the forward and backward passes manually. In modern frameworks, automatic differentiation handles this for you.

PyTorch example (autodiff)

Python

import torch
import torch.nn as nn
import torch.optim as optim

# Simple model
model = nn.Sequential(
    nn.Linear(2, 3),
    nn.ReLU(),
    nn.Linear(3, 1)
)

criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

x = torch.tensor([[0.5, -0.2]])
y = torch.tensor([[0.3]])

# Training step
optimizer.zero_grad()
yhat = model(x)
loss = criterion(yhat, y)
loss.backward()  # runs backprop automatically
optimizer.step()

PyTorch (and TensorFlow/JAX) use reverse-mode autodiff to compute gradients automatically, freeing the practitioner from writing backprop manually.

Gradient checking (numerical differentiation)

Gradient checking verifies your backprop math by approximating gradients numerically via finite differences:

Python

def numerical_grad(f, params, eps=1e-5):
    num_grads = []
    for i, p in enumerate(params):
        g = np.zeros_like(p)
        it = np.nditer(p, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            idx = it.multi_index
            old = p[idx]
            p[idx] = old + eps
            f_plus = f()
            p[idx] = old - eps
            f_minus = f()
            p[idx] = old
            g[idx] = (f_plus - f_minus) / (2*eps)
            it.iternext()
        num_grads.append(g)
    return num_grads

Compare numerical and analytic gradients; differences should be small (e.g., <1e-6 relative).

Practical considerations and training tricks

Backpropagation provides gradients; training success depends on many additional choices.

Initialization:
- Poor initial weights cause vanishing/exploding signals. Use Glorot/Xavier, He initialization depending on activation (Xavier for tanh/sigmoid, He for ReLU).
Activation functions:
- Sigmoid/tanh can saturate; ReLU, Leaky ReLU, GELU, Swish often better for deep nets.
Learning rate:
- Critical hyperparameter. Use schedules (decay, warmup) or adaptive optimizers.
Optimizers:
- SGD with momentum, RMSProp, Adam, AdamW are popular. Adam converges fast initially; SGD+momentum often gives better final generalization for very large networks.
Mini-batches:
- Using mini-batch gradient descent improves gradient estimate stability and computational throughput.
Regularization:
- L2 weight decay, dropout, early stopping reduce overfitting.
Normalization:
- BatchNorm, LayerNorm stabilize training and mitigate covariate shift, often improving gradient flow.
Residual connections:
- Residual (skip) connections (ResNets) alleviate vanishing gradients allowing very deep networks.
Gradient clipping:
- Clip gradients by norm in RNNs or large networks to prevent exploding gradients.
Mixed precision training:
- Using FP16 reduces memory and speeds training; needs careful scaling of gradients.
Memory/computation:
- Store activations for backward pass consumes memory. Checkpointing trades compute for memory by recomputing activations during backward.

Advanced topics

Backpropagation through time (BPTT)

For recurrent networks, the network is unrolled across time steps and standard backprop is applied through the unrolled graph. BPTT computes gradients across time dependencies; can suffer from vanishing/exploding gradients for long sequences.

Truncated BPTT updates parameters after backpropagating across a limited time window to control compute and memory.

Second-order methods

Backprop gives first-order gradients. Second-order methods (Newton, quasi-Newton, natural gradient, K-FAC) use curvature (Hessian or approximations) to adjust step direction/size. They can accelerate convergence but are computationally costly at large scale.

Automatic differentiation vs symbolic derivatives

Reverse-mode AD computes exact derivatives algorithmically by traversing the computation graph (what modern deep learning frameworks do).
Symbolic differentiation manipulates expressions algebraically (not used in practice for large networks because of expression blowup).

Checkpointing (rematerialization)

To save memory, recompute intermediate activations on demand during the backward pass instead of storing all of them during forward pass. It's a trade-off between compute and memory.

Biologically plausible alternatives

Backprop is not considered biologically plausible by many neuroscientists. Alternatives include local learning rules, feedback alignment, equilibrium propagation, or target propagation — active research area.

Failure modes and limitations

Vanishing gradients: gradients shrink exponentially with depth/time (RNNs), preventing learning of early layers or long-term dependencies.
Exploding gradients: gradients grow exponentially; cause instability unless clipped or weights constrained.
Saddle points: high-dimensional nonconvex landscapes contain many saddle points where gradient is near zero but not a minimum.
Non-differentiable operations: some operations (e.g., discrete sampling) require surrogate gradients (e.g., straight-through estimator) or reparameterization tricks.
Memory: backprop requires storing intermediate activations for reverse pass; for very large models this is expensive.

Mitigations: proper initialization, normalization (BatchNorm), residual connections, gradient clipping, specialized architectures (LSTM/GRU for RNNs).

Current state of the art & research directions

Backprop remains the standard for training deep networks. Key directions and innovations:

Scalable training of massive models (transformers) with mixed-precision, data-parallel and model-parallel methods.
Optimizer research (Adam variants, Adafactor, Lion) and understanding generalization.
Improved normalization schemes and normalization-free training.
Second-order and curvature-aware optimizers for faster convergence (e.g., K-FAC).
Techniques for reducing memory and compute: gradient checkpointing, sparse models, pruning, quantization.
Research on biologically plausible algorithms and local learning rules.
Better handling of non-differentiables in reinforcement learning (policy gradients, Q-learning).
Research on stability and interpretability of gradients.

Real-world applications and examples

Backprop-powered neural networks are ubiquitous:

Computer vision: CNNs for classification, detection, segmentation.
Natural language processing: Transformers trained with backprop on massive corpora.
Speech and audio processing: acoustic models, speech synthesis.
Reinforcement learning: policy/value networks use backprop for parameter updates either via policy gradients or by fitting Q-functions.
Recommendation systems and embeddings.
Generative models: GANs and VAEs train using gradients from losses and adversarial objectives.
Scientific modeling: physics-informed networks, differential equation solvers using neural nets and differentiable programming.

Practical example: fine-tuning a pretrained transformer for text classification uses backprop through the entire model to adjust weights using cross-entropy loss and Adam optimizer.

Summary

Backpropagation is an efficient algorithm for computing gradients of a loss with respect to neural network parameters by applying the chain rule across the computational graph.
It is implemented by reverse-mode automatic differentiation and is central to modern deep learning.
Successful application of backprop requires attention to initialization, activation functions, normalization, optimization algorithms, and many practical tricks to avoid vanishing/exploding gradients and to generalize well.
Despite limitations and debates about biological plausibility, backprop remains the dominant tool for training deep neural networks, and continues to be refined and adapted to ever-larger models and tasks.

What is Backpropagation in Neural Networks?

Introduction and intuition

Historical background

Mathematical foundations

Scalar chain rule review

Derivation for a single hidden layer

Vectorized form for deep networks

Backpropagation algorithm (step-by-step)

Computational graph perspective

Example: numeric walk-through (small network)

Implementations

NumPy vectorized example (single hidden layer)

PyTorch example (autodiff)

Gradient checking (numerical differentiation)

Practical considerations and training tricks

Advanced topics

Backpropagation through time (BPTT)

Second-order methods

Automatic differentiation vs symbolic derivatives

Checkpointing (rematerialization)

Biologically plausible alternatives

Failure modes and limitations

Current state of the art & research directions

Real-world applications and examples

Summary

Further reading and references