What are activation functions in deep learning?
Activation functions are a fundamental component of artificial neural networks: they determine how the weighted sum of a neuron's inputs is transformed into its output. Without nonlinear activation functions, a multilayer network would collapse to a linear model regardless of depth. Activation functions shape the network's representational power, optimization behavior, numerical stability, sparsity, and ultimately performance on tasks.
This article is a deep dive into activation functions: history, mathematical properties, common families, theoretical foundations, practical guidance, code examples, current research directions, and future implications.
Contents
- Introduction and role of activation functions
- Brief history and evolution
- Mathematical properties and practical considerations
- Common activation functions (formulas, derivatives, pros/cons)
- Output-layer activations (classification/regression)
- Theoretical foundations (universal approximation, gradients, signal propagation)
- Practical guidance and best practices
- Implementation examples (PyTorch / TensorFlow snippets)
- Advanced/novel activations and research directions
- Future implications
- Summary
- Key references and further reading
Introduction and role of activation functions
A neuron computes a weighted sum of its inputs plus a bias: `` z = w·x + b `` The activation function φ transforms z into the neuron's output: `` a = φ(z) `` Key roles:
- Introduce nonlinearity so networks can approximate complex functions (universal approximation).
- Shape gradient flow during training (impacting vanishing/exploding gradients).
- Influence sparsity and information propagation.
- Affect convergence speed and generalization.
Without nonlinear φ, stacking layers yields another linear transformation: `` φ(z) = z => linear network `` so depth adds no representational power.
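The collapse can be checked numerically. The sketch below is only an illustration: the weights, biases, and input are hypothetical values, and the helper functions are written out by hand to keep the example dependency-free. It shows that two stacked identity-activation layers give the same output as a single affine layer with W = W2·W1 and b = W2·b1 + b2.

```python
# Two stacked layers with identity activation collapse into one affine map.

def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Hypothetical parameters for a 2 -> 3 -> 2 network.
W1 = [[1.0, -2.0], [0.5, 3.0], [2.0, 1.0]]
b1 = [0.1, -0.2, 0.3]
W2 = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5]]
b2 = [0.5, -0.5]
x = [1.5, -0.7]

# Layer by layer, with phi(z) = z.
two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single layer.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

max_diff = max(abs(p - q) for p, q in zip(two_layer, one_layer))
print(two_layer, one_layer, max_diff)
```

The two results agree up to floating-point rounding, which is why a nonlinear φ is needed for depth to matter.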
Brief history and evolution
- 1943: McCulloch and Pitts proposed binary threshold (step) neurons — first formalized artificial neuron.
- 1958: Rosenblatt's perceptron used step activation; limited to linearly separable problems.
- 1970s–80s: Sigmoid (logistic) and tanh activations became common; differentiable, enabling gradient-based learning.
- 1980s: Backpropagation popularized training multi-layer networks with differentiable activations.
- 1990s–2000s: Sigmoid/tanh used widely, but deeper networks suffered from vanishing gradients.
- 2010–2012: ReLU (Rectified Linear Unit) surged in popularity (Nair & Hinton 2010; widespread use after AlexNet 2012) because of simplicity and improved gradient flow.
- 2010s onward: ReLU variants (LeakyReLU, PReLU), ELU, SELU, and later GELU, Swish, Mish, and hard approximations for mobile/hardware efficiency.
- Present: Activation choice is task-dependent: ReLU default for CNNs, GELU/Swish common in transformers; specialized activations in quantized or spiking networks.
Mathematical properties and practical considerations
When choosing or designing activation functions, consider:
- Differentiability: gradient-based optimization needs a usable gradient almost everywhere; differentiability at every point is not required (ReLU is not differentiable at 0, where a subgradient is used instead).
- Saturation: functions that saturate (sigmoid, tanh) have near-zero gradients far from 0, causing vanishing gradients.
- Zero-centeredness: activations centered around zero (tanh) help balance gradients; positive-only outputs (ReLU, sigmoid) shift the mean input of the next layer away from zero, which can slow optimization.
- Monotonicity: monotonic activations simplify optimization analysis; non-monotonic activations (Swish, Mish) can sometimes improve performance.
- Boundedness: bounded outputs (sigmoid/tanh) limit activation magnitude; unbounded activations (ReLU) can grow large and risk exploding activations if not controlled.
- Sparsity: ReLU yields exact zeros for negative inputs, encouraging sparse activations and computational efficiency.
- Smoothness: smooth activations (ELU, softplus, Swish) can improve optimization by providing continuous gradients.
- Computational cost: simple operations (max, multiplication) are faster and more hardware-friendly than complex functions.
- Robustness to weight initialization: some activations require careful initialization (SELU has special requirements).
- Quantization/efficiency: hardware-friendly approximations (hard-swish, hard-sigmoid) are used for mobile networks.
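To make the saturation point concrete, the following sketch evaluates the local derivatives of sigmoid, tanh, and ReLU at a few pre-activation values. The sample points and the depth of 10 are arbitrary choices for illustration; the derivative formulas are the standard ones listed later in this article.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

def d_relu(z):
    return 1.0 if z > 0 else 0.0

# Local derivatives shrink quickly for the saturating functions.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid'={d_sigmoid(z):.2e}  "
          f"tanh'={d_tanh(z):.2e}  relu'={d_relu(z):.0f}")

# Backpropagation multiplies local derivatives layer by layer.
# Even in the best case for sigmoid (z = 0, derivative 0.25),
# the product over 10 layers is already tiny.
depth = 10
sigmoid_chain = d_sigmoid(0.0) ** depth
print(sigmoid_chain)
```

This ignores the weight matrices, which also scale the gradient, so it is a simplified picture of why saturating activations make deep networks harder to train.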
Common activation functions
Below are commonly used activations with formula, derivative, and pros/cons.
Note: φ'(z) denotes derivative wrt z.
1. Linear (identity)
Formula: `` φ(z) = z `` Derivative: `` φ'(z) = 1 `` Use: output layer for regression; not used in hidden layers (would make network linear).
Pros: simple; preserves scale. Cons: no nonlinearity.
2. Step (Heaviside / binary threshold)
Formula: `` φ(z) = 1 if z >= 0 else 0 `` Derivative: zero almost everywhere (not usable for gradient descent).
Use: historical, perceptron. Cons: non-differentiable, unsuitable for gradient-based learning.
3. Sigmoid (logistic)
Formula: `` φ(z) = 1 / (1 + exp(-z)) `` Derivative: `` φ'(z) = φ(z) * (1 - φ(z)) `` Pros: smooth, outputs in (0,1) => interpretable as probability (binary classification). Cons: saturates for large |z| => vanishing gradients; outputs not zero-centered; slower convergence.
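Evaluating the sigmoid formula literally can overflow for large negative z, because exp(-z) becomes huge. A minimal sketch of one common workaround, which branches on the sign of z so that exp is only ever called on a non-positive argument:

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)          # z < 0, so e is in (0, 1) and cannot overflow
    return e / (1.0 + e)

def sigmoid_grad(z: float) -> float:
    """Derivative phi'(z) = phi(z) * (1 - phi(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(-1000.0), sigmoid(0.0), sigmoid(1000.0))
print(sigmoid_grad(0.0))
```

The derivative peaks at z = 0 with value 0.25, which is the best case for gradient flow through a sigmoid unit.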
4. Tanh (hyperbolic tangent)
Formula: `` φ(z) = tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z)) `` Derivative: `` φ'(z) = 1 - tanh(z)^2 `` Pros: zero-centered outputs in (-1,1); stronger gradients near origin than sigmoid. Cons: still saturates causing vanishing gradients; slower for deep networks.
5. ReLU (Rectified Linear Unit)
Formula: `` φ(z) = max(0, z) `` Derivative: `` φ'(z) = 1 if z > 0 else 0 `` (a subgradient is used at z = 0) Pros:
- Simple and computationally cheap
- Sparse activations (many zeros)
- Mitigates vanishing gradient for positive z
- Works well empirically in deep CNNs/MLPs
Cons:
- "Dying ReLU": neurons can become permanently inactive if weights push inputs negative
- Unbounded outputs (can cause large activations)
- Not differentiable at 0 (handled via subgradient)
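The "dying ReLU" problem can be seen on a single unit. In the sketch below, the weight, bias, and inputs are hypothetical values chosen so that every pre-activation is negative; the convention of returning 0 for the derivative at z = 0 is one common choice, not the only valid subgradient.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Convention: use 0 as the subgradient at z == 0.
    return 1.0 if z > 0 else 0.0

# Hypothetical unit whose bias has drifted far into the negative range.
w, b = 0.5, -10.0
inputs = [-2.0, 0.0, 1.0, 3.0]

pre_activations = [w * x + b for x in inputs]
outputs = [relu(z) for z in pre_activations]

# d(output)/dw = relu'(z) * x and d(output)/db = relu'(z).
grads_w = [relu_grad(z) * x for z, x in zip(pre_activations, inputs)]
grads_b = [relu_grad(z) for z in pre_activations]

print(outputs, grads_w, grads_b)
```

Every output and every gradient is zero here, so gradient descent on these inputs cannot move w or b back into the active region.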
6. Leaky ReLU
Formula: `` φ(z) = z if z > 0 else α*z `` (α small, e.g., 0.01) Derivative: `` φ'(z) = 1 if z > 0 else α `` Pros: avoids dying ReLU by allowing small gradient for negative z. Cons: α is fixed, not learned (PReLU addresses this).
7. Parametric ReLU (PReLU)
Formula: `` φ(z) = z if z > 0 else a*z, with learnable a `` Pros: adaptively learns negative slope; tends to improve performance. Cons: extra parameters; risk of overfitting small models.
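The difference between Leaky ReLU and PReLU shows up in the gradients: PReLU also needs the derivative with respect to its slope a. A minimal sketch, where the function names and the example slope 0.25 are illustrative choices rather than any library's API:

```python
def leaky_relu(z, alpha=0.01):
    """Leaky ReLU with a fixed negative slope alpha."""
    return z if z > 0 else alpha * z

def prelu(z, a):
    """PReLU: same shape as Leaky ReLU, but a is a trainable parameter."""
    return z if z > 0 else a * z

def prelu_grads(z, a):
    """Return (d phi / d z, d phi / d a) at pre-activation z."""
    if z > 0:
        return 1.0, 0.0
    return a, z

print(leaky_relu(-3.0))          # small negative output instead of 0
print(prelu_grads(-2.0, 0.25))   # slope a for z, and z itself for a
print(prelu_grads(2.0, 0.25))    # a receives no gradient when z > 0
```

Because the gradient for a is nonzero only on negative inputs, the slope is learned only from examples that land in the negative region.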
8. ELU (Exponential Linear Unit)
Formula: `` φ(z) = z if z >= 0 else α(exp(z) - 1) `` Derivative: `` φ'(z) = 1 if z >= 0 else α*exp(z) `` Typical α = 1.
Pros:
- Negative outputs push mean activations toward zero (helps learning)
- Smooth for z < 0, and with α = 1 the slope is continuous at z = 0 (no abrupt slope change like ReLU)
Cons:
- Slightly more expensive to compute
- Not self-normalizing; requires careful initialization/BN
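A short sketch of ELU and its derivative, following the formulas above. Using `math.expm1` for exp(z) - 1 is an implementation choice made here to keep precision for z close to zero; it is not part of the definition.

```python
import math

def elu(z, alpha=1.0):
    """ELU: identity for z >= 0, alpha * (exp(z) - 1) otherwise."""
    return z if z >= 0 else alpha * math.expm1(z)

def elu_grad(z, alpha=1.0):
    """Derivative: 1 for z >= 0, alpha * exp(z) otherwise."""
    return 1.0 if z >= 0 else alpha * math.exp(z)

# Negative inputs saturate toward -alpha instead of being clipped to 0.
print(elu(-1000.0), elu(-1.0), elu(0.0), elu(2.0))

# With alpha = 1 the slope just left of 0 is close to the slope at 0.
print(elu_grad(-1e-9), elu_grad(0.0))
```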
9. SELU (Scaled ELU) — Self-Normalizing Neural Networks
Formula: `` φ(z) = λ (z if z > 0 else α(exp(z) - 1)) `` with specific α ≈ 1.6733 and λ ≈ 1.0507.
Pros:
- Encourages activations to converge to zero mean and unit variance when used with appropriate initialization and architecture (no BatchNorm needed).
Cons:
- Requires architecture constraints (dense feed-forward layers, with standard dropout replaced by alpha dropout) and a specific initialization (LeCun normal); less commonly used in conv nets.
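The self-normalizing property can be probed empirically. The sketch below feeds standard-normal samples through SELU and measures the output mean and variance. The constants are the unrounded values published with the SELU paper (the text above rounds them to 1.6733 and 1.0507); the seed and sample size are arbitrary, and the check covers only this one idealized input distribution, not a full network.

```python
import math
import random

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(z):
    """SELU: lambda * (z if z > 0 else alpha * (exp(z) - 1))."""
    return LAMBDA * (z if z > 0 else ALPHA * math.expm1(z))

random.seed(0)  # arbitrary seed, for reproducibility only
samples = [selu(random.gauss(0.0, 1.0)) for _ in range(100_000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f}  var={var:.3f}")
```

For zero-mean, unit-variance inputs the output statistics stay close to zero mean and unit variance, which is the fixed point the constants were chosen to produce.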
10. Softplus
Formula: `` φ(z) = log(1 + exp(z)) `` (smooth ReLU) Derivative: `` φ'(z) = 1 / (1 + exp(-z)) = sigmoid(z) `` Pros: smooth approximation to ReLU, biologically plausible. Cons: more expensive, non-sparse; exp(z) overflows for large z unless the computation is rearranged.
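One way to avoid the overflow is the identity log(1 + exp(z)) = max(z, 0) + log(1 + exp(-|z|)), which only ever exponentiates a non-positive number. A minimal sketch of that rearrangement:

```python
import math

def softplus(z: float) -> float:
    """Stable softplus: max(z, 0) + log1p(exp(-|z|))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

print(softplus(-1000.0), softplus(0.0), softplus(1000.0))
```

For large positive z the result is effectively z, and for large negative z it is effectively 0, matching the ReLU-like shape described above.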
11. Softsign
Formula: `` φ(z) = z / (1 + |z|) `` Derivative: `` φ'(z) = 1 / (1 + |z|)^2 `` Less common; bounded in (-1, 1) and saturates more slowly than tanh.
12. GELU (Gaussian Error Linear Unit)
Formula: `` φ(z) = z * Φ(z) `` where Φ is the standard normal CDF. Common tanh approximation: `` φ(z) ≈ 0.5 * z * (1 + tanh(√(2/π) * (z + 0.044715 * z^3))) `` Used in Transformers (BERT) and modern NLP models.
Pros: smooth, non-monotonic; often reported to work well in large models such as transformers. Cons: slightly more expensive than ReLU; more complex formula.
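The exact form and the tanh approximation can be compared directly, since Φ(z) = 0.5 * (1 + erf(z / √2)). The sketch below evaluates both on a grid; the grid range and spacing are arbitrary choices for illustration.

```python
import math

def gelu_exact(z: float) -> float:
    """GELU via the normal CDF: z * Phi(z)."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z: float) -> float:
    """Tanh-based approximation of GELU."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * z * (1.0 + math.tanh(c * (z + 0.044715 * z ** 3)))

# Compare the two on a grid from -6 to 6 in steps of 0.01.
grid = [i / 100.0 for i in range(-600, 601)]
max_abs_err = max(abs(gelu_exact(z) - gelu_tanh(z)) for z in grid)
print(max_abs_err)

# Non-monotonic: the value at -0.75 is lower than the value at -3.
print(gelu_exact(-3.0), gelu_exact(-0.75))
```

The approximation error is small relative to typical activation magnitudes, which is why both variants appear in practice.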
13. Swish
Formula: `` φ(z) = z * sigmoid(β z) `` (β often 1; can be trainable) Pros: smooth, non-monotonic; reported to outperform ReLU on some tasks. Cons: more expensive; benefits depend on architecture.
14. Mish
Formula: `` φ(z) = z * tanh(softplus(z)) = z * tanh(log(1 + exp(z))) `` Pros: smooth, non-monotonic; reported improvements in some vision tasks. Cons: computational cost; gains context-dependent.
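Swish and Mish are both non-monotonic: each dips below zero for moderately negative inputs before returning toward zero. The sketch below implements both and locates the dip on a coarse grid; the grid, the helper names, and β = 1 are illustrative assumptions.

```python
import math

def sigmoid(z):
    # Stable logistic function: exp is only called on non-positive values.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softplus(z):
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def swish(z, beta=1.0):
    return z * sigmoid(beta * z)

def mish(z):
    return z * math.tanh(softplus(z))

# Search z in [-5, 0] with step 0.01 for the lowest output of each function.
grid = [i / 100.0 for i in range(-500, 1)]
swish_min_z = min(grid, key=swish)
mish_min_z = min(grid, key=mish)

print(swish_min_z, swish(swish_min_z))
print(mish_min_z, mish(mish_min_z))
```

Both minima sit between z = -1.5 and z = -1, after which the outputs rise back toward zero as z becomes more negative.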
15. Hard approximations (Hard-Sigmoid, Hard-Swish)
Formulas use piecewise-linear approximations for efficiency on mobile devices (used in MobileNetV3). Pros: hardware-friendly, faster; suitable for quantization. Cons: approximate, might slightly reduce accuracy vs. smooth versions.
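As an illustration, the sketch below uses the ReLU6-based definitions from the MobileNetV3 paper: hard-sigmoid(z) = ReLU6(z + 3) / 6 and hard-swish(z) = z * hard-sigmoid(z). This is an assumption about which variant is meant; some libraries define hard-sigmoid with a different slope and offset.

```python
def relu6(z):
    """ReLU clipped at 6."""
    return min(max(z, 0.0), 6.0)

def hard_sigmoid(z):
    """Piecewise-linear stand-in for sigmoid: 0 below -3, 1 above 3."""
    return relu6(z + 3.0) / 6.0

def hard_swish(z):
    """Piecewise stand-in for swish: z * hard_sigmoid(z)."""
    return z * hard_sigmoid(z)

for z in (-4.0, -3.0, 0.0, 1.0, 3.0, 4.0):
    print(z, hard_sigmoid(z), hard_swish(z))
```

Only comparisons, additions, and multiplications are involved, which is what makes these variants attractive for quantized or mobile inference.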
16. Softmax (for multi-class output)
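Formula (for vector z): `` softmax(z)_i = exp(z_i) / sum_j exp(z_j) `` Use: final layer for mutually exclusive multi-class classification. Typically combined with cross-entropy loss.
Properties: outputs are positive and sum to 1, so they can be read as probabilities; combined with cross-entropy, the gradient with respect to the logits simplifies to softmax(z) - y (y being the one-hot target), and the loss can be computed stably from the logits via log-sum-exp.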
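A minimal sketch of a numerically stable softmax and of cross-entropy computed directly from logits. Subtracting the maximum logit before exponentiating leaves the result unchanged and prevents overflow. The function names are illustrative, and the gradient function uses the standard result that the derivative of the loss with respect to the logits is the softmax output minus the one-hot target.

```python
import math

def softmax(z):
    """Stable softmax: shift by max(z) before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(z, target):
    """-log softmax(z)[target], computed via log-sum-exp."""
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return log_sum_exp - z[target]

def grad_logits(z, target):
    """Gradient of the loss w.r.t. logits: softmax(z) - one_hot(target)."""
    p = softmax(z)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

# Large logits do not overflow, and shifting all logits changes nothing.
print(softmax([1000.0, 1001.0, 1002.0]))
print(softmax([0.0, 1.0, 2.0]))
print(grad_logits([0.2, -1.0, 0.5], target=2))
```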
Output-layer activations: choosing based on task
- Binary classification (single output): Sigmoid + binary cross-entropy (BCE). For multi-label classification with independent labels, use sigmoid on each output.
- Multi-class, mutually exclusive classification: Softmax + categorical cross-entropy.
- Regression: Linear output (identity). For bounded target ranges, one can use tanh scaled appropriately.
- Probabilistic outputs for ordinal/structured tasks may use more specialized final transforms.
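For the binary case, sigmoid and binary cross-entropy are usually fused and computed from the raw logit z, because applying sigmoid first and then taking a logarithm loses precision when the sigmoid saturates. A sketch of the standard rearrangement, loss = max(z, 0) - z*y + log(1 + exp(-|z|)); the function name here is illustrative and not a framework API:

```python
import math

def bce_with_logits(z: float, y: float) -> float:
    """Binary cross-entropy for logit z and label y in {0, 1}.

    Equivalent to -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z),
    but never evaluates log(0) or exp of a large positive number.
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

print(bce_with_logits(0.0, 1.0))       # log(2): maximally uncertain prediction
print(bce_with_logits(1000.0, 1.0))    # confident and correct: loss 0
print(bce_with_logits(1000.0, 0.0))    # confident and wrong: loss grows linearly
```

For multi-label problems the same function would be applied independently to each output and the results summed or averaged.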
Note: In frameworks, use numerically stable combined loss functions (e.g., TensorFlow's tf.nn.sigmoid_cross_entropy_with...