What are activation functions in deep learning?

Activation functions are a fundamental component of artificial neural networks: they determine how the weighted sum of a neuron's inputs is transformed into its output. Without nonlinear activation functions, a multilayer network would collapse to a linear model regardless of depth. Activation functions shape the network's representational power, optimization behavior, numerical stability, sparsity, and ultimately performance on tasks.

This article is a deep dive into activation functions: history, mathematical properties, common families, theoretical foundations, practical guidance, code examples, current research directions, and future implications.

Contents

  • Introduction and role of activation functions
  • Brief history and evolution
  • Mathematical properties and practical considerations
  • Common activation functions (formulas, derivatives, pros/cons)
  • Output-layer activations (classification/regression)
  • Theoretical foundations (universal approximation, gradients, signal propagation)
  • Practical guidance and best practices
  • Implementation examples (PyTorch / TensorFlow snippets)
  • Advanced/novel activations and research directions
  • Current state and empirical observations
  • Future implications
  • Common pitfalls and debugging tips
  • Summary
  • Key references and further reading

Introduction and role of activation functions

A neuron computes a weighted sum of inputs plus a bias:

z = w·x + b

The activation function φ transforms z into the neuron's output:

a = φ(z)

Key roles:

  • Introduce nonlinearity so networks can approximate complex functions (universal approximation).
  • Shape gradient flow during training (impacting vanishing/exploding gradients).
  • Influence sparsity and information propagation.
  • Affect convergence speed and generalization.

Without nonlinear φ, stacking layers yields another linear transformation:

φ(z) = z => linear network

so depth adds no representational power.
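
The collapse can be checked numerically. The following is a minimal sketch in plain Python; the 2×2 weights and the helper names (`matvec`, `matmul`, `add`) are made up for illustration. Two stacked layers with identity activation give the same output as a single layer with weights W2·W1 and bias W2·b1 + b2.

```python
# Two linear layers with identity activation collapse into one linear layer.
# All weights below are arbitrary illustrative values.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise vector sum."""
    return [a + b for a, b in zip(u, v)]

W1, b1 = [[1.0, -2.0], [0.5, 3.0]], [0.1, -0.2]
W2, b2 = [[2.0, 1.0], [-1.0, 4.0]], [0.3, 0.0]
x = [1.5, -0.5]

# Layer by layer: a1 = W1 x + b1, then y = W2 a1 + b2 (phi is the identity).
two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Single equivalent layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layer, one_layer)  # the two vectors agree up to rounding
```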


Brief history and evolution

  • 1943: McCulloch and Pitts proposed binary threshold (step) neurons — first formalized artificial neuron.
  • 1958: Rosenblatt's perceptron used step activation; limited to linearly separable problems.
  • 1970s–80s: Sigmoid (logistic) and tanh activations became common; differentiable, enabling gradient-based learning.
  • 1980s: Backpropagation popularized training multi-layer networks with differentiable activations.
  • 1990s–2000s: Sigmoid/tanh used widely, but deeper networks suffered from vanishing gradients.
  • 2010–2012: ReLU (Rectified Linear Unit) surged in popularity (Nair & Hinton 2010; widespread use after AlexNet 2012) because of simplicity and improved gradient flow.
  • 2010s onward: ReLU variants (LeakyReLU, PReLU), ELU, SELU, and later GELU, Swish, Mish, and hard approximations for mobile/hardware efficiency.
  • Present: Activation choice is task-dependent: ReLU default for CNNs, GELU/Swish common in transformers; specialized activations in quantized or spiking networks.

Mathematical properties and practical considerations

When choosing or designing activation functions, consider:

  • Differentiability: gradient-based optimization needs a usable gradient almost everywhere; differentiability at every point is not required (ReLU is not differentiable at 0, where a subgradient is used instead).
  • Saturation: functions that saturate (sigmoid, tanh) have near-zero gradients far from 0, causing vanishing gradients.
  • Zero-centeredness: activations centered around zero (tanh) help balance gradients; positive-only outputs (ReLU, sigmoid) shift the mean of the next layer's inputs away from zero, which can slow learning.
  • Monotonicity: monotonic activations simplify optimization analysis; non-monotonic activations (Swish, Mish) can sometimes improve performance.
  • Boundedness: bounded outputs (sigmoid/tanh) limit activation magnitude; unbounded activations (ReLU) can grow large and risk exploding activations if not controlled.
  • Sparsity: ReLU yields exact zeros for negative inputs, encouraging sparse activations and computational efficiency.
  • Smoothness: smooth activations (ELU, softplus, Swish) can improve optimization by providing continuous gradients.
  • Computational cost: simple operations (max, multiplication) are faster and more hardware-friendly than complex functions.
  • Robustness to weight initialization: some activations require careful initialization (SELU has special requirements).
  • Quantization/efficiency: hardware-friendly approximations (hard-swish, hard-sigmoid) are used for mobile networks.

Common activation functions

Below are commonly used activations with formula, derivative, and pros/cons.

Note: φ'(z) denotes derivative wrt z.

1. Linear (identity)

Formula:

φ(z) = z
φ'(z) = 1

Use: output layer for regression; not used in hidden layers (would make network linear).

Pros: simple; preserves scale.
Cons: no nonlinearity.


2. Step (Heaviside / binary threshold)

Formula:

φ(z) = 1 if z >= 0 else 0

Derivative: zero almost everywhere (not usable for gradient descent).

Use: historical, perceptron.
Cons: non-differentiable, unsuitable for gradient-based learning.


3. Sigmoid (logistic)

Formula:

φ(z) = 1 / (1 + exp(-z))
φ'(z) = φ(z) * (1 - φ(z))

Pros: smooth, outputs in (0,1) => interpretable as probability (binary classification).
Cons: saturates for large |z| => vanishing gradients; outputs not zero-centered; slower convergence.
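
The saturation mentioned above is easy to see numerically. This is a small sketch in plain Python (the function names are illustrative, not a library API). The sigmoid is written in a branch form that avoids overflow in `exp`, and the derivative uses the identity φ'(z) = φ(z)(1 - φ(z)).

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow in exp for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # safe: z < 0, so exp(z) lies in [0, 1)
    return e / (1.0 + e)

def sigmoid_grad(z):
    """Derivative phi'(z) = phi(z) * (1 - phi(z)); peaks at 0.25 for z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z={z:6.1f}  sigmoid={sigmoid(z):.6f}  grad={sigmoid_grad(z):.6f}")
```

The gradient is largest at z = 0 and shrinks quickly as |z| grows, which is the saturation that causes vanishing gradients.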


4. Tanh (hyperbolic tangent)

Formula:

φ(z) = tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z))
φ'(z) = 1 - tanh(z)^2

Pros: zero-centered outputs in (-1,1); stronger gradients near origin than sigmoid.
Cons: still saturates causing vanishing gradients; slower for deep networks.


5. ReLU (Rectified Linear Unit)

Formula:

φ(z) = max(0, z)
φ'(z) = 1 if z > 0 else 0 (subgradient at 0)

Pros:

  • Simple and computationally cheap
  • Sparse activations (many zeros)
  • Mitigates vanishing gradient for positive z
  • Works well empirically in deep CNNs/MLPs

Cons:

  • "Dying ReLU": neurons can become permanently inactive if weights push inputs negative
  • Unbounded outputs (can cause large activations)
  • Not differentiable at 0 (handled via subgradient)

6. Leaky ReLU

Formula:

φ(z) = z if z > 0 else α*z   (α small, e.g., 0.01)
φ'(z) = 1 if z > 0 else α

Pros: avoids dying ReLU by allowing small gradient for negative z.
Cons: α fixed; not learned (though PReLU addresses this).
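
To illustrate the difference, here is a minimal sketch in plain Python. The pre-activation values are hypothetical, chosen so that the neuron is negative on every sample of a batch. Under ReLU no gradient reaches the neuron's weights; under Leaky ReLU a small gradient still flows.

```python
def relu_grad(z):
    """Derivative of ReLU (0 is used at z == 0)."""
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, alpha=0.01):
    """Derivative of Leaky ReLU with negative slope alpha."""
    return 1.0 if z > 0 else alpha

# Hypothetical pre-activations of one neuron over a batch, all negative.
pre_activations = [-3.2, -0.7, -1.5, -4.1]

relu_signal = sum(relu_grad(z) for z in pre_activations)
leaky_signal = sum(leaky_relu_grad(z) for z in pre_activations)

print("ReLU gradient mass:      ", relu_signal)   # zero: the neuron cannot recover
print("Leaky ReLU gradient mass:", leaky_signal)  # small but nonzero
```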


7. Parametric ReLU (PReLU)

Formula:

φ(z) = z if z > 0 else a*z, with learnable a

Pros: adaptively learns negative slope; tends to improve performance.
Cons: extra parameters; risk of overfitting small models.


8. ELU (Exponential Linear Unit)

Formula:

φ(z) = z if z >= 0 else α*(exp(z) - 1)
φ'(z) = 1 if z >= 0 else α*exp(z)

Typical α = 1.

Pros:

  • Negative outputs push mean activations toward zero (helps learning)
  • Smooth for z < 0 (no abrupt slope change like ReLU)

Cons:

  • Slightly more expensive to compute
  • Not self-normalizing; requires careful initialization/BN

9. SELU (Scaled ELU) — Self-Normalizing Neural Networks

Formula:

φ(z) = λ * (z if z > 0 else α*(exp(z) - 1))

with specific α ≈ 1.6733 and λ ≈ 1.0507.

Pros:

  • Encourages activations to converge to zero mean and unit variance when used with appropriate initialization and architecture (no BatchNorm needed).

Cons:

  • Requires architecture constraints (dense feed-forward layers) and specific initialization; standard dropout disrupts self-normalization, so the SELU paper proposes alpha dropout instead; less commonly used in conv nets.
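
The fixed point behind self-normalization can be checked with a short simulation. This is only a sketch under a simplifying assumption: pre-activations are drawn directly from a standard normal, rather than produced by a real network. With the published constants, SELU outputs should then have mean close to 0 and variance close to 1.

```python
import math
import random

# Constants from the SELU paper (Klambauer et al., 2017).
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(z):
    """Scaled ELU: lambda * (z if z > 0 else alpha * (exp(z) - 1))."""
    return LAMBDA * (z if z > 0 else ALPHA * (math.exp(z) - 1.0))

random.seed(0)
# Assumption: pre-activations are exactly N(0, 1).
samples = [selu(random.gauss(0.0, 1.0)) for _ in range(200_000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f}  var={var:.3f}")
```

In a real network the pre-activations are only approximately Gaussian, which is one reason the architecture and initialization constraints listed above matter.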

10. Softplus

Formula:

φ(z) = log(1 + exp(z))   (smooth ReLU)
φ'(z) = 1 / (1 + exp(-z)) = sigmoid(z)

Pros: smooth approximation to ReLU, biologically plausible.
Cons: more expensive, non-sparse; large z leads to numerical issues if not handled.
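
The numerical issue for large z comes from evaluating exp(z) directly. One common rewrite is softplus(z) = max(z, 0) + log(1 + exp(-|z|)), which is algebraically identical but never exponentiates a large positive number. The sketch below is plain Python with illustrative function names.

```python
import math

def softplus(z):
    """log(1 + exp(z)) computed as max(z, 0) + log1p(exp(-|z|))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def softplus_naive(z):
    """Direct formula; math.exp overflows for large z."""
    return math.log(1.0 + math.exp(z))

print(softplus(0.0))      # equals log(2)
print(softplus(1000.0))   # finite, no overflow
try:
    softplus_naive(1000.0)
except OverflowError as err:
    print("naive version failed:", err)
```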


11. Softsign

Formula:

φ(z) = z / (1 + |z|)
φ'(z) = 1 / (1 + |z|)^2

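Less common. Bounded in (-1, 1) like tanh, but it approaches its asymptotes polynomially rather than exponentially, so it saturates more slowly.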


12. GELU (Gaussian Error Linear Unit)

Formula:

φ(z) = z * Φ(z), where Φ is the standard normal CDF
Common approximation: φ(z) ≈ 0.5 * z * (1 + tanh(√(2/π) * (z + 0.044715 * z^3)))

Used in Transformers (BERT) and modern NLP models.

Pros: Smooth, non-monotonic, empirically better for large models like transformers.
Cons: slightly more expensive than ReLU; complex formula.
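
To see how close the tanh approximation is to the exact definition, the two can be compared on a grid. This is a minimal sketch in plain Python; the exact form uses the identity Φ(z) = 0.5 * (1 + erf(z / √2)), and the function names are illustrative.

```python
import math

def gelu_exact(z):
    """z * Phi(z), with Phi the standard normal CDF via the error function."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

# Compare on an evenly spaced grid over [-5, 5].
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(z) - gelu_tanh(z)) for z in grid)
print(f"max |exact - approx| on [-5, 5]: {max_gap:.2e}")
```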


13. Swish

Formula:

φ(z) = z * sigmoid(β z) (β often 1; can be trainable)

Pros: smooth, non-monotonic; reported to match or outperform ReLU on a range of tasks (Ramachandran et al.).
Cons: more expensive; benefits depend on architecture.


14. Mish

Formula:

φ(z) = z * tanh(softplus(z)) = z * tanh(log(1 + exp(z)))

Pros: smooth, non-monotonic; reported improvements in some vision tasks.
Cons: computational cost; gains context-dependent.
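
"Non-monotonic" here means the function dips below zero for moderately negative inputs and then rises back toward zero as z becomes more negative. A small sketch in plain Python (illustrative function names) makes the dip visible for both Swish and Mish.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(z, beta=1.0):
    """z * sigmoid(beta * z)."""
    return z * sigmoid(beta * z)

def mish(z):
    """z * tanh(softplus(z)), with softplus(z) = log(1 + exp(z))."""
    return z * math.tanh(math.log1p(math.exp(z)))

for z in (-5.0, -1.0, 0.0, 1.0):
    print(f"z={z:5.1f}  swish={swish(z):8.4f}  mish={mish(z):8.4f}")
```

Both functions are more negative at z = -1 than at z = -5, so neither is monotonic on the negative axis.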


15. Hard approximations (Hard-Sigmoid, Hard-Swish)

Formulas replace the sigmoid with a piecewise-linear approximation for efficiency on mobile devices (used in MobileNetV3).

Pros: hardware-friendly, faster; suitable for quantization.
Cons: approximate, might slightly reduce accuracy vs. smooth versions.
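
As a concrete instance, the sketch below uses the ReLU6-based formulation associated with MobileNetV3: hard-sigmoid(z) = ReLU6(z + 3) / 6 and hard-swish(z) = z * hard-sigmoid(z). Some libraries define hard-sigmoid with a different slope, so treat these exact constants as an assumption. The code is plain Python with illustrative function names.

```python
import math

def relu6(z):
    """ReLU clipped at 6."""
    return min(max(z, 0.0), 6.0)

def hard_sigmoid(z):
    """Piecewise-linear stand-in for sigmoid: ReLU6(z + 3) / 6."""
    return relu6(z + 3.0) / 6.0

def hard_swish(z):
    """z * hard_sigmoid(z); needs only add, multiply, min and max."""
    return z * hard_sigmoid(z)

def swish(z):
    """Smooth reference: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"z={z:5.1f}  hard_swish={hard_swish(z):7.4f}  swish={swish(z):7.4f}")
```

Outside [-3, 3] the hard version is exactly 0 or exactly z, which is part of what makes it convenient for quantized inference.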


16. Softmax (for multi-class output)

Formula (for vector z):

softmax(z_i) = exp(z_i) / sum_j exp(z_j)

Use: final layer for mutually exclusive multi-class classification. Typically combined with cross-entropy loss.

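Properties: outputs are positive and sum to 1, so they can be read as class probabilities. Combined with cross-entropy, the gradient with respect to the logits reduces to softmax(z) - y (y is the one-hot label), which is why frameworks fuse the two into a single numerically stable operation.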
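
A direct implementation of the formula overflows once any logit is large. The usual remedy is to subtract the maximum logit first, which leaves the result unchanged because softmax is invariant to adding a constant to every input. The following is a minimal sketch in plain Python (illustrative names, not a framework API).

```python
import math

def softmax(logits):
    """Softmax with the max subtracted first, so exp never sees a large
    positive argument."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
big = softmax([1000.0, 1001.0, 1002.0])  # a naive exp(1000) would overflow

print(probs, sum(probs))
print(big)
```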


Output-layer activations: choosing based on task

  • Binary classification (single output): Sigmoid + binary cross-entropy (BCE). For multi-label classification with independent labels, use sigmoid on each output.
  • Multi-class, mutually exclusive classification: Softmax + categorical cross-entropy.
  • Regression: Linear output (identity). For bounded target ranges, one can use tanh scaled appropriately.
  • Probabilistic outputs for ordinal/structured tasks may use more specialized final transforms.

Note: In frameworks, use numerically stable combined loss functions (e.g., TensorFlow's tf.nn.sigmoid_cross_entropy_with_logits or PyTorch's BCEWithLogitsLoss) that expect logits rather than post-sigmoid probabilities.
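
The reason for working with logits can be seen in a small sketch. For logit x and label y in {0, 1}, binary cross-entropy can be rewritten as max(x, 0) - x*y + log(1 + exp(-|x|)), which stays finite where the probability-based formula breaks down. The code is plain Python with illustrative function names; it mirrors the idea behind the framework losses rather than their actual implementation.

```python
import math

def bce_with_logits(x, y):
    """Binary cross-entropy from a logit x and a label y in {0, 1}:
    max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

def bce_naive(x, y):
    """Apply the sigmoid first, then take logs of the probabilities."""
    p = 1.0 / (1.0 + math.exp(-x))
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

print(bce_with_logits(2.0, 1.0), bce_naive(2.0, 1.0))  # agree for moderate logits
print(bce_with_logits(40.0, 0.0))                      # finite, close to 40
try:
    bce_naive(40.0, 0.0)  # sigmoid rounds to 1.0, so log(1 - p) is log(0)
except ValueError as err:
    print("naive version failed:", err)
```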


Theoretical foundations

Universal approximation theorem

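Classical results (Cybenko 1989; Hornik 1991) show that a feed-forward network with one hidden layer and a continuous, bounded, non-constant activation can approximate any continuous function on a compact domain to arbitrary precision, given enough hidden units. Later work (Leshno et al., 1993) relaxes the condition to any continuous non-polynomial activation, which covers unbounded functions such as ReLU. Key point: nonlinearity is essential.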

Gradient-based optimization and activation choice

Activation derivatives determine gradient magnitudes. If derivatives are small over wide input ranges (sigmoid/tanh saturate), gradients vanish as they are multiplied through layers, making training deep nets hard. ReLU addresses this by having derivative 1 for positive inputs.
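
A back-of-the-envelope sketch of this effect, in plain Python: it assumes every layer sees the same pre-activation and that weights contribute a factor of 1, so only the activation derivatives are multiplied. Real networks are messier, but the trend is the same. The helper `chain_factor` is illustrative.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def chain_factor(grad_fn, z, depth):
    """Product of activation derivatives along a chain of `depth` layers,
    assuming each layer sees the same pre-activation z and unit weights."""
    return grad_fn(z) ** depth

for depth in (1, 5, 20):
    print(
        f"depth={depth:2d}  "
        f"sigmoid={chain_factor(sigmoid_grad, 0.0, depth):.3e}  "
        f"relu={chain_factor(relu_grad, 1.0, depth):.1f}"
    )
```

Even at z = 0, where the sigmoid derivative is at its maximum of 0.25, twenty layers shrink the factor to 0.25^20, while the ReLU factor stays at 1 on the active side.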

Signal propagation and mean-field theory

Recent theory analyzes how signals and gradients propagate in deep networks (Poole et al., Schoenholz et al.). Key concepts:

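  • Order-to-chaos transition: depending on the weight and bias variances and the activation, a deep network sits in an ordered phase (nearby inputs converge to the same representation and gradients vanish) or a chaotic phase (nearby inputs decorrelate and gradients explode). Initialization schemes such as Xavier and He scale the weight variance to the activation so that signal variance stays stable across layers.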
  • Dynamical isometry: having Jacobian singular values near 1 helps gradient propagation; orthogonal initialization and certain activations can help.
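
The dependence of initialization on the activation can be illustrated with a small simulation. This is a sketch under stated assumptions: a bias-free, fully connected ReLU network with square layers, Gaussian weights, and a single random input. The width, depth, and the helper `forward_ratio` are arbitrary illustrative choices. With Var(w) = 2 / fan_in (He) the mean squared activation is roughly preserved; with Var(w) = 1 / fan_in (what Glorot gives when fan_in equals fan_out) it roughly halves at every ReLU layer.

```python
import math
import random

def second_moment(v):
    """Mean squared value of a vector."""
    return sum(a * a for a in v) / len(v)

def forward_ratio(weight_var_scale, width=256, depth=8, seed=0):
    """Push one random input through `depth` ReLU layers whose weights are
    drawn from N(0, weight_var_scale / width); return the ratio of the final
    to the initial mean squared activation."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    start = second_moment(x)
    std = math.sqrt(weight_var_scale / width)
    for _ in range(depth):
        # Each output unit: ReLU of a fresh random weight row times x.
        x = [max(0.0, sum(rng.gauss(0.0, std) * xj for xj in x))
             for _ in range(width)]
    return second_moment(x) / start

he_ratio = forward_ratio(2.0)      # He/Kaiming: Var(w) = 2 / fan_in
glorot_ratio = forward_ratio(1.0)  # Glorot, square layers: Var(w) = 1 / fan_in
print(f"He: {he_ratio:.3f}  Glorot: {glorot_ratio:.5f}")
```

The script takes a second or two in pure Python because it samples every weight individually.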

Neural tangent kernel (NTK) and infinite-width limit

In the infinite-width regime, networks behave like kernel methods; activation function determines the kernel/feature map. Different activations correspond to different kernel functions and inductive biases.

Optimization landscape and non-monotonic activations

Non-monotonic smooth activations (Swish, Mish) can create favorable loss geometries in practice, sometimes allowing better minima or improved generalization.


Practical guidance and best practices

  • Default choice: ReLU for hidden layers is still a strong baseline for many tasks (vision, CNNs). GELU recommended for Transformer-style architectures.
  • If encountering dying ReLUs (many zero activations), try LeakyReLU, PReLU, or use proper initialization and smaller learning rates.
  • For recurrent nets (RNN/LSTM/GRU), tanh and sigmoid remain common for gating and state transforms; careful initialization and normalization help.
  • Use appropriate initialization:
    • Xavier/Glorot init for tanh/sigmoid
    • He/Kaiming init for ReLU variants
    • SELU has its own recommended initialization (LeCun normal)
  • Combine activations with normalization:
    • BatchNorm often allows using ReLU and improves training stability.
    • SELU attempts to obviate BatchNorm in specific setups.
  • For output layers, prefer using logits with numerically stable losses provided by frameworks (BCEWithLogitsLoss, CrossEntropyLoss).
  • For mobile/hardware-constrained models, use hard-swish/hard-sigmoid or ReLU6 for quantization-friendly activations.
  • Monitor gradients and activation distributions (histograms) during training. Signs of poor activation behavior:
    • Most activations zero (ReLU) => dying ReLU problem.
    • Activation variance exploding/vanishing between layers => bad initialization or learning rate.
  • Regularization: sometimes activation choice interacts with dropout and BN differently. E.g., SELU discourages dropout.
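
For the monitoring advice above, one simple diagnostic is the fraction of units that output zero for every sample in a batch. The helper below is a hypothetical sketch in plain Python operating on nested lists; with a framework you would compute the same quantity on the activation tensor.

```python
def dead_unit_fraction(batch_activations):
    """Fraction of units whose post-ReLU output is zero for every sample.
    `batch_activations` is a list of samples, each a list of unit outputs."""
    n_units = len(batch_activations[0])
    dead = sum(
        1
        for j in range(n_units)
        if all(sample[j] == 0.0 for sample in batch_activations)
    )
    return dead / n_units

# Hypothetical post-ReLU activations: 3 samples x 4 units.
acts = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 2.1],
    [0.0, 0.7, 0.0, 0.0],
]
print(dead_unit_fraction(acts))  # units 0 and 2 never fire
```

A unit that is silent on one batch is not necessarily dead for good, so a value that stays high across many batches is the more reliable warning sign.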

Implementation examples

PyTorch examples:

Using common activations:

Python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 10)

# ReLU
y_relu = F.relu(x)

# LeakyReLU
leaky = nn.LeakyReLU(negative_slope=0.01)
y_leaky = leaky(x)

# GELU
y_gelu = F.gelu(x)  # PyTorch builtin

# Swish (β=1)
def swish(x, beta=1.0):
    return x * torch.sigmoid(beta * x)

y_swish = swish(x)

# Mish
def mish(x):
    return x * torch.tanh(F.softplus(x))

y_mish = mish(x)

# Output activations example: logits -> loss (use with BCEWithLogitsLoss)
logits = torch.randn(4)
labels = torch.randint(0, 2, (4,)).float()
loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(logits, labels)

TensorFlow / Keras:

Python
import tensorflow as tf
from tensorflow.keras import activations

x = tf.random.normal((4, 10))
y = activations.relu(x)
y_gelu = activations.gelu(x)  # TF has gelu in tf.keras.activations in recent versions

Custom activation with learnable parameter (PReLU example in PyTorch):

Python
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.prelu = nn.PReLU()  # learnable alpha (negative slope)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        x = self.fc1(x)
        x = self.prelu(x)
        return self.fc2(x)

Plotting activations and derivatives (useful to inspect behavior).
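
One way to prepare such a plot is to tabulate each activation and its derivative on a grid, and to cross-check the analytic derivative against a finite difference before plotting. This is a minimal sketch in plain Python; the helper names (`sample_curve`, `numeric_grad`) and the choice of tanh and softsign are illustrative. The resulting triples can be handed to any plotting library.

```python
import math

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

def softsign(z):
    return z / (1.0 + abs(z))

def softsign_grad(z):
    return 1.0 / (1.0 + abs(z)) ** 2

def numeric_grad(f, z, h=1e-5):
    """Central finite difference, used to cross-check analytic derivatives."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

def sample_curve(f, grad, lo=-4.0, hi=4.0, n=9):
    """Return (z, f(z), f'(z)) triples on an evenly spaced grid."""
    step = (hi - lo) / (n - 1)
    zs = [lo + i * step for i in range(n)]
    return [(z, f(z), grad(z)) for z in zs]

functions = {
    "tanh": (math.tanh, tanh_grad),
    "softsign": (softsign, softsign_grad),
}
curves = {name: sample_curve(f, g) for name, (f, g) in functions.items()}

for name, points in curves.items():
    f = functions[name][0]
    for z, y, dy in points:
        gap = abs(dy - numeric_grad(f, z))  # should be tiny
        print(f"{name:9s} z={z:5.1f}  f={y:7.4f}  f'={dy:7.4f}  |gap|={gap:.1e}")
```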


Advanced and novel activation directions (research)

  • Learned activation functions:
    • PReLU learns negative slope.
    • More generally, parameterized activations (trainable coefficients) and adaptive piecewise linear units (APL) can learn shape from data.
  • Rational activations: approximate activation by ratio of polynomials; can be optimized and offer compact, flexible functions.
  • Spline-based activations: using splines to learn smooth activation shapes.
  • Mixture-of-activations: combining several activations or gating between them.
  • Activation search (AutoML/NAS): searching for optimal activation shapes as part of architecture search.
  • Non-monotone activations (Swish, Mish) that improve performance on large-scale tasks.
  • Binary/ternary activations for energy-efficient inference (Binarized Neural Networks).
  • Spiking neuron activations for neuromorphic computing and event-based sensors (activation as spike generation).
  • Activation-aware initialization: initializations tailored to activation to preserve variance and gradient norms (He/Xavier/LeCun/orthogonal).
  • Theoretical work exploring connections between activation shapes, optimization landscapes, generalization, and NTK.

Current state and empirical observations

  • ReLU remains a robust baseline for many vision architectures due to simplicity and computational efficiency.
  • Transformer architectures and large-scale NLP models often use GELU (BERT) or Swish variants.
  • For small/embedded models, hard-swish and hard-sigmoid balance performance and efficiency.
  • Non-monotonic smooth activations (Swish/Mish) yield incremental improvements on some tasks, but gains are architecture- and dataset-dependent.
  • Activation choice interacts strongly with initialization, normalization, architecture depth, and optimizer; there is no one-size-fits-all perfect activation.
  • Research trends emphasize learned activations and functions that improve gradient flow and generalization in ever-larger models.

Future implications and directions

  • Learned and flexible activations: increasing automation (NAS, meta-learning) will likely include activation optimization as part of architecture search.
  • Hardware-aware activations: as edge inference grows, design of quantization- and hardware-friendly activations (e.g., piecewise linear) will be prioritized.
  • Activation design for very large models: activations that improve optimization and generalization for billion-parameter models will be further studied (GELU/Swish-like trends).
  • Theoretical advances: deeper understanding of how activation shape affects training dynamics (NTK, mean-field, dynamical isometry) could drive principled activation design.
  • Cross-pollination with neuroscience: spiking and energy-efficient activation mechanisms may influence low-power AI hardware.
  • Combining activations with normalization and regularization to reduce reliance on complex normalization layers.

Common pitfalls and debugging tips

  • If training stalls or diverges:
    • Check learning rate; high LR can cause exploding activations.
    • Inspect activation distributions per layer (too many zeros => dying ReLU; too large variance => exploding).
    • Use appropriate initialization for chosen activation.
  • If model underfits:
    • Consider more expressive activations (Swish, Mish) or adding capacity.
  • If model overfits with complex activations:
    • Use regularization or simpler activations; parameterized activations add parameters that can overfit.
  • For numerical stability:
    • Work with logits for classification losses to avoid passing already-squashed probabilities into loss functions.
  • For mobile inference:
    • Prefer simple, piecewise-linear activations that quantize well.

Summary

Activation functions are central to deep learning: they introduce the nonlinearity necessary for deep models to learn complex functions and strongly influence optimization dynamics and generalization. From classical sigmoids and tanh to ReLU and its many variants (LeakyReLU, ELU, SELU) and modern smooth, non-monotonic activations (GELU, Swish, Mish), the design space is rich. Choices should be driven by architecture, task, hardware constraints, and empirical validation. Ongoing research is expanding activation design through learnable, hardware-aware, and theoretically grounded methods.


Key references and further reading (select)

  • McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity.
  • Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain.
  • Rumelhart, Hinton, Williams (1986). Learning representations by back-propagating errors.
  • Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. (Xavier init)
  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. (He/Kaiming init, ReLU analysis)
  • Clevert, D-A., Unterthiner, T., & Hochreiter, S. (2015). Fast and Accurate Deep Network Learning by Exponential Linear Units (ELU).
  • Klambauer, G., Unterthiner, T., Mayr, A., & Hochreiter, S. (2017). Self-Normalizing Neural Networks (SELU).
  • Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). (GELU; later adopted in Transformer architectures)
  • Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for Activation Functions. (Swish)
  • Misra, D. (2019). Mish: A Self Regularized Non-Monotonic Activation Function.
  • Poole, B. et al. (2016). Exponential expressivity in deep neural networks through transient chaos.
  • Schoenholz, S. S. et al. (2017). Deep Information Propagation.

(For framework usage, consult PyTorch/TensorFlow documentation for numerically-stable loss/activation combinations.)


If you want, I can:

  • Provide a notebook that plots common activations and their derivatives,
  • Run small experiments comparing activations on a standard dataset (MNIST/CIFAR-10) and report results,
  • Show how to implement a learnable activation (e.g., rational activation) and train it end-to-end.

Which would you prefer?