What are activation functions in deep learning?

Activation functions in deep learning: concise overview

Activation functions transform a neuron's weighted sum z = w·x + b into an output a = φ(z). They introduce the nonlinearity essential for deep networks, determine gradient flow, influence sparsity and numerical stability, and strongly affect optimization and generalization.

Key roles

  • Nonlinearity: enables universal approximation; without it multilayer nets collapse to linear maps.
  • Gradient shaping: derivatives control vanishing/exploding gradients and training dynamics.
  • Signal & information propagation: affects activation distributions, sparsity, and optimization speed.
  • Hardware/efficiency: computational cost and quantization friendliness matter for deployment.

Brief history

  • 1943–1958: Step/threshold neurons and the perceptron.
  • 1970s–1980s: Sigmoid and tanh enable backpropagation for differentiable learning.
  • 2010s: ReLU popularized for deep CNNs; many variants follow (LeakyReLU, PReLU, ELU, SELU).
  • Recent: Smooth/non-monotonic activations (GELU, Swish, Mish) used in large models (e.g., transformers); hard approximations for mobile.

Practical mathematical considerations

  • Differentiability: needed for gradient-based optimization (subgradients OK for ReLU).
  • Saturation: bounded/saturating activations (sigmoid/tanh) → vanishing gradients.
  • Zero-centeredness: helps balance gradients (tanh > sigmoid).
  • Monotonicity & smoothness: non-monotonic smooth activations can yield better minima; smoothness aids optimization.
  • Bounded vs. unbounded: unbounded (ReLU) can grow large; bounded limits activations.
  • Sparsity & efficiency: ReLU yields sparse outputs; hard-piecewise forms suit quantization.
  • Initialization dependence: choose Xavier/He/LeCun according to activation.

Common activation functions (short)

  • Linear (identity): use only for regression outputs.
  • Step/Heaviside: historical; non-differentiable, not used for gradient learning.
  • Sigmoid (logistic): smooth, probabilistic output (0,1); saturates → vanishing gradients.
  • Tanh: zero-centered (-1,1); still saturates but often better than sigmoid.
  • ReLU: max(0,z). Simple, sparse, effective baseline; can suffer "dying ReLU".
  • LeakyReLU / PReLU: small negative slope (fixed or learnable) to avoid dead neurons.
  • ELU / SELU: negative outputs help centering; SELU promotes self-normalization with special init.
  • Softplus: smooth ReLU approximation; more expensive, non-sparse.
  • GELU: z·Φ(z) (or tanh approximation); used in transformers (BERT); smooth, non-monotonic.
  • Swish / Mish: smooth, non-monotonic variants with reported gains in some tasks; computationally costlier.
  • Hard approximations (Hard-Sigmoid/Hard-Swish, ReLU6): piecewise-linear for mobile/quantized inference.
  • Softmax: final-layer for mutually exclusive multi-class probability outputs (use with cross-entropy).

Output-layer recommendations

  • Binary classification: sigmoid + BCE (prefer logits + numerically stable loss functions).
  • Multi-class (exclusive): softmax + categorical cross-entropy (use logits with framework losses).
  • Regression: linear output (scale tanh if target is bounded).
  • Multi-label: independent sigmoids per output.

Theoretical foundations (brief)

  • Universal approximation: nonlinearity is essential for approximating arbitrary continuous functions.
  • Gradient propagation & initialization: activations and weight variance jointly determine whether signals/gradients vanish or explode (Xavier/He initializations depend on activation).
  • Mean-field & NTK: activation shapes set signal dynamics and infinite-width kernel behavior, shaping inductive bias.

Practical guidance & best practices

  • Default: ReLU for many CNN/MLP tasks; GELU for transformer-style models.
  • Use He/Kaiming init for ReLU variants, Xavier for tanh/sigmoid; SELU has its own init rules.
  • Combine with normalization (BatchNorm) to improve stability; SELU can avoid BN in constrained settings.
  • For dying ReLUs try LeakyReLU/PReLU, lower LR, or re-initialize problematic layers.
  • For mobile/quantized models prefer hard-piecewise activations (hard-swish, ReLU6).
  • Monitor activation histograms and gradient norms during training to detect issues.

Advanced directions & research

  • Learnable activations (PReLU, spline/rational approximations, adaptive piecewise units).
  • NAS/AutoML to search activation shapes.
  • Binary/ternary/spiking activations for energy-efficient and neuromorphic computing.
  • Theoretical work on activation-dependent optimization landscapes, NTK, and dynamical isometry.

Common pitfalls & debugging tips

  • Training stalls/diverges: check learning rate, initialization, activation variance (exploding/vanishing).
  • Many zeros in ReLU layers: may indicate dying ReLU; try LeakyReLU/PReLU or adjust LR.
  • Use logits with numerically stable loss implementations (BCEWithLogits, CrossEntropyLoss).
  • Complex activations can overfit small datasets; use regularization or simpler activations if needed.

Summary

Activation functions are central to expressivity and trainability of neural networks. ReLU remains a robust default; GELU/Swish/Mish often help in large models; hardware constraints favor hard-piecewise forms. Choice depends on architecture, task, initialization, normalization, and deployment constraints, and continues to be an active area of empirical and theoretical research.

Selected references

  • McCulloch & Pitts (1943); Rosenblatt (1958)
  • Rumelhart, Hinton & Williams (1986)
  • Glorot & Bengio (2010); He et al. (2015)
  • Clevert et al. (2015) ELU; Klambauer et al. (2017) SELU
  • Hendrycks & Gimpel (GELU); Ramachandran et al. (Swish); Misra (Mish)
  • Poole et al.; Schoenholz et al. (signal propagation / dynamical isometry)

Deep Article

What are activation functions in deep learning?

Activation functions are a fundamental component of artificial neural networks: they determine how the weighted sum of a neuron's inputs is transformed into its output. Without nonlinear activation functions, a multilayer network would collapse to a linear model regardless of depth. Activation functions shape the network's representational power, optimization behavior, numerical stability, sparsity, and ultimately performance on tasks.

This article is a deep dive into activation functions: history, mathematical properties, common families, theoretical foundations, practical guidance, code examples, current research directions, and future implications.

Contents

  • Introduction and role of activation functions
  • Brief history and evolution
  • Mathematical properties and practical considerations
  • Common activation functions (formulas, derivatives, pros/cons)
  • Output-layer activations (classification/regression)
  • Theoretical foundations (universal approximation, gradients, signal propagation)
  • Practical guidance and best practices
  • Implementation examples (PyTorch / TensorFlow snippets)
  • Advanced/novel activations and research directions
  • Future implications
  • Summary
  • Key references and further reading

Introduction and role of activation functions

A neuron computes a weighted sum of inputs plus a bias:

`` z = w·x + b ``

The activation function φ transforms z into the neuron's output:

`` a = φ(z) ``

Key roles:

  • Introduce nonlinearity so networks can approximate complex functions (universal approximation).
  • Shape gradient flow during training (impacting vanishing/exploding gradients).
  • Influence sparsity and information propagation.
  • Affect convergence speed and generalization.

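Without a nonlinear φ, stacking layers yields just another linear (affine) transformation. With φ(z) = z, two layers give `` W2·(W1·x + b1) + b2 = (W2·W1)·x + (W2·b1 + b2) ``, which is a single layer with weights W2·W1 and bias W2·b1 + b2, so depth adds no representational power.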
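The collapse can be checked numerically. The sketch below uses plain Python lists and made-up 2×2 weights (the names `W1`, `b1`, `W2`, `b2`, and the helper functions are illustrative assumptions, not part of the article). It compares a two-layer identity-activation network with the equivalent single layer, and then shows that inserting a ReLU breaks the equivalence.

```python
# Two stacked layers with identity activation collapse to one affine layer.
# All weights and inputs are made-up illustration values.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]

def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]

W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[3.0, 1.0], [-2.0, 4.0]], [0.0, -1.0]
x = [2.0, -3.0]

# Layer by layer with phi(z) = z.
two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Equivalent single layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

# Same weights, but with a ReLU between the layers.
with_relu = add(matvec(W2, relu(add(matvec(W1, x), b1))), b2)

print(two_layer, one_layer, with_relu)
```

For this input the first hidden unit is negative, so the ReLU version zeroes it and produces a different output than any single affine layer built from the same weights would.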


Brief history and evolution

  • 1943: McCulloch and Pitts proposed binary threshold (step) neurons — first formalized artificial neuron.
  • 1958: Rosenblatt's perceptron used step activation; limited to linearly separable problems.
  • 1970s–80s: Sigmoid (logistic) and tanh activations became common; differentiable, enabling gradient-based learning.
  • 1980s: Backpropagation popularized training multi-layer networks with differentiable activations.
  • 1990s–2000s: Sigmoid/tanh used widely, but deeper networks suffered from vanishing gradients.
  • 2010–2012: ReLU (Rectified Linear Unit) surged in popularity (Nair & Hinton 2010; widespread use after AlexNet 2012) because of simplicity and improved gradient flow.
  • 2010s onward: ReLU variants (LeakyReLU, PReLU), ELU, SELU, and later GELU, Swish, Mish, and hard approximations for mobile/hardware efficiency.
  • Present: Activation choice is task-dependent: ReLU default for CNNs, GELU/Swish common in transformers; specialized activations in quantized or spiking networks.

Mathematical properties and practical considerations

When choosing or designing activation functions, consider:

  • Differentiability: gradient-based optimization needs a usable derivative almost everywhere; differentiability at every point is not required (ReLU is not differentiable at 0, and a subgradient is used there).
  • Saturation: functions that saturate (sigmoid, tanh) have near-zero gradients far from 0, causing vanishing gradients.
  • Zero-centeredness: activations centered around zero (tanh) help balance gradients; non-negative outputs (ReLU, sigmoid) shift the mean input of the next layer away from zero.
  • Monotonicity: monotonic activations simplify optimization analysis; non-monotonic activations (Swish, Mish) can sometimes improve performance.
  • Boundedness: bounded outputs (sigmoid/tanh) limit activation magnitude; unbounded activations (ReLU) can grow large and risk exploding activations if not controlled.
  • Sparsity: ReLU yields exact zeros for negative inputs, encouraging sparse activations and computational efficiency.
  • Smoothness: smooth activations (ELU, softplus, Swish) can improve optimization by providing continuous gradients.
  • Computational cost: simple operations (max, multiplication) are faster and more hardware-friendly than complex functions.
  • Robustness to weight initialization: some activations require careful initialization (SELU has special requirements).
  • Quantization/efficiency: hardware-friendly approximations (hard-swish, hard-sigmoid) are used for mobile networks.
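Saturation is easy to see numerically. The following is a minimal sketch (function names are illustrative) that evaluates the derivatives of sigmoid, tanh, and ReLU at a few pre-activation values. The final line is a deliberately rough bound that ignores the weight matrices: since the sigmoid derivative never exceeds 0.25, the product of activation derivatives across 10 sigmoid layers is at most 0.25^10.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    """Derivative of the logistic sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

def d_relu(z):
    """Derivative of ReLU (taking 0 at z = 0)."""
    return 1.0 if z > 0 else 0.0

for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid'={d_sigmoid(z):.2e}  "
          f"tanh'={d_tanh(z):.2e}  relu'={d_relu(z):.0f}")

# Upper bound on the activation-derivative product through 10 sigmoid layers
# (weights ignored; this is only meant to show the order of magnitude).
max_sigmoid_chain = 0.25 ** 10
print(f"max 10-layer sigmoid factor: {max_sigmoid_chain:.2e}")
```

The sigmoid and tanh derivatives shrink rapidly as |z| grows, while the ReLU derivative stays at 1 for any positive input.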

Common activation functions

Below are commonly used activations with formula, derivative, and pros/cons.

Note: φ'(z) denotes derivative wrt z.

1. Linear (identity)

Formula:

`` φ(z) = z ``
`` φ'(z) = 1 ``

Use: output layer for regression; not used in hidden layers (would make the network linear).

Pros: simple; preserves scale. Cons: no nonlinearity.


2. Step (Heaviside / binary threshold)

Formula:

`` φ(z) = 1 if z >= 0 else 0 ``

Derivative: zero almost everywhere (not usable for gradient descent).

Use: historical, perceptron.

Cons: non-differentiable, unsuitable for gradient-based learning.


3. Sigmoid (logistic)

Formula:

`` φ(z) = 1 / (1 + exp(-z)) ``
`` φ'(z) = φ(z) * (1 - φ(z)) ``

Pros: smooth, outputs in (0,1) => interpretable as probability (binary classification).

Cons: saturates for large |z| => vanishing gradients; outputs not zero-centered; slower convergence.


4. Tanh (hyperbolic tangent)

Formula:

`` φ(z) = tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z)) ``
`` φ'(z) = 1 - tanh(z)^2 ``

Pros: zero-centered outputs in (-1,1); stronger gradients near origin than sigmoid.

Cons: still saturates, causing vanishing gradients; slower for deep networks.


5. ReLU (Rectified Linear Unit)

Formula:

`` φ(z) = max(0, z) ``
`` φ'(z) = 1 if z > 0 else 0 (subgradient at 0) ``

Pros:

  • Simple and computationally cheap
  • Sparse activations (many zeros)
  • Mitigates vanishing gradient for positive z
  • Works well empirically in deep CNNs/MLPs

Cons:

  • "Dying ReLU": neurons can become permanently inactive if weights push inputs negative
  • Unbounded outputs (can cause large activations)
  • Not differentiable at 0 (handled via subgradient)

6. Leaky ReLU

Formula:

`` φ(z) = z if z > 0 else α*z (α small, e.g., 0.01) ``
`` φ'(z) = 1 if z > 0 else α ``

Pros: avoids dying ReLU by allowing a small gradient for negative z.

Cons: α is fixed, not learned (PReLU addresses this).
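A small sketch of why the negative slope matters for gradient flow. The function names and the sample pre-activations are illustrative assumptions; α = 0.01 follows the example value given above.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    """ReLU derivative, taking 0 at z = 0."""
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    """Leaky ReLU derivative: alpha on the non-positive side."""
    return 1.0 if z > 0 else alpha

# Hypothetical pre-activations of one unit across four inputs.
pre_activations = [-3.0, -0.5, 0.0, 2.0]

relu_grads = [relu_grad(z) for z in pre_activations]
leaky_grads = [leaky_relu_grad(z) for z in pre_activations]

print("ReLU grads:      ", relu_grads)
print("Leaky ReLU grads:", leaky_grads)
```

If a unit's pre-activation is negative for every input, the ReLU gradient is zero everywhere and the unit's weights stop updating; with Leaky ReLU a gradient of α still flows, so the unit can recover.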


7. Parametric ReLU (PReLU)

Formula:

`` φ(z) = z if z > 0 else a*z, with learnable a ``

Pros: adaptively learns negative slope; tends to improve performance.

Cons: extra parameters; risk of overfitting small models.


8. ELU (Exponential Linear Unit)

Formula:

`` φ(z) = z if z >= 0 else α * (exp(z) - 1) ``
`` φ'(z) = 1 if z >= 0 else α * exp(z) ``

Typical α = 1.

Pros:

  • Negative outputs push mean activations toward zero (helps learning)
  • Smooth for z < 0, and with α = 1 the slope is continuous at 0 (no abrupt slope change like ReLU)

Cons:

  • Slightly more expensive to compute
  • Not self-normalizing; requires careful initialization/BN

9. SELU (Scaled ELU) — Self-Normalizing Neural Networks

Formula:

`` φ(z) = λ * (z if z > 0 else α * (exp(z) - 1)) ``

with specific α ≈ 1.6733 and λ ≈ 1.0507.

Pros:

  • Encourages activations to converge to zero mean and unit variance when used with appropriate initialization and architecture (no BatchNorm needed).

Cons:

  • Requires architecture constraints (dense feed-forward layers; standard dropout replaced by the "alpha dropout" variant proposed alongside SELU) and specific initialization; less commonly used in conv nets.
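The self-normalizing property can be illustrated with a quick simulation. This is a sketch under stated assumptions: the constants below are higher-precision versions of the rounded α and λ quoted above, and the inputs are drawn directly from a standard normal rather than produced by a real network layer. With those inputs, the SELU outputs should again have mean close to 0 and variance close to 1.

```python
import math
import random

# Higher-precision values of the constants quoted above (alpha ~ 1.6733,
# lambda ~ 1.0507); the extra digits are not from the article.
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(z):
    """Scaled ELU: lambda * (z if z > 0 else alpha * (exp(z) - 1))."""
    return LAMBDA * (z if z > 0 else ALPHA * (math.exp(z) - 1.0))

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [selu(rng.gauss(0.0, 1.0)) for _ in range(200_000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.4f}  variance={var:.4f}")
```

This only checks the fixed point for a single activation applied to ideal inputs; in a real network the property also depends on the weight initialization and architecture constraints listed above.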

10. Softplus

Formula:

`` φ(z) = log(1 + exp(z)) (smooth ReLU) ``
`` φ'(z) = 1 / (1 + exp(-z)) = sigmoid(z) ``

Pros: smooth approximation to ReLU, biologically plausible.

Cons: more expensive, non-sparse; a naive evaluation of exp(z) overflows for large z unless a numerically stable formulation is used.
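One common way to handle the overflow is the identity log(1 + exp(z)) = max(z, 0) + log(1 + exp(-|z|)), which never exponentiates a positive number. The sketch below contrasts it with the naive form; the function names are illustrative.

```python
import math

def softplus_naive(z):
    """Direct translation of the formula; math.exp overflows for large z."""
    return math.log(1.0 + math.exp(z))

def softplus(z):
    """Stable form: max(z, 0) + log1p(exp(-|z|))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

for z in (-50.0, -1.0, 0.0, 1.0, 50.0):
    print(z, softplus(z))

# The naive version fails where the stable one does not.
try:
    softplus_naive(1000.0)
except OverflowError as err:
    print("naive softplus(1000) failed:", err)
print("stable softplus(1000) =", softplus(1000.0))
```

For moderate inputs both versions agree to floating-point precision; the difference only shows up at the extremes.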


11. Softsign

Formula:

`` φ(z) = z / (1 + |z|) ``
`` φ'(z) = 1 / (1 + |z|)^2 ``

Less common; bounded in (-1, 1) and approaches its asymptotes polynomially, so it saturates more slowly than tanh.


12. GELU (Gaussian Error Linear Unit)

Formula:

`` φ(z) = z * Φ(z) ``

where Φ is the standard normal CDF. Common tanh approximation:

`` φ(z) ≈ 0.5 * z * (1 + tanh(√(2/π) * (z + 0.044715 * z^3))) ``

Used in Transformers (BERT) and modern NLP models.

Pros: smooth, non-monotonic; widely adopted in large models such as transformers, where it is reported to work well empirically.

Cons: slightly more expensive than ReLU; more complex formula.
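The exact form and the tanh approximation can be compared directly, since Φ(z) = 0.5 · (1 + erf(z / √2)). The following is a minimal sketch with illustrative function names; the grid of test points is an arbitrary choice.

```python
import math

def gelu_exact(z):
    """z * Phi(z), with the normal CDF written via the error function."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    """Tanh approximation with the 0.044715 cubic coefficient."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * z * (1.0 + math.tanh(c * (z + 0.044715 * z ** 3)))

# Largest absolute gap on a grid over [-5, 5] with step 0.01.
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(z) - gelu_tanh(z)) for z in grid)
print(f"max |exact - tanh approx| on grid: {max_gap:.2e}")

# Non-monotonicity: the function dips below zero and comes back up.
print(gelu_exact(-3.0), gelu_exact(-0.75), gelu_exact(-0.1))
```

The dip for moderately negative inputs is what makes GELU non-monotonic: the value at z = -0.75 is lower than the values at both z = -3 and z = -0.1.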


13. Swish

Formula:

`` φ(z) = z * sigmoid(β * z) (β often 1; can be trainable) ``

Pros: smooth, non-monotonic; reported to outperform ReLU on some tasks.

Cons: more expensive; benefits depend on architecture.


14. Mish

Formula:

`` φ(z) = z * tanh(softplus(z)) = z * tanh(log(1 + exp(z))) ``

Pros: smooth, non-monotonic; reported improvements in some vision tasks.

Cons: computational cost; gains context-dependent.
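Swish and Mish share the same qualitative shape: roughly linear for large positive z, close to zero for large negative z, with a small negative dip in between. The sketch below (illustrative names, β fixed at 1 for Swish, and a stable softplus inside Mish) evaluates both at a few points to show that dip.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(z, beta=1.0):
    """z * sigmoid(beta * z)."""
    return z * sigmoid(beta * z)

def softplus(z):
    """Stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def mish(z):
    """z * tanh(softplus(z))."""
    return z * math.tanh(softplus(z))

for z in (-3.0, -1.2, -0.5, 0.0, 2.0):
    print(f"z={z:5.1f}  swish={swish(z):8.4f}  mish={mish(z):8.4f}")
```

Both functions are lower near z ≈ -1.2 than at z = -3 or z = -0.5, which is the non-monotonic behavior referred to above.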


15. Hard approximations (Hard-Sigmoid, Hard-Swish)

Formulas use piecewise-linear approximations for efficiency on mobile devices (used in MobileNetV3).

Pros: hardware-friendly, faster; suitable for quantization.

Cons: approximate, might slightly reduce accuracy vs. smooth versions.
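As a concrete sketch, the ReLU6-based forms used in MobileNetV3 are hard-sigmoid(z) = ReLU6(z + 3) / 6 and hard-swish(z) = z · ReLU6(z + 3) / 6. Other libraries define hard-sigmoid with different slopes, so treat the exact constants below as one convention rather than the only one.

```python
def relu6(z):
    """ReLU clipped at 6: min(max(0, z), 6)."""
    return min(max(0.0, z), 6.0)

def hard_sigmoid(z):
    """Piecewise-linear sigmoid substitute: 0 below -3, 1 above 3."""
    return relu6(z + 3.0) / 6.0

def hard_swish(z):
    """Piecewise substitute for swish: z * hard_sigmoid(z)."""
    return z * relu6(z + 3.0) / 6.0

for z in (-4.0, -3.0, 0.0, 1.0, 3.0, 5.0):
    print(f"z={z:5.1f}  hard_sigmoid={hard_sigmoid(z):.4f}  "
          f"hard_swish={hard_swish(z):.4f}")
```

Only comparisons, additions, and multiplications are involved, which is why these forms are cheaper than exponentials and map well to fixed-point arithmetic.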


16. Softmax (for multi-class output)

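Formula (for vector z):

`` softmax(z)_i = exp(z_i) / sum_j exp(z_j) ``

Use: final layer for mutually exclusive multi-class classification. Typically combined with cross-entropy loss.

Properties: outputs are positive and sum to 1, so they can be read as class probabilities. Combined with cross-entropy loss, the gradient with respect to the logits simplifies to p - y (predicted probabilities minus the one-hot target), and frameworks compute the pair in a numerically stable fused form.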
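A standard way to keep softmax numerically safe is to subtract the largest logit before exponentiating; this leaves the result unchanged because softmax is invariant to adding a constant to every logit. A minimal sketch (the function name and sample logits are illustrative):

```python
import math

def softmax(logits):
    """Softmax with max-subtraction so exp never sees a large positive input."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

small = softmax([0.0, 1.0, 2.0])
large = softmax([1000.0, 1001.0, 1002.0])  # naive exp(1000) would overflow

print(small)
print(large)
print(sum(large))
```

Both calls return the same probabilities, because the two logit vectors differ only by a constant shift.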


Output-layer activations: choosing based on task

  • Binary classification (single output): Sigmoid + binary cross-entropy (BCE). For multi-label classification with independent labels, use sigmoid on each output.
  • Multi-class, mutually exclusive classification: Softmax + categorical cross-entropy.
  • Regression: Linear output (identity). For bounded target ranges, one can use tanh scaled appropriately.
  • Probabilistic outputs for ordinal/structured tasks may use more specialized final transforms.
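For the binary case, applying sigmoid and then taking a log can lose precision or overflow when the logit is large in magnitude. A commonly used rearrangement computes the loss directly from the logit z and label y as max(z, 0) - z·y + log(1 + exp(-|z|)). The sketch below is a plain-Python illustration of that idea with hypothetical function names, not any framework's actual implementation.

```python
import math

def bce_naive(z, y):
    """Sigmoid followed by the textbook cross-entropy; fragile for large |z|."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def bce_with_logits(z, y):
    """Stable binary cross-entropy computed directly from the logit z."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# The two agree for moderate logits.
print(bce_naive(2.0, 1.0), bce_with_logits(2.0, 1.0))

# For extreme logits only the stable form is usable.
print(bce_with_logits(1000.0, 1.0))   # confident and correct
print(bce_with_logits(1000.0, 0.0))   # confident and wrong
```

This is the reason the recommendations above pair a raw-logit output with a loss that applies the sigmoid internally, rather than applying the sigmoid in the model and the log in the loss.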

Note: In frameworks, use numerically stable combined loss functions (e.g., TensorFlow's tf.nn.sigmoid_cross_entropy_with...
