A learning path ready to make your own.

What is a recurrent neural network?

Recurrent Neural Networks (RNNs) — Concise Overview What they are: RNNs are neural networks for sequential data that maintain a time-evolving hidden state so outputs depend on current input and past inputs. They suit tasks where order and context matter (language, speech, time series, control, music, etc.). Motivation and high-level intuition Introduce memory via a hidden state vector that is updated each time step to summarize past inputs. Enable modeling variable-length sequences, capture temporal dependencies, allow online/streaming inference, and reuse parameters across time for efficiency and generalization. Historical highlights 1980s: early feedback/temporal networks (Hopfield, others). 1986–1990s: Jordan and Elman recurrent nets; BPTT formalized. 1997: LSTM introduced (Hochreiter & Schmidhuber) to address vanishing gradients. 2014: GRU proposed (Cho et al.) as a simpler gated unit. 2017–present: Transformers (attention-only) largely supersede RNNs in many NLP tasks, but RNNs remain useful for streaming and resource-constrained settings. Formal definition (essence) For input sequence x = (x1...xT), hidden state ht and optional output yt are updated recurrently, e.g.: Vanilla: ht = f(Whh ht-1 + Wxh xt + bh); yt = g(Why ht + by) Same weights apply across time (parameter sharing). Initial h0 is learned or zero. Core architectures Vanilla RNN (Elman): simple recurrence, effective for short dependencies but prone to vanishing/exploding gradients. Jordan network: uses previous output as context (less common today). LSTM: adds cell state and gates (input, forget, output) to enable long-term gradient flow and store information across many steps. GRU: merges gates and hidden/cell concepts for a lighter alternative often matching LSTM performance. Bidirectional RNNs: run forwards and backwards over a full sequence (useful when entire sequence is available). Stacked (multi-layer) RNNs: increase representational capacity by layering recurrent units. Training RNNs Forward pass: compute ht and yt each step; handle padding/masking for variable lengths. BPTT: unfold across time and backpropagate; full BPTT is expensive—truncated BPTT is commonly used. Vanishing / exploding gradients: repeated multiplications through time can shrink or blow up gradients. Causes: spectral properties of recurrent weights and activation derivatives. Remedies: gated units (LSTM/GRU), gradient clipping (norm clipping), orthogonal/unitary initializations, careful activations, normalization techniques, and truncated BPTT. Practical techniques: teacher forcing (and scheduled sampling), variational dropout, packed sequences for efficiency, Adam/RMSprop optimizers, learning-rate schedules. Common sequence modeling setups Sequence classification: encode sequence with RNN/BiRNN; pool last or aggregate states → classifier. Sequence labeling: per-step predictions (NER, POS); often BiRNN ± CRF for structured outputs. Language modeling / generation: next-token prediction with softmax; evaluated by perplexity, BLEU/ROUGE for generation tasks. Encoder–decoder (seq2seq): encoder compresses input; decoder generates output. Attention (Bahdanau/Luong) mitigates fixed-context bottleneck by aligning decoder steps to encoder states. Implementation notes Frameworks: PyTorch (nn.RNN/LSTM/GRU), TensorFlow/Keras (tf.keras.layers.LSTM/GRU). Production code needs padding, masking, batching, packed sequences, and GPU-optimized RNNs. Use packed sequences in PyTorch for variable lengths; initialize LSTM forget-bias > 0 (e.g., 1) to encourage remembering initially. Practical considerations & best practices Pad and mask sequences; use packed utilities to speed computation. Clip gradient norms (e.g., threshold 5) to avoid exploding gradients. Orthogonal initialization for recurrent weights; Xavier/Glorot for feedforward weights. Dropout: use time-consistent (variational) dropout across steps when appropriate. Choose GRU for efficiency or LSTM for strongest long-range performance; consider BiRNNs when full sequence is available. Monitor gradient norms, training vs validation loss, and task-appropriate metrics (perplexity, BLEU, F1). Applications NLP: language modeling, machine translation (seq2seq+attention), NER, POS tagging, text generation. Speech/audio: ASR, speech synthesis. Time series & forecasting, anomaly detection, music generation, control/robotics, video captioning (as part of pipelines). Current trends Transformers (self-attention) dominate many sequence tasks due to parallelism and better long-range modeling (BERT, GPT, etc.). RNNs remain relevant for streaming/online inference, parameter/compute-efficient models, edge devices, and specialized recurrent research (unitary/orthogonal RNNs, spiking networks). Hybrid models combine convolutional front-ends, recurrent short-term processing, and attention for long-range context. Future directions Efficient recurrent architectures for long/streaming sequences, unitary/orthogonal RNNs, biologically plausible/spiking recurrent models. Continual learning with recurrent memory, interpretability of recurrent states, hardware accelerators for recurrent ops, and hybrids with attention or external differentiable memory. Key references & resources Hochreiter & Schmidhuber (1997) — LSTM Elman (1990) — simple recurrent nets Cho et al. (2014) — GRU / encoder–decoder work Bahdanau et al. (2014) — attention for seq2seq Vaswani et al. (2017) — Transformers Tutorials: PyTorch/TensorFlow official RNN guides; Andrej Karpathy's RNN blog post. Bottom line: RNNs are foundational for sequential data—simple and effective for many tasks but challenged by long-range training issues that gated units mitigate. Transformers have shifted the landscape, yet RNNs keep important roles in streaming, compact models, and specialized research. Understanding RNN dynamics, architectures, and practical training techniques remains valuable for working with sequences.

Open full tree

Follow the trail that experts already trust.

Resources

16:37

What is Recurrent Neural Network (RNN)? Deep Learning Tutorial 33 (Tensorflow, Keras & Python)

codebasics467.1K views

9:51

Illustrated Guide to Recurrent Neural Networks: Understanding the Intuition

The AI Hacker366.9K views

1:01:34

MIT 6.S191 (2025): Recurrent Neural Networks, Transformers, and Attention

Alexander Amini293.0K views

Read deeper, connect wider, own the subject.

Deep Article

What is a Recurrent Neural Network?

A recurrent neural network (RNN) is a class of artificial neural networks designed to process sequential data by maintaining a hidden state that evolves over time. Unlike feedforward networks, RNNs have temporal dynamics: the output at each time step depends not only on the current input but also on previous inputs through the internal state. This makes them well-suited to tasks where order and context matter: language, speech, time series, music, control signals, and more.

This article provides a comprehensive, in-depth overview of RNNs: history, fundamental concepts, mathematical foundations, popular architectures (including LSTM and GRU), training methods, common problems and solutions, practical applications, code examples, current state-of-the-art and future directions, and recommended resources.

Table of contents

High-level intuition and motivation
Historical development
Formal definition and notation
Core RNN architectures
Vanilla RNN (Elman)
Jordan network
Long Short-Term Memory (LSTM)
Gated Recurrent Unit (GRU)
Bidirectional and stacked RNNs
Training RNNs
Forward pass equations
Backpropagation Through Time (BPTT)
Vanishing and exploding gradients
Practical training techniques
Sequence modeling tasks and architectures
Sequence classification and labeling
Sequence generation and language modeling
Encoder–decoder (seq2seq) and attention mechanisms
Implementation examples (PyTorch, TensorFlow/Keras)
Practical considerations and best practices
Applications and examples
Current trends and the role of transformers
Future directions and research topics
Key references and further reading

High-level intuition and motivation

Humans interpret sequential data by remembering context. Consider reading a sentence: each word is interpreted in light of preceding words. Classic feedforward networks lack persistence: they treat inputs independently. RNNs introduce memory via a hidden state vector that "remembers" information from previous time steps. The hidden state acts as a dynamic summary of past inputs and is updated recurrently as new data arrives.

Key motivations:

Modeling sequences (variable-length input and/or output)
Capturing temporal dependencies and context
Enabling online/streaming processing (stateful inference)
Parameter sharing across time steps reduces model size and helps generalization

Historical development

1980s: Early ideas of networks with feedback and temporal dynamics. Attributed pioneers include John Hopfield (associative memories), and early recurrent architectures.
1987–1990s: Elman networks (1990) and Jordan networks (1986) developed for tasks with temporal dependencies.
Elman (1990): Simple recurrent network with context units that store previous hidden activations.
Jordan (1986): Context units fed by previous outputs.
1990s: Backpropagation Through Time (BPTT) formalized to train recurrent nets.
1997: Hochreiter & Schmidhuber introduced Long Short-Term Memory (LSTM), addressing vanishing gradient problems and enabling learning of long-range dependencies.
2014: Cho et al. introduced Gated Recurrent Unit (GRU), a simplified gated alternative to LSTM.
2014–2020s: RNNs (LSTMs/GRUs) became standard in many sequence tasks (speech recognition, machine translation, language modeling).
2017 onwards: Transformers (attention-only architectures) disrupted the field by outperforming RNNs on many tasks. Yet RNNs remain relevant in streaming, compact models, and certain time-series contexts.

Formal definition and notation

Consider an input sequence x = (x1, x2, ..., xT), where xt ∈ R^n. Let ht ∈ R^m denote the hidden state at time t, and yt ∈ R^k denote the output at time t (optional). The recurrent update is typically:

ht = f(Whh ht-1 + Wxh xt + bh)

yt = g(Why ht + by)

Where:

Wxh: weights from input to hidden
Whh: recurrent weights (hidden to hidden)
Why: hidden to output weights
bh, by: bias vectors
f: activation function (tanh, ReLU, etc.)
g: output activation (softmax for classification, linear for regression)

Important aspects:

The same weights are applied at every time step (parameter sharing).
The initial hidden state h0 may be learned or set to zeros.
For variable-length sequences, the recurrence runs up to T.

This simple formulation is the "vanilla" or "Elman" RNN.

Core RNN architectures

1) Vanilla RNN (Elman network)

Update equations:

ht = φ(Wxh xt + Whh ht-1 + bh)
yt = ψ(Why ht + by)

Here φ is usually tanh or ReLU, and ψ depends on the task.

Strengths:

Simple, efficient for short-range dependencies.

Weaknesses:

Training over long sequences often fails due to vanishing or exploding gradients.

2) Jordan network

In a Jordan network, the context (recurrent input) is the previous output rather than previous hidden state. Less common now.

3) Long Short-Term Memory (LSTM)

LSTM introduces memory cell ct and gating mechanisms to allow gradients to flow across many time steps. LSTM's design addresses the vanishing gradient issue through the cell state and multiplicative gates.

Standard LSTM equations (one common variant):

it = σ(Wxi xt + Whi ht-1 + bi) (input gate) ft = σ(Wxf xt + Whf ht-1 + bf) (forget gate) ot = σ(Wxo xt + Who ht-1 + bo) (output gate) g t = tanh(Wxg xt + Whg ht-1 + bg) (cell candidate) ct = ft ⊙ ct-1 + it ⊙ g t (cell state update) ht = ot ⊙ tanh(ct) (hidden state / output)

Where:

σ is the sigmoid function
⊙ is element-wise multiplication
gates constrain information flow, enabling long-term storage

Advantages:

Handles long-range dependencies
Widely used in NLP, speech, time series

4) Gated Recurrent Unit (GRU)

GRU simplifies LSTM by combining gates and merging cell and hidden state:

zt = σ(Wxz xt + Whz ht-1 + bz) (update gate) rt = σ(Wxr xt + Whr ht-1 + br) (reset gate) ht~ = tanh(Wxh xt + Whh (rt ⊙ ht-1) + b) ht = (1 - zt) ⊙ ht-1 + zt ⊙ ht~

GRUs often match LSTMs in performance while being computationally cheaper.

5) Bidirectional and stacked RNNs

Bidirectional RNNs (BiRNN): Process sequence forward and backward and combine states: useful when entire sequence is available (e.g., text tagging).
Stacked (multi-layer) RNNs: Multiple recurrent layers where outputs of one layer feed the next. Improves representational capacity.

Training RNNs

Forward pass

At each time step compute hidden state and output with recurrence equations. For sequences in batches, time-major or batch-major layouts are used; sequences may be padded and masked.

Backpropagation Through Time (BPTT)

BPTT unfolds the RNN across time steps into an equivalent deep feedforward network and applies backpropagation to compute gradients. For T time steps, gradients are backpropagated through T layers.

Key issues:

BPTT across very long sequences is computationally expensive and memory intensive.
Truncated BPTT: Backpropagate gradients for a limited window (e.g., 20–50 steps), trade-off between temporal credit assignment and efficiency.

Vanishing and exploding gradients

As gradients are propagated through many time steps, repeated multiplication by weight matrices and derivatives can lead to:

Vanishing gradients: Gradients exponentially decay, making learning of long-range dependencies difficult.
Exploding gradients: Gradients grow exponentially, causing training instability.

Explanation (qualitatively): gradient ∂L/∂h_t depends on powers of Whh and derivatives of activation. If eigenvalues of Whh are less than 1, gradient decays; if greater than 1, it explodes.

Remedies:

Gated architectures (LSTM/GRU) mitigate vanishing gradients via additive cell updates and gating.
Gradient clipping (e.g., clip norm to threshold) prevents exploding gradients.
Orthogonal or unitary recurrent matrices, specialized RNN variants.
Careful initialization (e.g., orthogonal initialization).
Use of activations less prone to saturating gradients (ReLU with caution).
Layer normalization, batch normalization variants (though batch norm is trickier in recurrent settings).

Practical training techniques

Mini-batching and packing variable-length sequences
Truncated BPTT for long sequences
Teacher forcing for sequence generation (provide ground-truth previous token during training); alternatives: scheduled sampling
Regularization: dropout (variational dropout/time-step consistent dropout), weight decay
Optimizers: Adam, RMSprop, SGD with momentum
Learning rate schedules and warmup

Sequence modeling tasks and architectures

RNNs are versatile for many sequence tasks. Below are common setups and model architectures.

1) Sequence classification

Input: sequence, Output: single label (e.g., sentiment analysis).
Typical architecture: RNN (or BiRNN) encodes sequence; final hidden state pooled (last state or mean/max pooling) → classifier (softmax).
Loss: cross-entropy for classification.

2) Sequence labeling (token-level predictions)

Input: sequence, Output: label per time step (e.g., POS tagging, NER).
Architecture: BiRNN with per-step classifier, or BiRNN + CRF on top for structured outputs.

3) Sequence generation and language modeling

Task: predict next token given previous tokens.
Models: RNN/LSTM/GRU with softmax output across vocabulary.
Training: teacher forcing or variants. Evaluation: perplexity (exp of cross-entropy), BLEU (for translation), accuracy.

4) Encoder–decoder (seq2seq)

Encoder RNN reads an input sequence into a context vector (final hidden state(s)).
Decoder RNN generates output sequence conditioned on context. Without attention, compression to a single vector is limiting for long sequences.
Attention mechanisms (Bahdanau, Luong) resolve bottleneck by allowing decoder to ...

Ready to see the full tree?

Clone the preview to open the complete learning structure, practice tools, and generated study materials.