What is reinforcement learning?

May 9, 2026

15 min read

Reinforcement Learning (RL) is a subfield of machine learning concerned with how agents ought to take actions in an environment so as to maximize cumulative reward. Unlike supervised learning, which learns from labeled examples, RL learns from interaction: an agent observes states, takes actions, receives scalar rewards, and updates behavior to improve long-term performance. RL combines ideas from dynamic programming, optimal control, psychology (trial-and-error learning), and statistics.

This article is a comprehensive deep dive into reinforcement learning: history, core concepts, mathematical foundations, algorithms, practical considerations, applications, current state-of-the-art, open challenges, and future directions. Code snippets and pseudocode illustrate implementation patterns.

Table of contents

High-level intuition
Historical background
Problem formulation: Markov Decision Processes
Core concepts and components
- States, actions, rewards, policies
- Return, discounting
- Value functions
- Bellman equations
- Optimality
Classic algorithmic families
- Dynamic programming
- Monte Carlo methods
- Temporal Difference (TD) learning
- Q-learning and SARSA
- Policy gradient methods
- Actor–critic methods
Deep Reinforcement Learning (Deep RL)
- DQN, improvements, and stability tricks
- Continuous control: DDPG, TD3, SAC
- Policy optimization: TRPO, PPO
Model-free vs model-based RL
On-policy vs off-policy; sample complexity
Exploration vs exploitation
Function approximation and the "deadly triad"
Multi-agent, hierarchical, inverse RL
Practical engineering practices
- Replay buffers, target networks, normalization
- Reward shaping and curriculum learning
- Sim-to-real transfer and domain randomization
- Evaluation and benchmarks
Applications and examples
- Games, robotics, recommender systems, healthcare, finance, and more
Current state of research
Open challenges and future directions
Example implementations (Q-learning; DQN sketch)
Recommended resources and further reading

High-level intuition

Imagine teaching a robot to walk. You cannot provide direct supervision mapping sensor inputs to optimal torques. Instead, you let the robot try actions; when it falls, give a negative reward; when it walks, give a positive reward. Over many trials, the robot learns which actions in which states lead to better cumulative outcomes.

Reinforcement learning formalizes this process: agents learn to maximize expected cumulative reward—often under uncertainty, partial observability, and delayed consequences.

Historical background

Early roots: optimal control and dynamic programming (Richard Bellman, 1950s).
Behavioral psychology: trial-and-error learning, reinforcement in animal learning.
Temporal difference learning: algorithms like TD(λ) (Sutton, 1988).
Q-learning: off-policy algorithm by Watkins (1989).
Policy gradient methods: REINFORCE (Williams, 1992).
The "deep RL" revolution: Deep Q-Networks (DQN) by Mnih et al. (2013/2015) used deep neural nets to approximate value functions, achieving human-level play on many Atari games.
A burst of progress since 2013: actor-critic deep methods (A3C, PPO), continuous control algorithms (DDPG, SAC, TD3), large-scale RL for games (AlphaGo, AlphaZero, MuZero) and simulation-to-real robotics advances.

Primary influential textbook: Sutton & Barto, "Reinforcement Learning: An Introduction".

Problem formulation: Markov Decision Processes (MDPs)

Reinforcement learning tasks are commonly modeled as Markov Decision Processes (MDPs):

An MDP is a tuple (S, A, P, R, γ) where:

S: set of states
A: set of actions (can be discrete or continuous)
P(s' | s, a): transition probability—probability of next state s' given current state s and action a
R(s, a, s'): reward function (or R(s, a)) giving scalar reward
γ ∈ [0,1): discount factor for future rewards

Objective: find a policy π(a | s) that maximizes the expected discounted return, where the return from time t is G_t = ∑_{k=0}^∞ γ^k R_{t+k+1}.

State-value function (expected return from state s when following policy π): V^π(s) = E_π[ G_t | S_t = s ]

Action-value function: Q^π(s, a) = E_π[ G_t | S_t = s, A_t = a ]

The RL problem is to find an optimal policy π* that attains the maximum of V^π(s) in every state s. Its value functions are the optimal value functions V*(s) = max_π V^π(s) and Q*(s, a) = max_π Q^π(s, a).

Core concepts and components

States, actions, rewards, and policies

State (s): a representation of the environment at a given time step.
Action (a): a choice available to the agent.
Reward (r): scalar feedback; the only learning signal.
Policy (π): a mapping from states to actions (deterministic) or to distributions over actions (stochastic).

Return and discounting

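Return: G_t = ∑_{k=0}^∞ γ^k R_{t+k+1}, the discounted sum of future rewards.
Discount factor γ balances immediate against future rewards: γ close to 1 weights long-term reward heavily, while γ close to 0 makes the agent myopic.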
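As a small illustration of the return defined above, the sketch below computes the discounted return of a finite reward sequence. The function name `discounted_return` and the example rewards are hypothetical; the backward loop relies on the recursion G_t = R_{t+1} + γ G_{t+1}.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma^k * rewards[k] for a finite reward sequence."""
    g = 0.0
    # Work backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return(rewards, gamma=1.0))  # 3.0
```

With γ = 0.5 the third reward contributes only a quarter of its face value, which is the sense in which a small γ makes the agent short-sighted.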

Value functions

State-value V^π(s): expected return from state s under policy π.
Action-value Q^π(s, a): expected return after taking action a in state s under π.

Bellman expectation equations

The Bellman expectation equations provide recursive decompositions:

V^π(s) = E_{a∼π(s), s'∼P}[ r(s,a,s') + γ V^π(s') ]
Q^π(s,a) = E_{s'∼P}[ r(s,a,s') + γ E_{a'∼π(s')}[ Q^π(s',a') ] ]
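For a finite MDP and a fixed policy, the expectation equation for V^π is a linear system, so it can be solved directly. The sketch below assumes a hypothetical two-state chain in which the policy has already been folded into the transition matrix `P_pi` and the expected reward vector `r_pi`; these names and numbers are illustrative only.

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state chain under a fixed policy: row s holds P(s' | s).
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])
# Expected one-step reward in each state under the same policy.
r_pi = np.array([1.0, 0.0])

# Matrix form of the Bellman expectation equation: V = r_pi + gamma * P_pi @ V,
# which rearranges to (I - gamma * P_pi) V = r_pi.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# The solution is a fixed point of the recursion.
assert np.allclose(V, r_pi + gamma * P_pi @ V)
print(V)
```

State 1 is absorbing with zero reward, so its value is 0, and state 0 satisfies V(0) = 1 + 0.9 · 0.5 · V(0), giving V(0) = 1 / 0.55.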

Bellman optimality equations

For optimal V* and Q*:

V*(s) = max_a E_{s'∼P}[ r(s,a,s') + γ V*(s') ]
Q*(s,a) = E_{s'∼P}[ r(s,a,s') + γ max_{a'} Q*(s',a') ]

Solving these equations is central to many RL algorithms.

Classic algorithmic families

Dynamic Programming (DP)

Requires a full model (P and R) and solves the Bellman equations via iterative updates: policy evaluation (compute V^π) and policy improvement (make the policy greedy with respect to V^π).
Examples: value iteration, policy iteration.
Converges with guarantees for finite MDPs, but impractical for large or unknown models.
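A minimal value iteration sketch follows, assuming the model is given as arrays `P[s, a, s2]` (transition probabilities) and `R[s, a]` (expected rewards). The array layout, the function name, and the two-state example MDP are all assumptions made for illustration.

```python
import numpy as np


def value_iteration(P, R, gamma, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup until V stops changing."""
    V = np.zeros(P.shape[0])
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s2 P[s, a, s2] * V[s2]
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new


# Hypothetical deterministic MDP: action 0 moves to (or stays in) state 0,
# action 1 moves to (or stays in) state 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 0] = 1.0
P[0, 1, 1] = P[1, 1, 1] = 1.0
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])

V, policy = value_iteration(P, R, gamma=0.5)
print(V, policy)
```

Staying in state 1 earns 2 per step, so V*(1) = 2 / (1 - 0.5) = 4 and V*(0) = 1 + 0.5 · 4 = 3, with the greedy policy choosing action 1 in both states.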

Monte Carlo (MC) methods

Learn from episodic experience, estimating returns by averaging actual returns following visits to states/actions.
No bootstrapping: updates use complete returns.
Useful when the model is unknown and episodes terminate.
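The averaging idea can be sketched with first-visit Monte Carlo evaluation. The episode format used here, a list of `(state, reward)` pairs where the reward is the one received after leaving that state, is an assumption chosen to keep the example short.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma):
    """Estimate V(s) as the average return following the first visit to s."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        # Compute the return from every time step, working backwards.
        g = 0.0
        returns = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            g = episode[t][1] + gamma * g
            returns[t] = g
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:  # count only the first visit per episode
                seen.add(state)
                returns_sum[state] += returns[t]
                returns_cnt[state] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}


episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("A", 0.0), ("B", 3.0)],
]
V = first_visit_mc(episodes, gamma=1.0)
print(V)  # {'A': 2.0, 'B': 2.0}
```

Each state's estimate is the mean of the complete returns 1 and 3 observed in the two episodes; no value estimate of a successor state is used, which is what "no bootstrapping" means.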

Temporal Difference (TD) learning

Combines ideas from DP and MC: learns from sampled experience like MC, but updates toward a bootstrapped estimate (one-step lookahead) like DP.
TD(0) update: V(s) ← V(s) + α [r + γ V(s') − V(s)]
TD(λ): uses eligibility traces to trade bias/variance via parameter λ ∈ [0,1].
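The TD(0) update above can be traced by hand on a tiny example. The sketch assumes a hypothetical deterministic chain 0 → 1 → 2, where state 2 is terminal and only the last transition pays a reward of 1.

```python
def td0_chain(n_episodes, alpha=0.5, gamma=1.0):
    """Run TD(0) on the chain 0 -> 1 -> 2 (terminal) and return the value table."""
    V = [0.0, 0.0, 0.0]  # V[2] is terminal and is never updated
    transitions = [(0, 0.0, 1), (1, 1.0, 2)]  # (s, r, s2)
    for _ in range(n_episodes):
        for s, r, s2 in transitions:
            td_error = r + gamma * V[s2] - V[s]
            V[s] += alpha * td_error
    return V


print(td0_chain(1))    # [0.0, 0.5, 0.0]
print(td0_chain(2))    # [0.25, 0.75, 0.0]
print(td0_chain(100))  # both non-terminal values approach 1.0
```

State 0 only starts to learn in the second episode, once V(1) has moved: reward information propagates backwards one bootstrapped step at a time.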

Q-Learning and SARSA

Q-Learning (Watkins, 1989): off-policy action-value method that learns Q* without a model.
Q(s,a) ← Q(s,a) + α [r + γ max_{a'} Q(s',a') − Q(s,a)]
Converges to Q* in the tabular setting under standard conditions (finite state and action spaces, every state-action pair visited infinitely often, and an appropriately decaying step size α).
SARSA: on-policy counterpart.
Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') − Q(s,a)], where a' is the action actually taken in s'.
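The only difference between the two updates is the bootstrap target. The sketch below isolates that difference; the function names and the small Q-table are hypothetical.

```python
import numpy as np


def q_learning_target(Q, r, s2, gamma):
    # Off-policy: bootstrap from the greedy action in s2, whatever is taken next.
    return r + gamma * np.max(Q[s2])


def sarsa_target(Q, r, s2, a2, gamma):
    # On-policy: bootstrap from the action a2 the behavior policy actually takes.
    return r + gamma * Q[s2, a2]


Q = np.array([[0.0, 0.0],
              [1.0, 5.0]])
r, s2, gamma = 0.0, 1, 0.9
a2 = 0  # suppose the agent explores and takes the worse action in s2

print(q_learning_target(Q, r, s2, gamma))  # 0.9 * 5.0 = 4.5
print(sarsa_target(Q, r, s2, a2, gamma))   # 0.9 * 1.0 = 0.9
```

When the next action is exploratory, SARSA's target reflects the cost of that exploration while Q-learning's does not, which is why SARSA evaluates the policy being followed and Q-learning evaluates the greedy one.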

Policy Gradient methods

Directly parameterize policy π_θ(a|s) and increase expected return J(θ) by gradient ascent.
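REINFORCE (Monte Carlo policy gradient): ∇_θ J(θ) = E_π[∑_t ∇_θ log π_θ(a_t|s_t) G_t], estimated in practice by averaging over sampled episodes.
Use baseline functions to reduce variance (e.g., subtract V(s_t)).
Well suited to continuous action spaces and stochastic policies, at the cost of high-variance gradient estimates.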
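To make the gradient estimate concrete, here is a sketch of REINFORCE on a hypothetical two-armed bandit, the simplest case where every episode has a single step. The softmax parameterization, learning rate, and reward values are assumptions; no baseline is used.

```python
import numpy as np


def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()


def reinforce_bandit(arm_rewards, n_steps=500, lr=0.5, seed=0):
    """Train a softmax policy over arms with single-step REINFORCE updates."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(arm_rewards))  # one preference per arm
    for _ in range(n_steps):
        probs = softmax(theta)
        a = rng.choice(len(theta), p=probs)
        G = arm_rewards[a]  # the return of a one-step episode
        # For a softmax policy, grad of log pi(a) w.r.t. theta is onehot(a) - probs.
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta += lr * G * grad_log_pi  # gradient ascent on expected return
    return softmax(theta)


probs = reinforce_bandit(arm_rewards=[0.0, 1.0])
print(probs)  # most probability mass should end up on arm 1
```

Only pulls of the rewarded arm change the parameters here, and each one raises that arm's probability. Subtracting a baseline from `G` would leave the expected gradient unchanged while reducing its variance.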

Actor–Critic methods

Combine value function (critic) with policy (actor).
Critic estimates V^π or Q^π; actor updates policy parameters using critic's signals (usually the advantage).
Can be on-policy (A2C, A3C, PPO) or off-policy (DDPG, TD3, SAC).
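One way to picture the interaction between actor and critic is a single tabular update step. The sketch below assumes a softmax actor with preferences `theta[s]` and a table critic `V`, and uses the one-step TD error as the advantage estimate; all names and step sizes are illustrative.

```python
import numpy as np


def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()


def actor_critic_step(theta, V, s, a, r, s2, done,
                      gamma=0.99, lr_actor=0.1, lr_critic=0.5):
    """Apply one in-place actor-critic update and return the TD error."""
    target = r + (0.0 if done else gamma * V[s2])
    td_error = target - V[s]       # one-sample estimate of the advantage
    V[s] += lr_critic * td_error   # critic: move V(s) toward the target
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0
    theta[s] += lr_actor * td_error * grad_log_pi  # actor: policy-gradient step
    return td_error


theta = np.zeros((2, 2))  # 2 states, 2 actions
V = np.zeros(2)
delta = actor_critic_step(theta, V, s=0, a=1, r=1.0, s2=1, done=True)
print(delta, V, theta[0])  # 1.0 [0.5 0. ] [-0.05  0.05]
```

A positive TD error means the outcome was better than the critic expected, so the actor raises the preference for the action just taken.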

Deep Reinforcement Learning (Deep RL)

Deep RL uses neural networks as function approximators for policies and/or value functions. This enables scaling to high-dimensional inputs (images, LIDAR).

Key milestones and algorithms:

DQN (Mnih et al., 2015): used convolutional networks to learn Q-values from pixels on Atari games. Stabilizing techniques: experience replay, target networks, reward clipping, and clipping of the TD error term (often implemented as a Huber loss or gradient clipping).
Improvements to DQN: Double DQN, Dueling DQN, Prioritized Experience Replay, Rainbow (combines multiple improvements).
Policy optimization: A3C (Asynchronous Advantage Actor-Critic) and its synchronous variant A2C, and PPO (Proximal Policy Optimization) for stable policy updates.
Continuous control algorithms: DDPG (Deep Deterministic Policy Gradient), TD3 (Twin Delayed DDPG), and SAC (Soft Actor-Critic), which adds entropy regularization for exploration and stability.
Model-based deep RL: learning dynamics models and planning (e.g., MBPO).
Learned planning methods: MuZero learns a model for planning in latent space.
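As an example of one DQN improvement listed above, the sketch below contrasts the standard DQN target with the Double DQN target, which selects the next action with the online network and evaluates it with the target network. Plain NumPy arrays stand in for network outputs; the function names and numbers are hypothetical.

```python
import numpy as np


def dqn_target(r, done, q_next_target, gamma):
    # Standard DQN: the target network both selects and evaluates the action.
    return r + gamma * (1.0 - done) * q_next_target.max(axis=1)


def double_dqn_target(r, done, q_next_online, q_next_target, gamma):
    # Double DQN: online network selects, target network evaluates.
    best = q_next_online.argmax(axis=1)
    q_eval = q_next_target[np.arange(len(best)), best]
    return r + gamma * (1.0 - done) * q_eval


# Batch of two transitions; rows are Q-values of the next state per action.
q_next_online = np.array([[1.0, 2.0], [3.0, 0.0]])
q_next_target = np.array([[5.0, 1.0], [2.0, 4.0]])
r = np.array([0.0, 1.0])
done = np.array([0.0, 1.0])  # the second transition is terminal

print(dqn_target(r, done, q_next_target, gamma=0.5))                        # [2.5 1. ]
print(double_dqn_target(r, done, q_next_online, q_next_target, gamma=0.5))  # [0.5 1. ]
```

Decoupling selection from evaluation is intended to reduce the overestimation that comes from taking a max over noisy estimates.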

Challenges in deep RL: instability due to bootstrapping + function approximation + off-policy updates (the “deadly triad”), sample inefficiency, and high variance gradients.

Model-free vs model-based RL

Model-free RL: learns policy and/or value without explicit dynamics model (e.g., Q-learning, policy gradient). Simpler but often sample-inefficient.
Model-based RL: learns a model P̂(s'|s,a) and uses planning or imagination to improve policy. Can be more sample-efficient but sensitive to model errors. Hybrid approaches mix both (e.g., Dyna, MBPO).

Advantages of model-based: rapid learning, sample efficiency. Disadvantages: model bias, complexity of planning.
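A Dyna-style hybrid can be sketched in a few lines: each real transition drives one direct Q-learning update, is stored in a learned model, and is then replayed for several simulated planning updates. The sketch assumes a deterministic environment so the model can be a simple dictionary; the function name and arguments are hypothetical.

```python
import random
import numpy as np


def dyna_q_update(Q, model, s, a, r, s2, alpha, gamma, n_planning, rng):
    """One Dyna-Q step: direct update, model update, then planning updates."""
    # (1) Direct RL update from the real transition.
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    # (2) Model learning: remember what this state-action pair led to.
    model[(s, a)] = (r, s2)
    # (3) Planning: replay transitions sampled from the learned model.
    for _ in range(n_planning):
        (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
        Q[ps, pa] += alpha * (pr + gamma * Q[ps2].max() - Q[ps, pa])


Q = np.zeros((2, 1))  # 2 states, 1 action
model = {}
rng = random.Random(0)
dyna_q_update(Q, model, s=0, a=0, r=1.0, s2=1,
              alpha=0.5, gamma=0.9, n_planning=3, rng=rng)
print(Q[0, 0])  # 0.9375 after one real update and three planning updates
```

A single real interaction moves Q(0, 0) from 0 to 0.5; the three simulated updates move it on to 0.9375 with no further environment steps, which is the sample-efficiency argument in miniature. An inaccurate model would be replayed just as eagerly, which is the model-bias risk.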

On-policy vs Off-policy; Sample Complexity

On-policy methods (e.g., REINFORCE, PPO, A2C): the data used for each update is collected by the current policy. Updates are often more stable, but sample efficiency is lower because old data cannot be reused directly.
Off-policy methods (e.g., Q-learning, DQN, DDPG, SAC): can learn from data collected with different policies (e.g., stored in replay buffers), enabling reuse of samples and better sample efficiency.

Sample complexity is a critical issue: many deep RL methods require millions of environment interactions. Offline (batch) RL addresses learning solely from pre-collected datasets.

Exploration vs exploitation

A core RL dilemma is the choice between:

Exploitation: choose the action that looks best under current estimates.
Exploration: try uncertain actions to discover potentially better long-term rewards.

Common exploration strategies:

ε-greedy (discrete actions)
Softmax / Boltzmann sampling
Action-noise (continuous actions): Gaussian, Ornstein-Uhlenbeck
Entropy regularization (policy gradient / SAC)
Intrinsic motivation / curiosity: reward internal novelty signals (prediction error, state-visitation counts, pseudo-counts)
Posterior sampling (Thompson sampling); Bayesian RL

Sparse rewards are particularly challenging; intrinsic rewards and shaped rewards are common remedies.
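Two of the strategies listed above, ε-greedy and Boltzmann (softmax) sampling, are short enough to sketch directly. The function names are hypothetical, and `rng` is assumed to be a NumPy `Generator`.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


def boltzmann_probs(q_values, temperature):
    """Action probabilities proportional to exp(Q / temperature)."""
    q = np.asarray(q_values, dtype=float)
    z = (q - q.max()) / temperature  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()


rng = np.random.default_rng(0)
q = np.array([1.0, 2.0, 0.0])

print(epsilon_greedy(q, epsilon=0.0, rng=rng))  # 1, always greedy when epsilon is 0
print(boltzmann_probs(q, temperature=1.0))      # favors action 1
print(boltzmann_probs(q, temperature=100.0))    # close to uniform
```

The temperature plays the role that ε plays for ε-greedy: high values spread probability almost evenly, low values approach greedy selection, but unlike ε-greedy the exploration is weighted toward actions with higher estimated value.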

Function approximation and the "deadly triad"

When using function approximators (neural nets) with bootstrapping and off-policy updates, instability or divergence can occur. This combination is called the deadly triad:

Function approximation
Bootstrapping (using current estimates to update)
Off-policy learning

DQN's engineering additions (replay buffer, target networks) were designed to mitigate these instabilities.
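A classic two-state example (discussed in Sutton & Barto) shows the divergence with almost no machinery. The sketch assumes a linear value estimate V(s) = w · φ(s) with features φ(s1) = 1 and φ(s2) = 2, all rewards equal to zero (so the true values are zero), and updates applied only to the transition s1 → s2, which is the off-policy ingredient.

```python
def run_semi_gradient_td(gamma, alpha=0.1, n_updates=100, w=1.0):
    """Repeatedly apply semi-gradient TD(0) to the single transition s1 -> s2."""
    phi_s1, phi_s2 = 1.0, 2.0
    for _ in range(n_updates):
        # Bootstrapped target uses the current estimate of V(s2) = w * phi_s2.
        td_error = 0.0 + gamma * (w * phi_s2) - (w * phi_s1)
        w += alpha * td_error * phi_s1
    return w


print(run_semi_gradient_td(gamma=0.9))  # grows far away from the true value 0
print(run_semi_gradient_td(gamma=0.4))  # shrinks toward 0
```

Each update multiplies w by 1 + α(2γ − 1), so the estimate grows without bound whenever γ > 0.5. Removing any one ingredient (a tabular representation, Monte Carlo targets, or on-policy state weighting) removes the instability in this example.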

Multi-agent, hierarchical, inverse RL

Multi-agent RL (MARL): multiple agents learning concurrently; interplay of cooperation, competition, communication. Challenges include non-stationarity and scalability.
Hierarchical RL: decompose tasks into temporally-extended subgoals or options (Options framework). Aims to improve exploration and transfer by learning reusable skills.
Inverse Reinforcement Learning (IRL) and Imitation Learning: infer reward functions from expert demonstrations; learn policies from demonstrations without explicit reward signals (Behavioral Cloning, GAIL).

Practical engineering practices

Key tricks and techniques widely used in deep RL systems:

Replay buffers: store past transitions and sample mini-batches uniformly or prioritized. Helps decorrelate data and reuse samples (crucial for off-policy).
Target networks: maintain a slowly-updated copy of network for stable bootstrapped targets (DQN).
Normalization: normalize observations and rewards; use running mean/variance.
Gradient clipping: prevents exploding gradients.
Reward shaping: modify rewards to accelerate learning, ideally while preserving the optimal policy (potential-based shaping guarantees this); beware of unintended behaviors.
Curriculum learning: start with easy tasks and gradually increase difficulty.
Domain randomization: randomize simulation parameters during training to improve real-world transfer (sim-to-real).
Safety and constraints: avoid catastrophic actions (e.g., human-in-the-loop; constrained RL techniques).
Evaluation: use held-out environments and deterministic evaluation; measure sample efficiency, stability, and generalization.
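The running mean/variance normalization mentioned in the list above can be sketched with Welford's online algorithm, which avoids storing past observations. The class name `RunningNorm` and its interface are hypothetical rather than taken from any particular library.

```python
import numpy as np


class RunningNorm:
    """Track a running mean and variance and normalize inputs with them."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)  # running sum of squared deviations
        self.count = 0
        self.eps = eps

    def update(self, x):
        x = np.asarray(x, dtype=np.float64)
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)  # Welford's update

    def normalize(self, x):
        var = self.m2 / max(self.count, 1)  # population variance
        return (np.asarray(x, dtype=np.float64) - self.mean) / np.sqrt(var + self.eps)


norm = RunningNorm(shape=(1,))
for obs in ([1.0], [2.0], [3.0]):
    norm.update(obs)

print(norm.mean, norm.m2 / norm.count)  # mean 2.0, variance 2/3
print(norm.normalize([2.0]))            # an observation at the mean maps to 0
```

In a training loop the statistics would typically be updated on every observation during training and frozen during evaluation, so that evaluation inputs are scaled the same way as the data the network was trained on.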

Benchmarks:

Arcade games: Atari 2600 (ALE)
Continuous control: MuJoCo, DeepMind Control Suite (DMControl)
Procedural generalization: Procgen
Robotics: Gym Robotics, RoboSuite
Board games and perfect-information games: Chess, Go frameworks; specialized environments for complex tasks.

Libraries and frameworks:

OpenAI Gym, Gymnasium
Stable Baselines3
RLlib (Ray)
Dopamine
CleanRL
Acme (DeepMind)
PettingZoo (multi-agent)
TF-Agents, Tianshou

Applications and examples

Games: Atari, Go, Chess, StarCraft II, Dota 2. RL agents have matched or exceeded top human players in several of these games.
Robotics: locomotion, manipulation, grasping; real-world robot control using RL with sim-to-real transfer.
Autonomous driving: decision-making, route planning, control stacks.
Recommender systems and ads: sequential decision making to maximize long-term user engagement or revenue.
Finance: portfolio optimization, algorithmic trading (subject to risk and non-stationarity).
Healthcare: treatment planning, personalized interventions (must consider safety, interpretability, and ethics).
Energy systems: grid management, demand-response control.
Natural language processing: dialogue systems and text generation; RL from human feedback (RLHF) is used to align language models.

Examples:

A robot learns to pick objects via RL with dense shaping rewards or via imitation learning plus RL fine-tuning.
AlphaZero uses self-play RL with MCTS to achieve superhuman board-game performance.
RLHF (reinforcement learning from human feedback) aligns large language models to human preferences (used in ChatGPT-like systems).

Current state of research

Continued improvements in scalable RL algorithms (PPO, SAC) and integration of model-based components (MuZero, MBPO).
RLHF is central to aligning language models.
Offline RL: learning from fixed datasets without interaction—practical for domains where interactions are expensive or risky.
Sim-to-real: domain randomization and better dynamics models to transfer policies to real robots.
Generalization and robustness: addressing overfitting to training environments and improving out-of-distribution performance.
Safety and interpretability: constrained RL, risk-sensitive objectives, and explainable policies.
Combining RL with other paradigms: supervised pretraining, representation learning, contrastive learning, and causality.

Notable high-impact demonstrations: AlphaGo/AlphaZero, MuZero, OpenAI Five (Dota), DeepMind's StarCraft agents, RLHF for language models.

Open challenges and future directions

Sample efficiency: reducing required environment interactions, especially in real-world tasks.
Safety and reliability: ensuring policies respect constraints and avoid catastrophic failures.
Interpretability: understanding why policies make certain decisions.
Transfer and continual learning: reuse skills across tasks and adapt online without catastrophic forgetting.
Exploration in complex/high-dimensional spaces with sparse rewards.
Scaling to real-world multi-modal tasks with partial observability, long horizons, and multi-agent interactions.
Theoretical understanding: convergence rates, generalization bounds for function approximation, and non-stationary settings.
Integration with causality and structured world models to improve generalization and reasoning.
Human-aware RL: integrating human preferences, fairness, and notions of social acceptability.

Example implementations

Below are concise examples: a tabular Q-learning implementation for a simple gridworld, and a PyTorch sketch of a DQN agent for discrete control.

Note: these examples are illustrative; production code and full training loops require additional infrastructure.

1) Tabular Q-learning pseudocode / minimal Python

Pseudocode:

Initialize Q(s,a) arbitrary (e.g., zeros)
For each episode:
- Initialize s
- Repeat until terminal:
  - Choose a using ε-greedy from Q(s,·)
  - Execute a; observe r, s'
  - Q(s,a) ← Q(s,a) + α [r + γ max_{a'} Q(s',a') − Q(s,a)]
  - s ← s'

Minimal Python illustration (conceptual):

Python

import random
import numpy as np

# Hypothetical environment that follows OpenAI Gym API
# env.reset() -> state
# env.step(action) -> (next_state, reward, done, info)
env = ...  

n_states = ...   # discrete states mapped to indices
n_actions = env.action_space.n

Q = np.zeros((n_states, n_actions))
alpha = 0.1
gamma = 0.99
epsilon = 0.1
n_episodes = 10000

for ep in range(n_episodes):
    state = env.reset()
    done = False
    while not done:
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done, _ = env.step(action)
        # Do not bootstrap from a terminal state: its future return is zero.
        best_next = 0.0 if done else np.max(Q[next_state])
        Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])
        state = next_state

This works for small discrete MDPs.

2) DQN sketch (PyTorch) — high level

Key ingredients: neural network Q(s,a; θ), replay buffer, target network θ−, ε-greedy, training loop.

Python

import random
from collections import deque, namedtuple

import numpy as np  # used for the epsilon schedule in select_action
import torch
import torch.nn as nn
import torch.optim as optim

Transition = namedtuple('Transition', ('s', 'a', 'r', 's2', 'done'))

class QNetwork(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, act_dim)
        )
    def forward(self, x):
        return self.net(x)

# Setup
env = ...
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.n
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

policy_net = QNetwork(obs_dim, act_dim).to(device)
target_net = QNetwork(obs_dim, act_dim).to(device)
target_net.load_state_dict(policy_net.state_dict())
optimizer = optim.Adam(policy_net.parameters(), lr=1e-3)
replay = deque(maxlen=100000)
batch_size = 64
gamma = 0.99
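eps_start, eps_end, eps_decay = 1.0, 0.01, 50000  # exponential epsilon schedule
update_target_every = 1000  # environment steps between target-network syncs

def select_action(state, step):
    # Epsilon decays exponentially from eps_start towards eps_end.
    eps = eps_end + (eps_start - eps_end) * np.exp(-1.0 * step / eps_decay)
    if random.random() < eps:
        return env.action_space.sample()
    state_t = torch.FloatTensor(state).unsqueeze(0).to(device)
    with torch.no_grad():  # action selection needs no gradient tracking
        q_vals = policy_net(state_t)
    return int(q_vals.argmax().item())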

# Training loop (simplified)
step = 0
for episode in range(1000):
    s = env.reset()
    done = False
    while not done:
        a = select_action(s, step)
        s2, r, done, _ = env.step(a)
        replay.append(Transition(s, a, r, s2, done))
        s = s2
        step += 1

        if len(replay) > batch_size:
            batch = random.sample(replay, batch_size)
            s_batch = torch.FloatTensor([t.s for t in batch]).to(device)
            a_batch = torch.LongTensor([t.a for t in batch]).unsqueeze(1).to(device)
            r_batch = torch.FloatTensor([t.r for t in batch]).to(device)
            s2_batch = torch.FloatTensor([t.s2 for t in batch]).to(device)
            done_batch = torch.FloatTensor([t.done for t in batch]).to(device)

            q_values = policy_net(s_batch).gather(1, a_batch).squeeze(1)
            with torch.no_grad():
                q_next = target_net(s2_batch).max(1)[0]
            q_target = r_batch + gamma * q_next * (1 - done_batch)

            loss = nn.functional.mse_loss(q_values, q_target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if step % update_target_every == 0:
            target_net.load_state_dict(policy_net.state_dict())

This sketch omits many practical details (reward preprocessing, prioritized replay, double DQN, dueling architecture, gradient clipping, learning rate schedules), but shows the high-level structure.

Evaluation metrics

Cumulative reward / episodic return
Sample complexity: reward vs environment interactions
Stability and repeatability across random seeds (mean ± std)
Asymptotic performance (final return) and wall-clock training time
Generalization: performance on unseen or randomized environments
Safety metrics: constraint violations, risk measures (CVaR)
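Two of these metrics are easy to compute once per-seed or per-episode returns are available. The sketch below reports mean and sample standard deviation across seeds, and a lower-tail CVaR defined here as the mean of the worst α-fraction of returns; that definition, the function names, and the sample numbers are assumptions for illustration.

```python
import numpy as np


def summarize_seeds(final_returns):
    """Mean and sample standard deviation of final returns across seeds."""
    x = np.asarray(final_returns, dtype=float)
    return x.mean(), x.std(ddof=1)


def cvar(returns, alpha=0.1):
    """Mean of the worst alpha-fraction of returns (at least one episode)."""
    x = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(x))))
    return x[:k].mean()


mean, std = summarize_seeds([1.0, 2.0, 3.0])
print(f"{mean:.2f} ± {std:.2f}")  # 2.00 ± 1.00

print(cvar([10.0, 0.0, 4.0, 2.0], alpha=0.5))  # mean of the two worst returns: 1.0
```

Reporting the tail alongside the mean matters when two agents have similar average return but one of them occasionally fails badly.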

Conclusion

Reinforcement learning is a powerful, general paradigm for decision-making under uncertainty and delayed feedback. It has matured from tabular dynamic programming to deep learning–driven algorithms capable of solving complex, high-dimensional tasks such as Atari games, Go, and robotic control. Despite impressive progress, RL faces core challenges—sample efficiency, stability, safety, and generalization—keeping it an active and exciting research area.

Whether building an agent for games, robotics, recommendation systems, or model alignment (e.g., RLHF), practitioners must carefully choose algorithmic families, manage exploration-exploitation trade-offs, engineer stable training pipelines, and evaluate agents rigorously across metrics beyond raw reward.
