Reinforcement learning

Apr 29, 2026··

12 min read

Reinforcement Learning — A Deep Dive

Table of contents

Overview
Historical context and milestones
Core concepts and problem formulation
- Markov Decision Processes (MDPs)
- Policies, returns, and objective
- Value functions and Bellman equations
- Optimality and solution concepts
Foundational algorithms
- Dynamic programming (Value Iteration, Policy Iteration)
- Model-free RL (Monte Carlo, Temporal-Difference, SARSA, Q-learning)
- Policy search and policy gradient methods
- Actor-Critic methods
- Function approximation and Deep RL
Practical algorithmic recipes
- Q-learning pseudocode and Python example (tabular)
- DQN core loop and loss
- PPO high-level update
Theoretical foundations
- Convergence results (DP, TD, Q-learning)
- Policy gradient theorem
- Sample complexity and PAC bounds (high level)
- Off-policy vs on-policy distinctions
Challenges, pitfalls, and practical considerations
- Exploration-exploitation tradeoff
- Stability, catastrophic forgetting, and divergence
- Sample efficiency and data reuse
- Reward design and specification gaming
- Reproducibility and evaluation
Extensions and subfields
- Model-based RL
- Offline (batch) RL
- Hierarchical and options frameworks
- Multi-agent RL
- Inverse RL and imitation learning
- Partial observability and POMDPs
Applications and benchmarks
- Games (Atari, Go, Chess)
- Robotics and control (MuJoCo, real robots)
- Recommendation systems and ads
- Autonomous vehicles
- Healthcare, finance, and industrial control
Current state of the art and trends
- Deep RL successes and limitations
- Offline RL, meta-RL, and sim2real
- RL + language models (RLHF)
Future directions and open problems
- Safety, interpretability, and verification
- Sample-efficient and provably efficient methods
- Causal and generalizable RL
- Hardware/software co-design
Practical checklist and best practices
Selected references and further reading

Overview

Reinforcement learning (RL) is a subfield of machine learning concerned with how agents ought to take actions in an environment to maximize cumulative reward. Unlike supervised learning, RL focuses on sequential decision making where actions influence future states and rewards. RL blends dynamic programming, control theory, Monte Carlo methods, and, in modern times, deep learning to handle high-dimensional function approximation.

Historical context and milestones

1950s–1970s: Early ideas in optimal control and dynamic programming (Bellman).
1980s: Temporal-difference learning (Sutton), credit assignment, and initial theoretical work.
1992: Q-learning (Watkins & Dayan), a breakthrough in model-free off-policy learning.
1995–2000s: Policy gradient methods (Williams' REINFORCE), actor-critic methods, integration with function approximation.
2013–2015: Deep Q-Network (DQN) (Mnih et al.), combining deep neural networks with Q-learning for Atari — ushered in Deep RL era.
2016–2018: AlphaGo/AlphaZero (Silver et al.) and continuous control breakthroughs (DDPG, TRPO, PPO).
2018–present: Widespread application, expansion into offline RL, meta-RL, multi-agent RL, and RL for language models (e.g., RLHF).

Core concepts and problem formulation

Markov Decision Processes (MDPs)

An MDP is the standard mathematical framework for RL. It is a tuple (S, A, P, R, γ) where:

S: set of states (finite or infinite).
A: set of actions.
P(s' | s, a): transition probability kernel.
R(s, a) or R(s, a, s'): reward function (expected immediate reward).
γ ∈ [0,1): discount factor.

The Markov property: the next state distribution depends only on the current state and action, not on the history.

Policies, returns, and objective

A policy π(a | s) defines a (possibly stochastic) action distribution given a state. The return is the cumulative discounted reward: G_t = ∑{k=0}^∞ γ^k R{t+k+1}.

The RL objective is to find a policy π* that maximizes the expected return from a start state (or under a state distribution): J(π) = E_{π}[G_0].

Value functions and Bellman equations

State-value: V^π(s) = E_π[G_t | S_t = s]
Action-value (Q-function): Q^π(s,a) = E_π[G_t | S_t = s, A_t = a]

Bellman expectation equations: V^π(s) = E_{a∼π(.|s), s'∼P}[R(s,a) + γ V^π(s')] Q^π(s,a) = E_{s'∼P}[R(s,a) + γ E_{a'∼π(.|s')}[Q^π(s',a')]]

Bellman optimality equations: V*(s) = max_a E_{s'∼P}[R(s,a) + γ V*(s')] Q*(s,a) = E_{s'∼P}[R(s,a) + γ max_{a'} Q*(s',a')]

Optimal policy: π*(s) = argmax_a Q*(s,a).

Foundational algorithms

Dynamic programming (requires known model)

Value Iteration: iterate Bellman optimality update.
Policy Iteration: alternates policy evaluation and policy improvement.

Both converge to optimal value/policy in finite MDPs.

Model-free RL

Monte Carlo (MC): learn value estimates from complete returns (episodic).
Temporal-Difference (TD(0)): bootstrapping update V(s) ← V(s) + α [r + γ V(s') − V(s)].
SARSA (on-policy TD): Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') − Q(s,a)].
Q-learning (off-policy): Q(s,a) ← Q(s,a) + α [r + γ max_{a'} Q(s',a') − Q(s,a)].

Policy search and policy gradient methods

Directly parameterize a policy π_θ(a|s) and optimize J(θ) via gradients. Key result: policy gradient theorem:

∇θ J(θ) = E{π_θ} [∇_θ log π_θ(a|s) Q^{π_θ}(s,a)].

Practical estimators use advantage functions A(s,a) = Q(s,a) − V(s) to reduce variance. REINFORCE (Williams) is a Monte Carlo policy gradient estimator.

Actor-Critic methods

Combine policy (actor) and value function (critic). The critic estimates value or advantage; the actor is updated via policy gradient using the critic’s estimate. Examples: A2C/A3C, DDPG, SAC.

Function approximation and Deep RL

When state/action spaces are large, neural networks approximate value, policy, or model. Deep Q-Network (DQN) uses a neural net Q(s,a; θ) and stabilizes learning via experience replay and target networks. Continuous-action methods include DDPG, TD3, SAC, and policy gradient methods like PPO and TRPO.

Practical algorithmic recipes

Value Iteration (pseudocode)

Plain Text

Initialize V(s) arbitrarily for all s
loop until convergence:
  for each s in S:
    V(s) := max_a sum_{s'} P(s'|s,a) [ R(s,a,s') + γ V(s') ]

Q-learning (tabular) — pseudocode

Plain Text

Initialize Q(s,a) arbitrarily
for each episode:
  initialize s
  while s not terminal:
    choose a from s using policy derived from Q (e.g., ε-greedy)
    take action a, observe r, s'
    Q(s,a) := Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ]
    s := s'

Simple tabular Q-learning example (Python / numpy) for FrozenLake-like environment

Python

import numpy as np
# states = nS, actions = nA, P and R assumed accessible via env
Q = np.zeros((nS, nA))
alpha = 0.1; gamma = 0.99; eps = 0.1
for ep in range(10000):
    s = env.reset()
    done = False
    while not done:
        if np.random.rand() < eps:
            a = np.random.randint(nA)
        else:
            a = np.argmax(Q[s])
        s2, r, done, _ = env.step(a)
        Q[s,a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s,a])
        s = s2

DQN core components and loss

Q-network Q(s,a; θ)
Target network Q(s,a; θ−)
Replay buffer of transitions (s,a,r,s',done)
Minibatch training with loss: L(θ) = E_{(s,a,r,s',d) ∼ D} [ (r + γ (1−d) max_{a'} Q(s', a'; θ−) − Q(s,a; θ))^2 ]

High-level DQN loop (pseudocode)

Plain Text

initialize Q-network weights θ and target weights θ− = θ
initialize replay buffer D
for each step:
  select action ε-greedy from Q(s;θ)
  execute action, store transition in D
  if ready:
    sample minibatch from D
    compute target y = r + γ (1−done) max_a' Q(s',a';θ−)
    update θ by minimizing (Q(s,a;θ) − y)^2
  periodically update θ− ← θ

PPO (high-level)

Use clipped surrogate objective to stabilize policy updates: L^{CLIP}(θ) = E [ min( r(θ) * Â , clip(r(θ), 1−ε, 1+ε) * Â ) ] where r(θ) = π_θ(a|s) / π_{θ_old}(a|s).

Theoretical foundations

Convergence results

Dynamic programming methods converge to optimal solution for finite MDPs.
TD(0) converges to fixed point under linear function approximation and on-policy visits, with step-size satisfying Robbins-Monro conditions.
Q-learning converges to Q* with probability 1 for finite MDPs given appropriate learning rates and sufficient exploration (GLIE) — Watkins & Dayan.
With non-linear function approximation (neural nets), convergence guarantees are weak; methods rely on empirical stabilization tricks.

Policy gradient theorem Gives a principled expression for gradient of expected return. Enables stochastic gradient ascent using sample estimates. Variance reduction via baselines (value function) is standard.

Sample complexity and PAC bounds (high level) There are PAC-MDP results quantifying sample complexity required for near-optimal policies (e.g., R-MAX, E3). Recent work focuses on improving sample efficiency and deriving lower/upper bounds.

Off-policy vs on-policy distinctions

On-policy: data generated by current policy (e.g., SARSA, PPO). Typically stable but sample-inefficient.
Off-policy: uses data from other policies (e.g., Q-learning, DQN). More sample-efficient but potentially unstable with function approximation.

Challenges, pitfalls, and practical considerations

Exploration vs exploitation

Simple methods: ε-greedy, Boltzmann (softmax) exploration.
Principled methods: Upper Confidence Bound (UCB), Thompson Sampling (bandits), intrinsic motivation / curiosity (count-based, predictive models), optimism in the face of uncertainty.
Exploration is especially hard in sparse-reward or long-horizon tasks.

Stability and divergence

Bootstrapping + function approximation + off-policy learning can diverge (the deadly triad).
Practical remedies: target networks, replay buffers, gradient clipping, normalization, conservative updates (TRPO/PPO), double Q-learning, ensembles (e.g., bootstrapped DQN).

Sample efficiency

Model-based methods and offline RL attempt to reduce interactions.
Prioritized experience replay, demonstrations, and transfer/meta-learning can help.

Reward design and specification gaming

Poorly specified rewards lead to unintended behavior. Reward shaping, inverse RL, and human-in-the-loop can help.

Reproducibility and evaluation

Sensitivity to hyperparameters, random seeds, compute, and implementation details.
Use multiple seeds, standardized benchmarks, open-source implementations.

Extensions and subfields

Model-based RL

Learn or use a model P̂ and R̂ to plan (Dyna, MPC, PETS).
Model-based methods can be more sample-efficient but model bias is a challenge.

Offline (batch) RL

Learn policies from pre-collected datasets without environment interaction.
Challenges: distributional shift, overestimation of unseen actions. Algorithms: BCQ, CQL, Conservative Q-learning.

Hierarchical Reinforcement Learning

Temporal abstraction via options, skills, subgoals. Aims to improve exploration and scalability.

Multi-agent RL

Multiple agents with possibly competing or cooperative goals.
Non-stationarity, scalability, and equilibrium concepts are central.

Inverse RL and Imitation Learning

Learn a reward function or policy from demonstrations (MaxEnt IRL, GAIL).

Partial Observability and POMDPs

When states are not fully observed, methods use belief states or recurrent policies (RNNs).

Applications and benchmarks

Benchmarks

OpenAI Gym (classic control, Atari)
Arcade Learning Environment (Atari)
DeepMind Control Suite, MuJoCo (continuous control)
ProcGen, OpenAI Retro, RoboSuite, Habitat
Datasets for offline RL (D4RL)

Applications

Games: Atari (DQN), Go/Chess/Shogi (AlphaGo/AlphaZero), StarCraft (AlphaStar), MuZero (model-based latent dynamics).
Robotics: manipulation, locomotion, grasping, sim2real transfer.
Recommender systems: optimize long-term engagement and sequential recommendations.
Autonomous vehicles: decision making and control.
Healthcare: treatment policies, personalized medicine (cautious, requires safety).
Finance: trading strategies (high risk), portfolio optimization.
Industrial control and resource management.

Current state of the art and trends

Achievements

Human-level performance in many games.
Continuous control in high-dimensional spaces.
Integration of learning and search (MuZero), scalable self-play (AlphaZero).

Limitations

Sample inefficiency in many real-world tasks.
Generalization and transfer remain open problems.
Safety, interpretability, and formal guarantees are underdeveloped.

Hot topics

Offline RL: learning from logged data safely.
Model-based RL + deep learning: latent dynamics, probabilistic models.
Meta-RL: learning to quickly adapt to new tasks.
Sim2real and domain randomization.
RL for LLM alignment (RLHF), reward modeling for language tasks.
Causal RL: interventions and counterfactual reasoning to improve generalization.

Future directions and open problems

Provably efficient, sample-efficient methods for large state spaces.
Safe and verifiable RL with formal guarantees.
Combining causal inference with RL for robust generalization.
Multi-modal RL: integration with vision, language, and other modalities.
Continual and lifelong RL: learning many tasks without catastrophic forgetting.
Hardware-aware RL: accelerators and real-time control.
Societal impacts: governance frameworks, fairness, and misuse prevention.

Practical checklist and best practices

Start with simple baselines (random, heuristic, tabular if possible).
Use standardized environments and multiple random seeds.
Monitor learning curves, value estimates, and action distributions.
Stabilize training: target networks, replay buffers, normalize inputs/advantages.
Tune learning rate, discount γ, batch size, replay buffer size, and exploration schedule.
Use domain knowledge for reward shaping sparingly; prefer dense and informative rewards when possible.
When transferring sim → real, use domain randomization and robust policies.
For offline RL, be conservative w.r.t. unseen actions or use behavior cloning initializations.

Selected references and further reading

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). (classic textbook)
Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning (REINFORCE).
Mnih, V., et al. (2015). Human-level control through deep reinforcement learning (DQN).
Silver, D., et al. (2016–2018). AlphaGo, AlphaZero, MuZero papers.
Schulman, J., et al. (2015–2017). TRPO, PPO.
Lillicrap, T., et al. (2015). DDPG.
Haarnoja, T., et al. (2018). Soft Actor-Critic (SAC).
Levine, S. (2018–2020). Various works on RL for robotics and control.

Appendix — More detailed equations and pseudo/real code examples

Bellman optimality (action-value) Q*(s,a) = E_{s'}[ R(s,a) + γ max_{a'} Q*(s', a') ]

Policy gradient (episodic form) ∇θ J(θ) ≈ E [ ∑{t=0}^{T} ∇_θ log π_θ(a_t|s_t) (G_t − b(s_t)) ] where b(s_t) is a baseline (commonly V̂(s_t)).

Actor-Critic update sketch

Critic update: minimize L(φ) = E [(r + γ V_φ(s') − V_φ(s))^2].
Actor update: maximize J(θ) ≈ E [ log π_θ(a|s) * Â(s,a) ], with Â computed from critic.

DQN — PyTorch-esque skeleton (very high level)

Python

for step in range(total_steps):
    a = policy.act_epsilon_greedy(s)
    s2, r, done = env.step(a)
    replay_buffer.add(s,a,r,s2,done)
    s = s2 if not done else env.reset()

    if len(replay_buffer) > batch_size:
        s_b,a_b,r_b,s2_b,d_b = replay_buffer.sample(batch_size)
        q_pred = q_net(s_b).gather(1, a_b)
        with torch.no_grad():
            q_target_next = target_net(s2_b).max(dim=1)[0].unsqueeze(1)
            y = r_b + gamma * (1-d_b) * q_target_next
        loss = F.mse_loss(q_pred, y)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    if step % target_update == 0:
        target_net.load_state_dict(q_net.state_dict())

Conclusion

Reinforcement learning is a rich field combining theory and empiricism. Major successes have demonstrated its potential, but many foundational and practical challenges remain. Researchers and practitioners must balance algorithmic sophistication with careful engineering, principled evaluation, and safety considerations. The field continues to evolve rapidly, with intersections across model-based learning, representation learning, causality, and large-scale systems shaping the next generation of RL capabilities.

If you’d like, I can:

Provide runnable code for a specific algorithm (tabular or deep) and environment (e.g., OpenAI Gym).
Dive deeper into a particular subtopic: model-based RL, offline RL, policy gradient theory, or hierarchical RL.
Produce a reading list with annotated summaries for the most important papers.