Reinforcement Learning — A Deep Dive
Table of contents
- Overview
- Historical context and milestones
- Core concepts and problem formulation
- Markov Decision Processes (MDPs)
- Policies, returns, and objective
- Value functions and Bellman equations
- Optimality and solution concepts
- Foundational algorithms
- Dynamic programming (Value Iteration, Policy Iteration)
- Model-free RL (Monte Carlo, Temporal-Difference, SARSA, Q-learning)
- Policy search and policy gradient methods
- Actor-Critic methods
- Function approximation and Deep RL
- Practical algorithmic recipes
- Q-learning pseudocode and Python example (tabular)
- DQN core loop and loss
- PPO high-level update
- Theoretical foundations
- Convergence results (DP, TD, Q-learning)
- Policy gradient theorem
- Sample complexity and PAC bounds (high level)
- Off-policy vs on-policy distinctions
- Challenges, pitfalls, and practical considerations
- Exploration-exploitation tradeoff
- Stability, catastrophic forgetting, and divergence
- Sample efficiency and data reuse
- Reward design and specification gaming
- Reproducibility and evaluation
- Extensions and subfields
- Model-based RL
- Offline (batch) RL
- Hierarchical and options frameworks
- Multi-agent RL
- Inverse RL and imitation learning
- Partial observability and POMDPs
- Applications and benchmarks
- Games (Atari, Go, Chess)
- Robotics and control (MuJoCo, real robots)
- Recommendation systems and ads
- Autonomous vehicles
- Healthcare, finance, and industrial control
- Current state of the art and trends
- Deep RL successes and limitations
- Offline RL, meta-RL, and sim2real
- RL + language models (RLHF)
- Future directions and open problems
- Safety, interpretability, and verification
- Sample-efficient and provably efficient methods
- Causal and generalizable RL
- Hardware/software co-design
- Practical checklist and best practices
- Selected references and further reading
Overview
Reinforcement learning (RL) is a subfield of machine learning concerned with how agents ought to take actions in an environment to maximize cumulative reward. Unlike supervised learning, RL focuses on sequential decision making where actions influence future states and rewards. RL blends dynamic programming, control theory, Monte Carlo methods, and, in modern times, deep learning to handle high-dimensional function approximation.
Historical context and milestones
- 1950s–1970s: Early ideas in optimal control and dynamic programming (Bellman).
- 1980s: Temporal-difference learning (Sutton), credit assignment, and initial theoretical work.
- 1992: Q-learning (Watkins & Dayan), a breakthrough in model-free off-policy learning.
- 1995–2000s: Policy gradient methods (Williams' REINFORCE), actor-critic methods, integration with function approximation.
- 2013–2015: Deep Q-Network (DQN) (Mnih et al.), combining deep neural networks with Q-learning for Atari — ushered in Deep RL era.
- 2016–2018: AlphaGo/AlphaZero (Silver et al.) and continuous control breakthroughs (DDPG, TRPO, PPO).
- 2018–present: Widespread application, expansion into offline RL, meta-RL, multi-agent RL, and RL for language models (e.g., RLHF).
Core concepts and problem formulation
Markov Decision Processes (MDPs)
An MDP is the standard mathematical framework for RL. It is a tuple (S, A, P, R, γ) where:
- S: set of states (finite or infinite).
- A: set of actions.
- P(s' | s, a): transition probability kernel.
- R(s, a) or R(s, a, s'): reward function (expected immediate reward).
- γ ∈ [0,1): discount factor.
The Markov property: the next state distribution depends only on the current state and action, not on the history.
Policies, returns, and objective
A policy π(a | s) defines a (possibly stochastic) action distribution given a state. The return is the cumulative discounted reward: Gt = ∑{k=0}^∞ γ^k R_{t+k+1}.
The RL objective is to find a policy π* that maximizes the expected return from a start state (or under a state distribution): J(π) = E{π}[G0].
Value functions and Bellman equations
- State-value: V^π(s) = Eπ[Gt | S_t = s]
- Action-value (Q-function): Q^π(s,a) = Eπ[Gt | St = s, At = a]
Bellman expectation equations: V^π(s) = E{a∼π(.|s), s'∼P}[R(s,a) + γ V^π(s')] Q^π(s,a) = E{s'∼P}[R(s,a) + γ E_{a'∼π(.|s')}[Q^π(s',a')]]
Bellman optimality equations: V(s) = maxa E{s'∼P}[R(s,a) + γ V(s')] Q(s,a) = E{s'∼P}[R(s,a) + γ max{a'} Q(s',a')]
Optimal policy: π(s) = argmax_a Q(s,a).
Foundational algorithms
Dynamic programming (requires known model)
- Value Iteration: iterate Bellman optimality update.
- Policy Iteration: alternates policy evaluation and policy improvement.
Both converge to optimal value/policy in finite MDPs.
Model-free RL
- Monte Carlo (MC): learn value estimates from complete returns (episodic).
- Temporal-Difference (TD(0)): bootstrapping update V(s) ← V(s) + α [r + γ V(s') − V(s)].
- SARSA (on-policy TD): Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') − Q(s,a)].
- Q-learning (off-policy): Q(s,a) ← Q(s,a) + α [r + γ max_{a'} Q(s',a') − Q(s,a)].
Policy search and policy gradient methods
Directly parameterize a policy π_θ(a|s) and optimize J(θ) via gradients. Key result: policy gradient theorem:
∇θ J(θ) = E{πθ} [∇θ log πθ(a|s) Q^{πθ}(s,a)].
Practical estimators use advantage functions A(s,a) = Q(s,a) − V(s) to reduce variance. REINFORCE (Williams) is a Monte Carlo policy gradient estimator.
Actor-Critic methods
Combine policy (actor) and value function (critic). The critic estimates value or advantage; the actor is updated via policy gradient using the critic’s estimate. Examples: A2C/A3C, DDPG, SAC.
Function approximation and Deep RL
When state/action spaces are large, neural networks approximate value, policy, or model. Deep Q-Network (DQN) uses a neural net Q(s,a; θ) and stabilizes learning via experience replay and target networks. Continuous-action methods include DDPG, TD3, SAC, and policy gradient methods like PPO and TRPO.
Practical algorithmic recipes
Value Iteration (pseudocode) `` Initialize V(s) arbitrarily for all s loop until convergence: for each s in S: V(s) := maxa sum{s'} P(s'|s,a) [ R(s,a,s') + γ V(s') ] ``
Q-learning (tabular) — pseudocode `` Initialize Q(s,a) arbitrarily for each episode: initialize s while s not terminal: choose a from s using policy derived from Q (e.g., ε-greedy) take action a, observe r, s' Q(s,a) := Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ] s := s' ``
Simple tabular Q-learning example (Python / numpy) for FrozenLake-like environment ```python import numpy as np
states = nS, actions = nA, P and R assumed accessible via env
Q = np.zeros((nS, nA)) alpha = 0.1; gamma = 0.99; eps = 0.1 for ep in range(10000): s = env.reset() done = False while not done: if np.random.rand() < eps: a = np.random.randint(nA) else: a = np.argmax(Q[s]) s2, r, done, _ = env.step(a) Q[s,a] += alpha (r + gamma np.max(Q[s2]) - Q[s,a]) s = s2 ```
DQN core components and loss
- Q-network Q(s,a; θ)
- Target network Q(s,a; θ−)
- Replay buffer of transitions (s,a,r,s',done)
- Minibatch training with loss:
L(θ) = E{(s,a,r,s',d) ∼ D} [ (r + γ (1−d) max{a'} Q(s', a'; θ−) − Q(s,a; θ))^2 ]
High-level DQN loop (pseudocode) `` initialize Q-network weights θ and target weights θ− = θ initialize replay buffer D for each step: select action ε-greedy from Q(s;θ) execute action, store transition in D if ready: sample minibatch from D compute target y = r + γ (1−done) max_a' Q(s',a';θ−) update θ by minimizing (Q(s,a;θ) − y)^2 periodically update θ− ← θ ``
PPO (high-level)
- Use clipped surrogate objective to stabilize policy updates:
L^{CLIP}(θ) = E [ min( r(θ) Â , clip(r(θ), 1−ε, 1+ε) Â ) ] where r(θ) = πθ(a|s) / π{θ_old}(a|s).
Theoretical foundations
Convergence results
- Dynamic programming methods ...