A learning path ready to make your own.

Reinforcement learning

Reinforcement Learning — Concise Comprehensive Summary Reinforcement learning (RL) studies how agents sequentially choose actions to maximize cumulative reward in an environment. It combines ideas from dynamic programming, control, Monte Carlo methods, and (recently) deep learning to handle large state/action spaces and complex function approximation. Core problem formulation MDP: tuple (S, A, P, R, γ) with the Markov property; transitions P(s'|s,a) and reward R. Discount factor γ ∈ [0,1). Policy π(a|s): stochastic or deterministic mapping from states to actions. Objective: maximize expected return J(π)=E[G_0]. Return G_t = ∑_{k=0}^∞ γ^k R_{t+k+1}. Value functions: V^π(s) and Q^π(s,a). Bellman expectation and optimality equations relate values across steps; π*(s)=argmax_a Q*(s,a). Foundational algorithms Dynamic programming (requires known model): Value Iteration, Policy Iteration — converge for finite MDPs. Model-free methods: Monte Carlo: learn from full-episode returns (episodic). Temporal-Difference (TD): bootstrapped updates (e.g., TD(0)). SARSA: on-policy TD for Q(s,a). Q-learning: off-policy, uses max_a' Q(s',a'). Policy search / policy gradients: parameterize π_θ and optimize J(θ) using the policy gradient theorem: ∇_θ J = E[∇_θ log π_θ(a|s) Q^π(s,a)]. Use baselines/advantages to reduce variance (REINFORCE, A2C/A3C). Actor-Critic: actor (policy) + critic (value/advantage estimate) for lower-variance updates; includes DDPG, SAC, A2C/A3C. Deep RL / function approximation: neural networks approximate value/policy/model. Stabilization techniques: experience replay, target networks, double Q, clipped/trust-region updates (TRPO/PPO). Practical algorithmic recipes (high level) Tabular Q-learning: ε-greedy action selection, bootstrapped Q update Q←Q+α(r+γ max Q' − Q). DQN: Q-network + target network + replay buffer. Minimize MSE between Q(s,a;θ) and target y=r+γ max_a' Q(s',a';θ−). PPO: clipped surrogate objective L^{CLIP}(θ)=E[min(r(θ)Â, clip(r(θ),1−ε,1+ε)Â)] to stabilize policy updates. Theoretical foundations Convergence: DP converges in finite MDPs; TD and Q-learning have convergence proofs under tabular/linear settings with proper exploration and learning-rate schedules. Nonlinear function approximation (deep nets) lacks strong guarantees. Policy gradient theorem: gives unbiased gradient form enabling stochastic gradient ascent with variance reduction via baselines. Sample complexity / PAC: PAC-MDP results exist (e.g., R-MAX); modern work seeks tighter, provable sample-efficient algorithms for large spaces. On-policy vs off-policy: tradeoffs between stability (on-policy) and data efficiency (off-policy); off-policy + function approximation can be unstable (the "deadly triad"). Challenges and practical considerations Exploration vs exploitation: ε-greedy, Boltzmann, UCB, Thompson sampling, intrinsic motivation / curiosity, count-based methods; exploration is crucial in sparse/long-horizon tasks. Stability & divergence: bootstrapping + function approximation + off-policy learning can diverge; mitigations include target nets, replay, double Q, conservative updates, ensembles. Sample efficiency: model-based methods, offline RL, prioritized replay, demonstrations, meta-learning help reduce interactions. Reward design: mis-specified rewards cause specification gaming; use shaping carefully, IRL, or human-in-the-loop methods. Reproducibility: sensitivity to hyperparameters, seeds, compute; use multiple seeds, standardized benchmarks, and open implementations. Extensions and subfields Model-based RL: learn/use environment models for planning (Dyna, MPC, PETS); trade-off between sample efficiency and model bias. Offline/batch RL: learning from logged data (BCQ, CQL, conservative methods) — key challenge is distributional shift. Hierarchical RL: options, skills, temporal abstraction for scalability and exploration. Multi-agent RL: cooperation/competition, non-stationarity, equilibrium concepts. Inverse RL & imitation: learn reward/policy from demonstrations (MaxEnt IRL, GAIL). POMDPs: partial observability handled with belief states or recurrent policies. Applications and benchmarks Benchmarks: OpenAI Gym, ALE (Atari), DeepMind Control Suite, MuJoCo, ProcGen, D4RL for offline RL. Applications: games (Atari, Go, Chess, StarCraft, MuZero), robotics (sim2real, manipulation), recommender systems, autonomous vehicles, healthcare, finance, industrial control. State of the art & trends Achievements: human-level performance in many games, scalable self-play and planning (AlphaZero/MuZero), effective continuous control methods (SAC, PPO). Limitations: sample inefficiency, poor generalization, limited safety/interpretability guarantees. Current directions: offline RL, model-based latent dynamics, meta-RL, sim2real/domain randomization, RL for LLM alignment (RLHF), causal RL. Future directions & open problems Provably sample-efficient methods for large state/action spaces. Safe, verifiable, and interpretable RL with formal guarantees. Integration of causality and multi-modal representation learning for robust generalization. Continual/lifelong RL that avoids catastrophic forgetting and supports many tasks. Hardware/software co-design for real-time, large-scale RL deployment and societal governance considerations. Practical checklist & best practices Start with simple baselines and standardized environments. Use multiple random seeds, monitor learning curves, and log value estimates and action distributions. Stabilize training: target networks, replay, normalization, conservative updates, and careful hyperparameter tuning. For sim→real, use domain randomization and robust policies; for offline RL, be conservative about unseen actions. Selected references Sutton & Barto (2018) — Reinforcement Learning: An Introduction (core textbook). Watkins & Dayan (1992) — Q-learning. Williams (1992) — REINFORCE / policy gradient. Mnih et al. (2015) — DQN. Silver et al. (2016–2018) — AlphaGo/AlphaZero/MuZero. Schulman et al. (2015–2017) — TRPO, PPO. Lillicrap et al. (2015) — DDPG; Haarnoja et al. (2018) — SAC. Conclusion: RL is a mature yet rapidly evolving field blending theory and engineering. Major successes demonstrate potential, but key challenges remain in sample efficiency, safety, generalization, and provable guarantees. If you want, I can provide runnable code for a specific algorithm/environment, a deeper dive into any subtopic, or an annotated reading list.

Let the lesson walk with you.

Podcast

Reinforcement learning podcast

0:00-3:48

Follow the trail that experts already trust.

Resources

Turn quick sparks into lasting recall.

Flashcards

Reinforcement learning flashcards

17 cards

Question

Click to flip
Answer

Prove the idea before it slips away.

Quizzes

Reinforcement learning quiz

13 questions

Which of the following correctly lists the components of a Markov Decision Process (MDP)?

Read deeper, connect wider, own the subject.

Deep Article

Reinforcement Learning — A Deep Dive

Table of contents

  • Overview
  • Historical context and milestones
  • Core concepts and problem formulation
  • Markov Decision Processes (MDPs)
  • Policies, returns, and objective
  • Value functions and Bellman equations
  • Optimality and solution concepts
  • Foundational algorithms
  • Dynamic programming (Value Iteration, Policy Iteration)
  • Model-free RL (Monte Carlo, Temporal-Difference, SARSA, Q-learning)
  • Policy search and policy gradient methods
  • Actor-Critic methods
  • Function approximation and Deep RL
  • Practical algorithmic recipes
  • Q-learning pseudocode and Python example (tabular)
  • DQN core loop and loss
  • PPO high-level update
  • Theoretical foundations
  • Convergence results (DP, TD, Q-learning)
  • Policy gradient theorem
  • Sample complexity and PAC bounds (high level)
  • Off-policy vs on-policy distinctions
  • Challenges, pitfalls, and practical considerations
  • Exploration-exploitation tradeoff
  • Stability, catastrophic forgetting, and divergence
  • Sample efficiency and data reuse
  • Reward design and specification gaming
  • Reproducibility and evaluation
  • Extensions and subfields
  • Model-based RL
  • Offline (batch) RL
  • Hierarchical and options frameworks
  • Multi-agent RL
  • Inverse RL and imitation learning
  • Partial observability and POMDPs
  • Applications and benchmarks
  • Games (Atari, Go, Chess)
  • Robotics and control (MuJoCo, real robots)
  • Recommendation systems and ads
  • Autonomous vehicles
  • Healthcare, finance, and industrial control
  • Current state of the art and trends
  • Deep RL successes and limitations
  • Offline RL, meta-RL, and sim2real
  • RL + language models (RLHF)
  • Future directions and open problems
  • Safety, interpretability, and verification
  • Sample-efficient and provably efficient methods
  • Causal and generalizable RL
  • Hardware/software co-design
  • Practical checklist and best practices
  • Selected references and further reading

Overview

Reinforcement learning (RL) is a subfield of machine learning concerned with how agents ought to take actions in an environment to maximize cumulative reward. Unlike supervised learning, RL focuses on sequential decision making where actions influence future states and rewards. RL blends dynamic programming, control theory, Monte Carlo methods, and, in modern times, deep learning to handle high-dimensional function approximation.

Historical context and milestones

  • 1950s–1970s: Early ideas in optimal control and dynamic programming (Bellman).
  • 1980s: Temporal-difference learning (Sutton), credit assignment, and initial theoretical work.
  • 1992: Q-learning (Watkins & Dayan), a breakthrough in model-free off-policy learning.
  • 1995–2000s: Policy gradient methods (Williams' REINFORCE), actor-critic methods, integration with function approximation.
  • 2013–2015: Deep Q-Network (DQN) (Mnih et al.), combining deep neural networks with Q-learning for Atari — ushered in Deep RL era.
  • 2016–2018: AlphaGo/AlphaZero (Silver et al.) and continuous control breakthroughs (DDPG, TRPO, PPO).
  • 2018–present: Widespread application, expansion into offline RL, meta-RL, multi-agent RL, and RL for language models (e.g., RLHF).

Core concepts and problem formulation

Markov Decision Processes (MDPs)

An MDP is the standard mathematical framework for RL. It is a tuple (S, A, P, R, γ) where:

  • S: set of states (finite or infinite).
  • A: set of actions.
  • P(s' | s, a): transition probability kernel.
  • R(s, a) or R(s, a, s'): reward function (expected immediate reward).
  • γ ∈ [0,1): discount factor.

The Markov property: the next state distribution depends only on the current state and action, not on the history.

Policies, returns, and objective

A policy π(a | s) defines a (possibly stochastic) action distribution given a state. The return is the cumulative discounted reward: Gt = ∑{k=0}^∞ γ^k R_{t+k+1}.

The RL objective is to find a policy π* that maximizes the expected return from a start state (or under a state distribution): J(π) = E{π}[G0].

Value functions and Bellman equations

  • State-value: V^π(s) = Eπ[Gt | S_t = s]
  • Action-value (Q-function): Q^π(s,a) = Eπ[Gt | St = s, At = a]

Bellman expectation equations: V^π(s) = E{a∼π(.|s), s'∼P}[R(s,a) + γ V^π(s')] Q^π(s,a) = E{s'∼P}[R(s,a) + γ E_{a'∼π(.|s')}[Q^π(s',a')]]

Bellman optimality equations: V(s) = maxa E{s'∼P}[R(s,a) + γ V(s')] Q(s,a) = E{s'∼P}[R(s,a) + γ max{a'} Q(s',a')]

Optimal policy: π(s) = argmax_a Q(s,a).

Foundational algorithms

Dynamic programming (requires known model)

  • Value Iteration: iterate Bellman optimality update.
  • Policy Iteration: alternates policy evaluation and policy improvement.

Both converge to optimal value/policy in finite MDPs.

Model-free RL

  • Monte Carlo (MC): learn value estimates from complete returns (episodic).
  • Temporal-Difference (TD(0)): bootstrapping update V(s) ← V(s) + α [r + γ V(s') − V(s)].
  • SARSA (on-policy TD): Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') − Q(s,a)].
  • Q-learning (off-policy): Q(s,a) ← Q(s,a) + α [r + γ max_{a'} Q(s',a') − Q(s,a)].

Policy search and policy gradient methods

Directly parameterize a policy π_θ(a|s) and optimize J(θ) via gradients. Key result: policy gradient theorem:

θ J(θ) = Eθ} [∇θ log πθ(a|s) Q^{πθ}(s,a)].

Practical estimators use advantage functions A(s,a) = Q(s,a) − V(s) to reduce variance. REINFORCE (Williams) is a Monte Carlo policy gradient estimator.

Actor-Critic methods

Combine policy (actor) and value function (critic). The critic estimates value or advantage; the actor is updated via policy gradient using the critic’s estimate. Examples: A2C/A3C, DDPG, SAC.

Function approximation and Deep RL

When state/action spaces are large, neural networks approximate value, policy, or model. Deep Q-Network (DQN) uses a neural net Q(s,a; θ) and stabilizes learning via experience replay and target networks. Continuous-action methods include DDPG, TD3, SAC, and policy gradient methods like PPO and TRPO.

Practical algorithmic recipes

Value Iteration (pseudocode) `` Initialize V(s) arbitrarily for all s loop until convergence: for each s in S: V(s) := maxa sum{s'} P(s'|s,a) [ R(s,a,s') + γ V(s') ] ``

Q-learning (tabular) — pseudocode `` Initialize Q(s,a) arbitrarily for each episode: initialize s while s not terminal: choose a from s using policy derived from Q (e.g., ε-greedy) take action a, observe r, s' Q(s,a) := Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ] s := s' ``

Simple tabular Q-learning example (Python / numpy) for FrozenLake-like environment ```python import numpy as np

states = nS, actions = nA, P and R assumed accessible via env

Q = np.zeros((nS, nA)) alpha = 0.1; gamma = 0.99; eps = 0.1 for ep in range(10000): s = env.reset() done = False while not done: if np.random.rand() < eps: a = np.random.randint(nA) else: a = np.argmax(Q[s]) s2, r, done, _ = env.step(a) Q[s,a] += alpha (r + gamma np.max(Q[s2]) - Q[s,a]) s = s2 ```

DQN core components and loss

  • Q-network Q(s,a; θ)
  • Target network Q(s,a; θ−)
  • Replay buffer of transitions (s,a,r,s',done)
  • Minibatch training with loss:

L(θ) = E{(s,a,r,s',d) ∼ D} [ (r + γ (1−d) max{a'} Q(s', a'; θ−) − Q(s,a; θ))^2 ]

High-level DQN loop (pseudocode) `` initialize Q-network weights θ and target weights θ− = θ initialize replay buffer D for each step: select action ε-greedy from Q(s;θ) execute action, store transition in D if ready: sample minibatch from D compute target y = r + γ (1−done) max_a' Q(s',a';θ−) update θ by minimizing (Q(s,a;θ) − y)^2 periodically update θ− ← θ ``

PPO (high-level)

  • Use clipped surrogate objective to stabilize policy updates:

L^{CLIP}(θ) = E [ min( r(θ) Â , clip(r(θ), 1−ε, 1+ε) Â ) ] where r(θ) = πθ(a|s) / π{θ_old}(a|s).

Theoretical foundations

Convergence results

  • Dynamic programming methods ...

Ready to see the full tree?

Clone the preview to open the complete learning structure, practice tools, and generated study materials.