A learning path ready to make your own.

Reinforcement learning

Reinforcement Learning — Concise Comprehensive Summary Reinforcement learning (RL) studies how agents sequentially choose actions to maximize cumulative reward in an environment. It combines ideas from dynamic programming, control, Monte Carlo methods, and (recently) deep learning to handle large state/action spaces and complex function approximation. Core problem formulation MDP: tuple (S, A, P, R, γ) with the Markov property; transitions P(s'|s,a) and reward R. Discount factor γ ∈ [0,1). Policy π(a|s): stochastic or deterministic mapping from states to actions. Objective: maximize expected return J(π)=E[G_0]. Return G_t = ∑_{k=0}^∞ γ^k R_{t+k+1}. Value functions: V^π(s) and Q^π(s,a). Bellman expectation and optimality equations relate values across steps; π*(s)=argmax_a Q*(s,a). Foundational algorithms Dynamic programming (requires known model): Value Iteration, Policy Iteration — converge for finite MDPs. Model-free methods: Monte Carlo: learn from full-episode returns (episodic). Temporal-Difference (TD): bootstrapped updates (e.g., TD(0)). SARSA: on-policy TD for Q(s,a). Q-learning: off-policy, uses max_a' Q(s',a'). Policy search / policy gradients: parameterize π_θ and optimize J(θ) using the policy gradient theorem: ∇_θ J = E[∇_θ log π_θ(a|s) Q^π(s,a)]. Use baselines/advantages to reduce variance (REINFORCE, A2C/A3C). Actor-Critic: actor (policy) + critic (value/advantage estimate) for lower-variance updates; includes DDPG, SAC, A2C/A3C. Deep RL / function approximation: neural networks approximate value/policy/model. Stabilization techniques: experience replay, target networks, double Q, clipped/trust-region updates (TRPO/PPO). Practical algorithmic recipes (high level) Tabular Q-learning: ε-greedy action selection, bootstrapped Q update Q←Q+α(r+γ max Q' − Q). DQN: Q-network + target network + replay buffer. Minimize MSE between Q(s,a;θ) and target y=r+γ max_a' Q(s',a';θ−). PPO: clipped surrogate objective L^{CLIP}(θ)=E[min(r(θ)Â, clip(r(θ),1−ε,1+ε)Â)] to stabilize policy updates. Theoretical foundations Convergence: DP converges in finite MDPs; TD and Q-learning have convergence proofs under tabular/linear settings with proper exploration and learning-rate schedules. Nonlinear function approximation (deep nets) lacks strong guarantees. Policy gradient theorem: gives unbiased gradient form enabling stochastic gradient ascent with variance reduction via baselines. Sample complexity / PAC: PAC-MDP results exist (e.g., R-MAX); modern work seeks tighter, provable sample-efficient algorithms for large spaces. On-policy vs off-policy: tradeoffs between stability (on-policy) and data efficiency (off-policy); off-policy + function approximation can be unstable (the "deadly triad"). Challenges and practical considerations Exploration vs exploitation: ε-greedy, Boltzmann, UCB, Thompson sampling, intrinsic motivation / curiosity, count-based methods; exploration is crucial in sparse/long-horizon tasks. Stability & divergence: bootstrapping + function approximation + off-policy learning can diverge; mitigations include target nets, replay, double Q, conservative updates, ensembles. Sample efficiency: model-based methods, offline RL, prioritized replay, demonstrations, meta-learning help reduce interactions. Reward design: mis-specified rewards cause specification gaming; use shaping carefully, IRL, or human-in-the-loop methods. Reproducibility: sensitivity to hyperparameters, seeds, compute; use multiple seeds, standardized benchmarks, and open implementations. Extensions and subfields Model-based RL: learn/use environment models for planning (Dyna, MPC, PETS); trade-off between sample efficiency and model bias. Offline/batch RL: learning from logged data (BCQ, CQL, conservative methods) — key challenge is distributional shift. Hierarchical RL: options, skills, temporal abstraction for scalability and exploration. Multi-agent RL: cooperation/competition, non-stationarity, equilibrium concepts. Inverse RL & imitation: learn reward/policy from demonstrations (MaxEnt IRL, GAIL). POMDPs: partial observability handled with belief states or recurrent policies. Applications and benchmarks Benchmarks: OpenAI Gym, ALE (Atari), DeepMind Control Suite, MuJoCo, ProcGen, D4RL for offline RL. Applications: games (Atari, Go, Chess, StarCraft, MuZero), robotics (sim2real, manipulation), recommender systems, autonomous vehicles, healthcare, finance, industrial control. State of the art & trends Achievements: human-level performance in many games, scalable self-play and planning (AlphaZero/MuZero), effective continuous control methods (SAC, PPO). Limitations: sample inefficiency, poor generalization, limited safety/interpretability guarantees. Current directions: offline RL, model-based latent dynamics, meta-RL, sim2real/domain randomization, RL for LLM alignment (RLHF), causal RL. Future directions & open problems Provably sample-efficient methods for large state/action spaces. Safe, verifiable, and interpretable RL with formal guarantees. Integration of causality and multi-modal representation learning for robust generalization. Continual/lifelong RL that avoids catastrophic forgetting and supports many tasks. Hardware/software co-design for real-time, large-scale RL deployment and societal governance considerations. Practical checklist & best practices Start with simple baselines and standardized environments. Use multiple random seeds, monitor learning curves, and log value estimates and action distributions. Stabilize training: target networks, replay, normalization, conservative updates, and careful hyperparameter tuning. For sim→real, use domain randomization and robust policies; for offline RL, be conservative about unseen actions. Selected references Sutton & Barto (2018) — Reinforcement Learning: An Introduction (core textbook). Watkins & Dayan (1992) — Q-learning. Williams (1992) — REINFORCE / policy gradient. Mnih et al. (2015) — DQN. Silver et al. (2016–2018) — AlphaGo/AlphaZero/MuZero. Schulman et al. (2015–2017) — TRPO, PPO. Lillicrap et al. (2015) — DDPG; Haarnoja et al. (2018) — SAC. Conclusion: RL is a mature yet rapidly evolving field blending theory and engineering. Major successes demonstrate potential, but key challenges remain in sample efficiency, safety, generalization, and provable guarantees. If you want, I can provide runnable code for a specific algorithm/environment, a deeper dive into any subtopic, or an annotated reading list.

Open full tree

Follow the trail that experts already trust.

Resources

8:40

Read deeper, connect wider, own the subject.

Deep Article

Reinforcement Learning — A Deep Dive

Table of contents

Overview
Historical context and milestones
Core concepts and problem formulation
Markov Decision Processes (MDPs)
Policies, returns, and objective
Value functions and Bellman equations
Optimality and solution concepts
Foundational algorithms
Dynamic programming (Value Iteration, Policy Iteration)
Model-free RL (Monte Carlo, Temporal-Difference, SARSA, Q-learning)
Policy search and policy gradient methods
Actor-Critic methods
Function approximation and Deep RL
Practical algorithmic recipes
Q-learning pseudocode and Python example (tabular)
DQN core loop and loss
PPO high-level update
Theoretical foundations
Convergence results (DP, TD, Q-learning)
Policy gradient theorem
Sample complexity and PAC bounds (high level)
Off-policy vs on-policy distinctions
Challenges, pitfalls, and practical considerations
Exploration-exploitation tradeoff
Stability, catastrophic forgetting, and divergence
Sample efficiency and data reuse
Reward design and specification gaming
Reproducibility and evaluation
Extensions and subfields
Model-based RL
Offline (batch) RL
Hierarchical and options frameworks
Multi-agent RL
Inverse RL and imitation learning
Partial observability and POMDPs
Applications and benchmarks
Games (Atari, Go, Chess)
Robotics and control (MuJoCo, real robots)
Recommendation systems and ads
Autonomous vehicles
Healthcare, finance, and industrial control
Current state of the art and trends
Deep RL successes and limitations
Offline RL, meta-RL, and sim2real
RL + language models (RLHF)
Future directions and open problems
Safety, interpretability, and verification
Sample-efficient and provably efficient methods
Causal and generalizable RL
Hardware/software co-design
Practical checklist and best practices
Selected references and further reading

Overview

Reinforcement learning (RL) is a subfield of machine learning concerned with how agents ought to take actions in an environment to maximize cumulative reward. Unlike supervised learning, RL focuses on sequential decision making where actions influence future states and rewards. RL blends dynamic programming, control theory, Monte Carlo methods, and, in modern times, deep learning to handle high-dimensional function approximation.

Historical context and milestones

1950s–1970s: Early ideas in optimal control and dynamic programming (Bellman).
1980s: Temporal-difference learning (Sutton), credit assignment, and initial theoretical work.
1992: Q-learning (Watkins & Dayan), a breakthrough in model-free off-policy learning.
1995–2000s: Policy gradient methods (Williams' REINFORCE), actor-critic methods, integration with function approximation.
2013–2015: Deep Q-Network (DQN) (Mnih et al.), combining deep neural networks with Q-learning for Atari — ushered in Deep RL era.
2016–2018: AlphaGo/AlphaZero (Silver et al.) and continuous control breakthroughs (DDPG, TRPO, PPO).
2018–present: Widespread application, expansion into offline RL, meta-RL, multi-agent RL, and RL for language models (e.g., RLHF).

Core concepts and problem formulation

Markov Decision Processes (MDPs)

An MDP is the standard mathematical framework for RL. It is a tuple (S, A, P, R, γ) where:

S: set of states (finite or infinite).
A: set of actions.
P(s' | s, a): transition probability kernel.
R(s, a) or R(s, a, s'): reward function (expected immediate reward).
γ ∈ [0,1): discount factor.

The Markov property: the next state distribution depends only on the current state and action, not on the history.

Policies, returns, and objective

A policy π(a | s) defines a (possibly stochastic) action distribution given a state. The return is the cumulative discounted reward: Gt = ∑{k=0}^∞ γ^k R_{t+k+1}.

The RL objective is to find a policy π* that maximizes the expected return from a start state (or under a state distribution): J(π) = E{π}[G0].

Value functions and Bellman equations

State-value: V^π(s) = Eπ[Gt | S_t = s]
Action-value (Q-function): Q^π(s,a) = Eπ[Gt | St = s, At = a]

Bellman expectation equations: V^π(s) = E{a∼π(.|s), s'∼P}[R(s,a) + γ V^π(s')] Q^π(s,a) = E{s'∼P}[R(s,a) + γ E_{a'∼π(.|s')}[Q^π(s',a')]]

Bellman optimality equations: V(s) = maxa E{s'∼P}[R(s,a) + γ V(s')] Q(s,a) = E{s'∼P}[R(s,a) + γ max{a'} Q(s',a')]

Optimal policy: π(s) = argmax_a Q(s,a).

Foundational algorithms

Dynamic programming (requires known model)

Value Iteration: iterate Bellman optimality update.
Policy Iteration: alternates policy evaluation and policy improvement.

Both converge to optimal value/policy in finite MDPs.

Model-free RL

Monte Carlo (MC): learn value estimates from complete returns (episodic).
Temporal-Difference (TD(0)): bootstrapping update V(s) ← V(s) + α [r + γ V(s') − V(s)].
SARSA (on-policy TD): Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') − Q(s,a)].
Q-learning (off-policy): Q(s,a) ← Q(s,a) + α [r + γ max_{a'} Q(s',a') − Q(s,a)].

Policy search and policy gradient methods

Directly parameterize a policy π_θ(a|s) and optimize J(θ) via gradients. Key result: policy gradient theorem:

∇θ J(θ) = E{πθ} [∇θ log πθ(a|s) Q^{πθ}(s,a)].

Practical estimators use advantage functions A(s,a) = Q(s,a) − V(s) to reduce variance. REINFORCE (Williams) is a Monte Carlo policy gradient estimator.

Actor-Critic methods

Combine policy (actor) and value function (critic). The critic estimates value or advantage; the actor is updated via policy gradient using the critic’s estimate. Examples: A2C/A3C, DDPG, SAC.

Function approximation and Deep RL

When state/action spaces are large, neural networks approximate value, policy, or model. Deep Q-Network (DQN) uses a neural net Q(s,a; θ) and stabilizes learning via experience replay and target networks. Continuous-action methods include DDPG, TD3, SAC, and policy gradient methods like PPO and TRPO.

Practical algorithmic recipes

Value Iteration (pseudocode) `` Initialize V(s) arbitrarily for all s loop until convergence: for each s in S: V(s) := maxa sum{s'} P(s'|s,a) [ R(s,a,s') + γ V(s') ] ``

Q-learning (tabular) — pseudocode `` Initialize Q(s,a) arbitrarily for each episode: initialize s while s not terminal: choose a from s using policy derived from Q (e.g., ε-greedy) take action a, observe r, s' Q(s,a) := Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ] s := s' ``

Simple tabular Q-learning example (Python / numpy) for FrozenLake-like environment ```python import numpy as np

states = nS, actions = nA, P and R assumed accessible via env

Q = np.zeros((nS, nA)) alpha = 0.1; gamma = 0.99; eps = 0.1 for ep in range(10000): s = env.reset() done = False while not done: if np.random.rand() < eps: a = np.random.randint(nA) else: a = np.argmax(Q[s]) s2, r, done, _ = env.step(a) Q[s,a] += alpha (r + gamma np.max(Q[s2]) - Q[s,a]) s = s2 ```

DQN core components and loss

Q-network Q(s,a; θ)
Target network Q(s,a; θ−)
Replay buffer of transitions (s,a,r,s',done)
Minibatch training with loss:

L(θ) = E{(s,a,r,s',d) ∼ D} [ (r + γ (1−d) max{a'} Q(s', a'; θ−) − Q(s,a; θ))^2 ]

High-level DQN loop (pseudocode) `` initialize Q-network weights θ and target weights θ− = θ initialize replay buffer D for each step: select action ε-greedy from Q(s;θ) execute action, store transition in D if ready: sample minibatch from D compute target y = r + γ (1−done) max_a' Q(s',a';θ−) update θ by minimizing (Q(s,a;θ) − y)^2 periodically update θ− ← θ ``

PPO (high-level)

Use clipped surrogate objective to stabilize policy updates:

L^{CLIP}(θ) = E [ min( r(θ) Â , clip(r(θ), 1−ε, 1+ε) Â ) ] where r(θ) = πθ(a|s) / π{θ_old}(a|s).

Theoretical foundations

Convergence results

Dynamic programming methods ...

Ready to see the full tree?

Clone the preview to open the complete learning structure, practice tools, and generated study materials.

Reinforcement learning

AI Learns to Walk (deep reinforcement learning)

AI Learns Red Light Green Light (reinforcement learning)

RL Course by David Silver - Lecture 1: Introduction to Reinforcement Learning

The FASTEST introduction to Reinforcement Learning on the internet