A learning path ready to make your own.

What is reinforcement learning?

Reinforcement Learning — Concise Summary Reinforcement Learning (RL) is a paradigm for learning sequential decision-making by interaction: an agent observes states, takes actions, receives scalar rewards, and learns a policy to maximize expected cumulative (discounted) reward. RL integrates ideas from dynamic programming, control, statistics and trial‑and‑error learning. Problem formulation (MDP) Markov Decision Process (MDP): tuple (S, A, P, R, γ) where S = states, A = actions, P(s'|s,a) = transitions, R = reward function, γ ∈ [0,1) = discount factor. Objective: find policy π(a|s) maximizing expected discounted return G_t = Σ_{k=0}^∞ γ^k r_{t+k+1}. Value functions: V^π(s) = Eπ[G_t | S_t=s], Q^π(s,a) = Eπ[G_t | S_t=s, A_t=a]. Core concepts States, actions, rewards, policies: policy π can be deterministic or stochastic. Return & discounting: tradeoff between immediate and future rewards via γ. Bellman equations: recursive relations for V^π, Q^π and their optimal counterparts V*, Q*; solving them underpins many algorithms. Algorithmic families Dynamic Programming: model-based iterative policy evaluation/improvement (value iteration, policy iteration). Monte Carlo: learn from complete episodic returns—no bootstrapping. Temporal-Difference (TD) learning: bootstrapped one-step or multi-step updates (TD(0), TD(λ)). Q-Learning & SARSA: off-policy (Q-learning) and on-policy (SARSA) action-value methods for model-free control. Policy Gradients: directly optimize parameterized policies (REINFORCE, baseline variance reduction). Actor–Critic: actor updates policy using critic's value estimates (A2C/A3C, PPO, DDPG, SAC). Deep Reinforcement Learning Use neural networks to approximate policies/values, enabling high-dimensional inputs (e.g., pixels). Key milestones: DQN (experience replay, target networks), Double/Dueling DQN, Prioritized Replay, Rainbow. Policy optimization: A3C/A2C, PPO (stable gradient updates). Continuous control: DDPG, TD3, SAC (entropy-regularized). Model-based & planning: learned dynamics (MBPO), latent planners (MuZero). Main challenges: instability (bootstrapping + function approximation + off-policy), sample inefficiency, high variance. Model-free vs. Model-based; On-policy vs. Off-policy Model-free: learn policy/value directly (simpler, often sample-inefficient). Model-based: learn dynamics model and plan or imagine trajectories (more sample-efficient but sensitive to model bias). On-policy: uses data from current policy (PPO, A2C)—stable but less reusable. Off-policy: can reuse past data via replay buffers (DQN, DDPG, SAC)—more sample-efficient but can be unstable). Exploration vs. Exploitation Tradeoff strategies: ε-greedy, softmax, action noise (continuous), entropy regularization, intrinsic curiosity, Thompson sampling. Sparse rewards amplify exploration difficulty; reward shaping and curricula are common remedies. Stability issues — the "deadly triad" Combining (1) function approximation, (2) bootstrapping, and (3) off-policy updates can cause divergence. Practical mitigations: replay buffers, target networks, careful optimization and normalization. Extensions Multi-Agent RL: cooperation/competition with nonstationarity and scalability challenges. Hierarchical RL: temporal abstraction / options to improve exploration and reuse skills. Inverse RL & Imitation: infer rewards from demonstrations (GAIL, behavioral cloning) and bootstrap learning. Practical engineering practices Replay buffers, target networks, observation/reward normalization, gradient clipping. Reward shaping, curriculum learning, domain randomization (sim-to-real). Evaluation: held-out environments, deterministic evaluation, multiple seeds, metrics beyond raw reward. Popular benchmarks & libraries: Atari (ALE), MuJoCo/DMControl, Procgen, OpenAI Gym/Gymnasium, Stable Baselines3, RLlib, Acme, CleanRL. Applications Games (Atari, Go, Chess, StarCraft, Dota), robotics (locomotion, manipulation), autonomous driving, recommender systems, finance, healthcare, energy systems, RLHF for language models. Current research directions Scalable algorithms (PPO, SAC), model-based hybrids (MBPO, MuZero), offline/batch RL, sim-to-real transfer, robustness and generalization, safety and interpretability, combining RL with representation learning and causality. Open challenges Sample efficiency, safety/reliability, interpretability, transfer and continual learning, exploration in sparse/high-dimensional tasks, theoretical guarantees for function approximation and nonstationary settings, human-aware and fair RL. Examples & evaluation Common illustrative implementations: tabular Q-learning for small discrete MDPs and DQN sketches for discrete control (policy/value nets, replay buffer, target network). Evaluation metrics: cumulative return, sample complexity, stability across seeds, asymptotic performance, generalization, safety/risk measures. Recommended resources Sutton & Barto — "Reinforcement Learning: An Introduction". Mnih et al. (DQN), Schulman et al. (PPO), Haarnoja et al. (SAC), Lillicrap et al. (DDPG). Practical guides: OpenAI Spinning Up, David Silver’s RL course; libraries listed above. Conclusion: RL is a powerful, general framework for sequential decision-making that has advanced from tabular DP to deep, high-dimensional agents. Despite major successes, core issues—sample efficiency, stability, safety, and generalization—remain active research areas. If you want, I can provide a full working DQN/PPO implementation for a specific Gym environment, walk through solving a concrete toy problem step‑by‑step, or summarize recent papers in a subdomain.

Let the lesson walk with you.

Podcast

What is reinforcement learning? podcast

0:00-2:27

Follow the trail that experts already trust.

Resources

Turn quick sparks into lasting recall.

Flashcards

What is reinforcement learning? flashcards

18 cards

Question

Click to flip
Answer

Prove the idea before it slips away.

Quizzes

What is reinforcement learning? quiz

15 questions

What is the primary objective of reinforcement learning (RL)?

Read deeper, connect wider, own the subject.

Deep Article

What is Reinforcement Learning?

Reinforcement Learning (RL) is a subfield of machine learning concerned with how agents ought to take actions in an environment so as to maximize cumulative reward. Unlike supervised learning, which learns from labeled examples, RL learns from interaction: an agent observes states, takes actions, receives scalar rewards, and updates behavior to improve long-term performance. RL combines ideas from dynamic programming, optimal control, psychology (trial-and-error learning), and statistics.

This article is a comprehensive deep dive into reinforcement learning: history, core concepts, mathematical foundations, algorithms, practical considerations, applications, current state-of-the-art, open challenges, and future directions. Code snippets and pseudocode illustrate implementation patterns.


Table of contents

  • High-level intuition
  • Historical background
  • Problem formulation: Markov Decision Processes
  • Core concepts and components
  • States, actions, rewards, policies
  • Return, discounting
  • Value functions
  • Bellman equations
  • Optimality
  • Classic algorithmic families
  • Dynamic programming
  • Monte Carlo methods
  • Temporal Difference (TD) learning
  • Q-learning and SARSA
  • Policy gradient methods
  • Actor–critic methods
  • Deep Reinforcement Learning (Deep RL)
  • DQN, improvements, and stability tricks
  • Continuous control: DDPG, TD3, SAC
  • Policy optimization: TRPO, PPO
  • Model-free vs model-based RL
  • On-policy vs off-policy; sample complexity
  • Exploration vs exploitation
  • Function approximation and the "deadly triad"
  • Multi-agent, hierarchical, inverse RL
  • Practical engineering practices
  • Replay buffers, target networks, normalization
  • Reward shaping and curriculum learning
  • Sim-to-real transfer and domain randomization
  • Evaluation and benchmarks
  • Applications and examples
  • Games, robotics, recommender systems, healthcare, finance, and more
  • Current state of research
  • Open challenges and future directions
  • Example implementations (Q-learning; DQN sketch)
  • Recommended resources and further reading

High-level intuition

Imagine teaching a robot to walk. You cannot provide direct supervision mapping sensor inputs to optimal torques. Instead, you let the robot try actions; when it falls, give a negative reward; when it walks, give a positive reward. Over many trials, the robot learns which actions in which states lead to better cumulative outcomes.

Reinforcement learning formalizes this process: agents learn to maximize expected cumulative reward—often under uncertainty, partial observability, and delayed consequences.


Historical background

  • Early roots: optimal control and dynamic programming (Richard Bellman, 1950s).
  • Behavioral psychology: trial-and-error learning, reinforcement in animal learning.
  • Temporal difference learning: algorithms like TD(λ) (Sutton, 1988).
  • Q-learning: off-policy algorithm by Watkins (1989).
  • Policy gradient methods: REINFORCE (Williams, 1992).
  • The "deep RL" revolution: Deep Q-Networks (DQN) by Mnih et al. (2013/2015) used deep neural nets to approximate value functions, achieving human-level play on many Atari games.
  • A burst of progress since 2013: actor-critic deep methods (A3C, PPO), continuous control algorithms (DDPG, SAC, TD3), large-scale RL for games (AlphaGo, AlphaZero, MuZero) and simulation-to-real robotics advances.

Primary influential textbook: Sutton & Barto, "Reinforcement Learning: An Introduction".


Problem formulation: Markov Decision Processes (MDPs)

Reinforcement learning tasks are commonly modeled as Markov Decision Processes (MDPs):

An MDP is a tuple (S, A, P, R, γ) where:

  • S: set of states
  • A: set of actions (can be discrete or continuous)
  • P(s' | s, a): transition probability—probability of next state s' given current state s and action a
  • R(s, a, s'): reward function (or R(s, a)) giving scalar reward
  • γ ∈ [0,1): discount factor for future rewards

Objective: find a policy π(a | s) that maximizes expected discounted return: Gt = sum{k=0}^∞ γ^k R_{t+k+1}

Expected return from state s following policy π: V^π(s) = Eπ[ Gt | St = s ]

Action-value function: Q^π(s, a) = Eπ[ Gt | St = s, A_t = a ]

The problem of RL is solving for an optimal policy π that maximizes V^π(s) for all s; corresponding optimal value functions V(s) and Q*(s,a).


Core concepts and components

States, actions, rewards, and policies

  • State (s): representation of an environment at a time.
  • Action (a): choice available to agent.
  • Reward (r): scalar feedback; the only learning signal.
  • Policy (π): mapping from states to actions (deterministic or stochastic).

Return and discounting

  • Return Gt = ∑{k=0}^∞ γ^k r_{t+k+1}.
  • Discount factor γ balances present vs future rewards. γ close to 1 values long-term reward.

Value functions

  • State-value V^π(s): expected return from state s under policy π.
  • Action-value Q^π(s, a): expected return after taking action a in state s under π.

Bellman expectation equations

The Bellman expectation equations provide recursive decompositions: V^π(s) = E{a∼π(s), s'∼P}[ r(s,a,s') + γ V^π(s') ] Q^π(s,a) = E{s'∼P}[ r(s,a,s') + γ E_{a'∼π(s')}[ Q^π(s',a') ] ]

Bellman optimality equations

For optimal V and Q: V(s) = maxa E{s'∼P}[ r(s,a,s') + γ V(s') ] Q(s,a) = E{s'∼P}[ r(s,a,s') + γ max{a'} Q(s',a') ]

Solving these equations is central to many RL algorithms.


Classic algorithmic families

Dynamic Programming (DP)

  • Requires full model (P and R) and solves Bellman equations via iterative updates: policy evaluation (compute V^π) and policy improvement (make policy greedy wrt V).
  • Examples: value iteration, policy iteration.
  • Converges with guarantees for finite MDPs, but impractical for large or unknown models.

Monte Carlo (MC) methods

  • Learn from episodic experience, estimating returns by averaging actual returns following visits to states/actions.
  • No bootstrapping: updates use complete returns.
  • Useful when model unknown and episodes finite.

Temporal Difference (TD) learning

  • Combines ideas from DP and MC: updates using bootstrapped estimate (one-step lookahead).
  • TD(0) update: V(s) ← V(s) + α [r + γ V(s') − V(s)]
  • TD(λ): uses eligibility traces to trade bias/variance via parameter λ ∈ [0,1].

Q-Learning and SARSA

  • Q-Learning (Watkins, 1989): off-policy action-value method that learns Q* without model.

Q(s,a) ← Q(s,a) + α [r + γ max_{a'} Q(s',a') − Q(s,a)] Converges under conditions (finite state/action, sufficient exploration, decaying α).

  • SARSA: on-policy counterpart.

Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') − Q(s,a)] where a' is actually taken.

Policy Gradient methods

  • Directly parameterize policy π_θ(a|s) and increase expected return J(θ) by gradient ascent.
  • REINFORCE (Monte Carlo policy gradient): ∇θ J(θ) ≈ E[∑t ∇θ log πθ(at|st) G_t]
  • Use baseline functions to reduce variance (e.g., subtract V(s_t)).
  • Better for continuous action spaces and stochastic policies.

Actor–Critic methods

  • Combine value function (critic) with policy (actor).
  • Critic estimates V^π or Q^π; actor updates policy parameters using critic's signals (usually the advantage).
  • Can be on-policy (A2C, A3C, PPO) or off-policy (DDPG, SAC with critics).

Deep Reinforcement Learning (Deep RL)

Deep RL uses neural networks as function approximators for policies and/or value functions. This enables scaling to high-dimensional inputs (images, LIDAR).

Key milestones and algorithms:

  • DQN (Mnih et al., 2015): used convolutional networks to learn Q-values from pixels on Atari games.

Stability tricks used: experience replay, target networks, reward clipping, gradient clipping.

  • Improvements to DQN: Double DQN, Dueling DQN, Prioritized Experience Replay, Rainbow (combines multiple improvements).
  • Policy optimization: A3C/A2C (asynchronous/advantage actor-critic), PPO (Proximal Policy Optimization) for stable policy updates.
  • Continuous control algorithms: DDPG (Deterministic Policy Gradient), TD3 (Twin Delayed DDPG), SAC (Soft Actor-Critic) which adds entropy regularization for exploration and stability.
  • Model-based deep RL: learning dynamics models and planning (e.g., MBPO).
  • Learned planning methods: MuZero learns a model for planning in latent space.

Challenges in deep RL: instability due to bootstrapping + function approximation + off-policy updates (the “deadly triad”), sample inefficiency, and high variance gradients.


Model-free vs model-based RL

  • Model-free RL: learns policy and/or value without explicit dynamics model (e.g., Q-learning, policy gradient). Simpler but often sample-inefficient.
  • Model-based RL: learns a model P̂(s'|s,a) and uses planning or imagination to improve policy. Can be more sample-efficient but sensitive to model errors. Hybrid approaches mix both (e.g., Dyna, MBPO).

Advantages of model-based: rapid learning, sample efficiency. Disadvantages: model bias, complexity of planning.


On-policy vs Off-policy; Sample Complexity

  • On-policy methods (e.g., REINFORCE, PPO, A2C): data collected by the current policy used to update itself. Safer/stabler but less sample-efficient because they cannot reuse old data as effectively.
  • Off-policy methods (e.g., Q-learning, DQN, DDPG, SAC): can learn from data collected with different policies (e.g., stored in replay buffers), enabling reuse of samples and better sample efficiency.

Sample complexity is a critical issue: many deep RL methods require millions of environment interactions. Offline (batch) RL addresses learning solely from pre-collected datasets.


Exploration vs exploitation

A core RL dilemma: choose between:

  • Exploitation: choose known best action to maximize immediate reward.
  • Exploration: try uncertain actions to discover potentially better long-term rewards.

Common exploration strategies:

  • ε-greedy (discrete actions)
  • Softmax / Boltzmann sampling
  • Action-noise (continuous actions): Gaussian, Ornstein-Uhlenbeck
  • Entropy regularization (policy gradient / SAC)
  • Intrinsic motivation / curiosity: reward internal novelty signals (prediction error, state-visitation counts, pseudo-counts)
  • Posterior sampling (Thompson sampling); Bayesian RL

Sparse rewards are particularly challenging; intrinsic rewards and shaped rewards are common remedies.


Function approximation and the "deadly triad"

When using function approximators (neural nets) with bootstrapping and off-policy updates, instability or divergence can occur. This combination is called the deadly triad:

  1. Function approximation
  2. Bootstrapping (using current estimates to update)...

Ready to see the full tree?

Clone the preview to open the complete learning structure, practice tools, and generated study materials.