---
title: Reinforcement Learning
cluster: D
contributors:
  - David Quarel (Australian National University)
  - Leon Lang (Iliad)
summary: Foundations of reinforcement learning — Markov decision processes, the
  Bellman equation, policy improvement, and the SARSA/Q-learning algorithms,
  with parallel empirical and theory tracks.
learningOutcomes:
  - Understand Markov Decision Processes and the goal of the agent, for known
    environments.
  - Understand the Bellman equation.
  - Understand the policy improvement theorem, and how we can use it to
    iteratively solve for an optimal policy.
  - "Empirical track: Understand the anatomy of a gym.Env, so that you feel
    comfortable using environments and writing your own."
  - "Empirical track: Understand SARSA and Q-learning and the difference between
    on-policy and off-policy methods."
  - "Empirical track: Implement SARSA and Q-Learning, and compare them on
    different environments."
  - "Empirical track: Understand the TD(λ) algorithm, and how it can be used to
    mix over short and long timescale updates."
  - "Theory track: Prove properties involving the Bellman equations, including
    the existence of optimal policies, the policy improvement theorem, rate of
    convergence of Bellman updates, and the convergence of Q-learning."
---
## Overview

We provide a brief introduction to Reinforcement Learning from the fundamentals, covering tabular RL (chapters 2-4 of Sutton and Barto) in two streams. The empirical stream directly follows [Day 1 of ARENA](https://learn.arena.education/chapter2_rl/01_intro_rl/) and covers implementing policy iteration/evaluation, Q-learning, and SARSA for toy gridworld environments in Python. The theory stream proves a series of results including the Bellman equations, the convergence of policy iteration and its rate, and the convergence of Q-learning, and derives an analytic solution to the Bellman equation.
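
As a rough illustration of the analytic solution to the Bellman equation mentioned above, the sketch below evaluates a fixed policy in two ways: by solving the linear system $(I - \gamma P)V = r$ directly, and by repeating the Bellman update until it converges. The 3-state transition matrix `P`, reward vector `r`, discount `gamma`, and both function names are made up for this example and are not taken from the workshop material.

```python
import numpy as np

# Hypothetical 3-state process induced by some fixed policy (illustrative only).
# P[s, s2] is the probability of moving from state s to state s2 under the policy,
# r[s] is the expected immediate reward in state s. State 2 is absorbing.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])
r = np.array([1.0, 2.0, 0.0])
gamma = 0.9


def evaluate_exact(P, r, gamma):
    """Solve the linear system (I - gamma * P) V = r for V."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, r)


def evaluate_iterative(P, r, gamma, tol=1e-10, max_iters=10_000):
    """Repeat the Bellman update V <- r + gamma * P V until it stops changing."""
    V = np.zeros_like(r)
    for _ in range(max_iters):
        V_new = r + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V


V_exact = evaluate_exact(P, r, gamma)
V_iter = evaluate_iterative(P, r, gamma)
print("exact:    ", V_exact)
print("iterative:", V_iter)
```

The two results should agree up to the tolerance. The direct solve costs roughly cubic time in the number of states, which is one reason iterative updates are preferred for larger problems.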

-   Theory Workshop material produced by Leon Lang and David Quarel
    
-   Lecture slides taken from the ARENA program (authored by David Quarel)
    
-   Empirical exercises taken from Day 1 of the ARENA program
    
-   Content was delivered by David Quarel
    

## Prerequisites

-   No prerequisites are assumed. All the RL material is self-contained.
    
-   The theory track requires familiarity with proofs.
    

## Content

### Fast track

This material is hard to fast-track. The best approach is to read and understand the lecture slides, and then:

-   Empirical track: Run the code and understand the solutions for policy iteration/improvement (the ideas, not the exact form), then read and run Q-learning
    
-   Theory track: Just read the solutions for Problems 1 and 4 and make sure you understand each step
    

### Main content

-   [Self-contained lecture notes](https://www.overleaf.com/read/nzcgzksgzgvz#4f4bf3)
    
    -   This has everything that both the empirical and theory tracks should need for reference.
        
-   **Empirical track:** Work through [ARENA RL Day 1](https://learn.arena.education/chapter2_rl/01_intro_rl/intro)
    
-   **Theory track:** Work through the [problem sheet](https://www.overleaf.com/read/wcsvcksbnkzr#86bb1e)
    

Static compiled versions of the lecture slides and theory exercises are available [here](https://drive.google.com/drive/folders/1Qrh7OIvu6Fh2CQFuFNyTJGKqL2g4yW9N?usp=drive_link).

### Learn more

-   [Sutton & Barto](/uploads/reinforcement-learning/BartoSutton.pdf)
    
    -   Chapter 3: Sections 3.1, 3.2, 3.3, 3.4, 3.5, 3.6
        
    -   Chapter 4: Sections 4.1, 4.2, 4.3, 4.4
        
    -   Chapter 6: Sections 6.1, 6.3 (especially Example 6.4)
        
        -   Note that Section 6.1 talks about temporal difference (TD) updates for the value function V. We will instead be using TD updates for the Q-value Q.
            
        -   Don't worry about the references to Monte Carlo in Chapter 5.
            
-   [Q-Learning](https://link.springer.com/article/10.1007/BF00992698): The original paper where Q-learning is first described. Kind of a hard read. Probably not worth bothering with.
    
-   [David Silver’s RL Lecture series](https://www.youtube.com/playlist?list=PLzuuYNsE1EZAXYR4FJ75jcJseBmo4KQ9-)
    
    -   Parts 1 and 2
        
-   [David Quarel’s RL lecture for ARENA](https://www.youtube.com/watch?v=Q2cFcq3I0G8) (David delivered this same lecture on the day)
    
-   [The entire RL chapter of ARENA](https://learn.arena.education/chapter2_rl/)
    
-   [A good summary of SOTA algorithms for RL](https://aweers.de/blog/2026/rl-for-llms/)
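
To make the note on Section 6.1 above more concrete, here is a minimal sketch of TD updates applied to a Q-table rather than to V, in the SARSA and Q-learning variants. The function names, the step size `alpha`, the discount `gamma`, and the tiny 2-state, 2-action table are assumptions chosen for illustration. They are not the ARENA exercise API, and terminal-state handling is omitted.

```python
import numpy as np


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update: bootstrap from the action a_next actually taken."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])


def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy TD update: bootstrap from the greedy action in s_next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])


# Hypothetical Q-table: rows are states, columns are actions.
Q_sarsa = np.array([[0.0, 0.0], [1.0, 5.0]])
Q_qlearn = Q_sarsa.copy()

# Same transition for both: s=0, a=0, r=1, s_next=1.
# The behaviour policy happened to pick the non-greedy action a_next=0.
sarsa_update(Q_sarsa, s=0, a=0, r=1.0, s_next=1, a_next=0)
q_learning_update(Q_qlearn, s=0, a=0, r=1.0, s_next=1)

print("SARSA      Q[0, 0]:", Q_sarsa[0, 0])
print("Q-learning Q[0, 0]:", Q_qlearn[0, 0])
```

The only difference between the two updates is the bootstrap term: SARSA uses the value of the action the agent actually takes next, while Q-learning uses the maximum over actions. That is the on-policy versus off-policy distinction in code form.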