Reinforcement Learning
Foundations of reinforcement learning — Markov decision processes, the Bellman equation, policy improvement, and the SARSA/Q-learning algorithms, with parallel empirical and theory tracks.
By David Quarel (Australian National University), Leon Lang (Iliad)
What you’ll learn
- Understand Markov Decision Processes and the goal of the agent, for known environments.
- Understand the Bellman equation.
- Understand the policy improvement theorem, and how we can use it to iteratively solve for an optimal policy.
- Empirical track: Understand the anatomy of a gym.Env, so that you feel comfortable using them and writing your own.
- Empirical track: Understand SARSA and Q-learning and the difference between on-policy and off-policy methods.
- Empirical track: Implement SARSA and Q-Learning, and compare them on different environments.
- Empirical track: Understand the TD(λ) algorithm, and how it can be used to mix short- and long-timescale updates.
- Theory track: Prove properties involving the Bellman equations, including the existence of optimal policies, the policy improvement theorem, rate of convergence of Bellman updates, and the convergence of Q-learning.
Overview
We provide a brief introduction to Reinforcement Learning from the fundamentals, covering tabular RL (chapters 2-4 of Sutton and Barto) in two streams. The empirical stream directly follows Day 1 of ARENA and covers implementing policy iteration/evaluation, Q-learning, and SARSA for toy gridworld environments in Python. The theory stream proves a series of results including the Bellman equations, the convergence of policy iteration and its rate, and the convergence of Q-learning, and derives an analytic solution to the Bellman equation.
- Theory Workshop material produced by Leon Lang and David Quarel
- Lecture slides taken from the ARENA program (authored by David Quarel)
- Empirical exercises taken from Day 1 of the ARENA program
- Content was delivered by David Quarel
Prerequisites
- No prerequisites are assumed. All the RL material is self-contained.
- The theory track requires familiarity with proofs.
Content
Fast track
This material is hard to fast-track. The best approach is to read and understand the lecture slides, and then:
- Empirical track: Run the code, understand the solutions for policy iteration/improvement (not the exact form), and read and run the Q-learning code.
- Theory track: Just read the solutions for Problems 1 and 4 and make sure you understand each step.
Main content
-
- This has everything that both empirical and theory track should need for reference.
- Empirical track: Work through ARENA RL Day 1
- Theory track: Work through the problem sheet
(static compiled versions of lecture slides + theory exercises here)
Learn more
-
- Chapter 3: Sections 3.1, 3.2, 3.3, 3.4, 3.5, 3.6
- Chapter 4: Sections 4.1, 4.2, 4.3, 4.4
- Chapter 6: Sections 6.1, 6.3 (especially Example 6.4)
- Note that Section 6.1 talks about temporal difference (TD) updates for the value function V. We will instead be using TD updates for the Q-value Q.
- Don't worry about the references to Monte Carlo in Chapter 5.
-
-
- Q-Learning: The original paper where Q-learning is first described. Kind of a hard read. Probably not worth bothering with.
- David Silver’s RL Lecture series
- Parts 1 and 2
- David Quarel’s RL lecture for ARENA (I (David) delivered this exact same lecture on the day)
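The reading notes above point out that the workshop applies temporal difference (TD) updates to the Q-value Q rather than to the value function V. As a rough illustration of what that means, here is a minimal tabular sketch in Python of the SARSA (on-policy) and Q-learning (off-policy) updates. The function names, the dictionary-based Q-table, the toy states "s0" and "s1", and the values of alpha and gamma are all assumptions made for this example. They are not taken from the ARENA exercises.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update: bootstrap from the action actually taken in s_next."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy TD update: bootstrap from the greedy (max-valued) action in s_next."""
    td_target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])


# Hypothetical two-action example. Unseen (state, action) pairs default to 0.0,
# which is also the value a terminal state should keep.
actions = [0, 1]
Q_sarsa = defaultdict(float)
Q_qlearn = defaultdict(float)
for Q in (Q_sarsa, Q_qlearn):
    Q[("s1", 0)] = 1.0
    Q[("s1", 1)] = 5.0

# Same transition (s0, action 0, reward 1.0, s1) for both methods.
# SARSA is told the agent then took action 0 in s1 (say, an exploratory move).
sarsa_update(Q_sarsa, "s0", 0, 1.0, "s1", 0)
# Q-learning ignores the action actually taken and uses the best one in s1.
q_learning_update(Q_qlearn, "s0", 0, 1.0, "s1", actions)

print(Q_sarsa[("s0", 0)], Q_qlearn[("s0", 0)])
```

With these numbers the SARSA estimate moves towards 1 + 0.9 × 1 and the Q-learning estimate towards 1 + 0.9 × 5, so the two methods differ exactly when the next action taken is not the greedy one.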