Iliad

Reward Learning Theory

Cluster A

The theoretical foundations of reward learning: how RLHF can, under strong assumptions, recover an aligned objective; why underspecification and misspecification break that story; and how reward learning can be embedded into the framework of assistance games.

By Leon Lang (Iliad), Joar Skalse (Deducto Limited, King's College London)

What you’ll learn

  • Understand where reward learning fits in the AI alignment landscape
  • Reason about the (speculative) assumptions that motivate this research area as a whole
  • Understand the strong conditions under which the simplest reward learning technique, RLHF via trajectory comparisons, would lead to an aligned objective
  • Reason about situations where such conditions do not hold, like underspecification and misspecification, and efforts to account for, or correct, such issues
  • Understand how reward learning can be embedded into the framework of assistance games, which can be regarded as one conceptualization of the entire alignment problem

Overview

Reward learning is one framework for approaching the outer alignment problem: instead of specifying a reward function directly, we learn one from observations of human behavior, most prominently from trajectory comparisons in reinforcement learning from human feedback (RLHF). We reason through the assumptions that motivate the research area as a whole, and we show that RLHF leads to an outer-aligned objective under the strong additional condition of human Boltzmann rationality. We then discuss ways in which real humans deviate from such ideal conditions, including limited competence or capacity to evaluate complex behaviors on difficult tasks. We conclude by discussing how reward learning can be embedded into the framework of assistance games, which is one potential, but controversial, conceptualization of the entire alignment problem.
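To make the Boltzmann rationality condition concrete, here is a minimal Python sketch of the standard comparison model commonly used in RLHF: the human prefers trajectory A over trajectory B with probability given by a logistic function of the difference in their returns. The function names, the inverse-temperature parameter `beta`, and the example numbers are illustrative assumptions, not taken from the lecture materials.

```python
import math


def boltzmann_preference(return_a: float, return_b: float, beta: float = 1.0) -> float:
    """Probability that a Boltzmann-rational human prefers trajectory A over B.

    `return_a` and `return_b` are the true returns of the two trajectories;
    `beta` (an assumed inverse temperature) controls how noisy the human is.
    """
    return 1.0 / (1.0 + math.exp(-beta * (return_a - return_b)))


def comparison_loss(learned_a: float, learned_b: float,
                    human_prefers_a: bool, beta: float = 1.0) -> float:
    """Negative log-likelihood of one human comparison under learned returns.

    Minimizing this loss over many comparisons is the usual way a reward
    model is fit to trajectory comparisons.
    """
    p = boltzmann_preference(learned_a, learned_b, beta)
    return -math.log(p if human_prefers_a else 1.0 - p)


if __name__ == "__main__":
    # Equal returns give an indifferent human.
    print(boltzmann_preference(1.0, 1.0))  # 0.5
    # A higher return for A makes A more likely to be chosen, but not certain.
    print(boltzmann_preference(2.0, 1.0))
```

Note that the choice probability depends only on the difference in returns, so even in this idealized setting the comparisons cannot distinguish between return functions that differ by a constant shift.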

Leon Lang wrote and taught this module. Joar Skalse gave a guest lecture on the theory of reward learning.

Prerequisites

It is useful to know the basics of reinforcement learning, reinforcement learning from human feedback (RLHF), and AI alignment, as taught in other modules of this course.

Content

Fast track

Read the slides from the first lecture. If you have more time, do the first exercise, on partial observability as one instance of misspecification.

Main content

Learn more

Here we list further readings, which are largely papers mentioned in Leon's lecture slides.