Iliad Intensive Curriculum

Overview

Reward learning is one framework for approaching the outer alignment problem: instead of specifying a reward function directly, we learn one from observed human behavior, most prominently from trajectory comparisons in reinforcement learning from human feedback (RLHF). We reason through the assumptions that motivate the research area as a whole, and we show that RLHF leads to an outer-aligned objective under the strong additional condition of human Boltzmann rationality. We then discuss ways in which real humans deviate from such ideal conditions, including limited competence or capacity to evaluate complex behaviors in difficult tasks. We conclude by discussing how reward learning can be embedded into the framework of assistance games, which is one potential, but controversial, conceptualization of the entire alignment problem.
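As a rough illustration of the Boltzmann rationality condition, the sketch below shows the preference model commonly used for trajectory comparisons in RLHF: the human is assumed to prefer trajectory A over trajectory B with probability proportional to the exponentiated return of A. The function names, the inverse temperature `beta`, and the discount factor `gamma` are assumptions made for this example and are not taken from the lecture slides.

```python
import math


def trajectory_return(rewards, gamma=1.0):
    """Discounted sum of per-step rewards along one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def preference_probability(return_a, return_b, beta=1.0):
    """Probability that a Boltzmann-rational human prefers A over B.

    Equals exp(beta * G_A) / (exp(beta * G_A) + exp(beta * G_B)),
    written here as a sigmoid of the return difference.
    beta is an assumed inverse temperature: larger beta means the
    human's choices track the true returns more reliably.
    """
    return 1.0 / (1.0 + math.exp(-beta * (return_a - return_b)))


def comparison_loss(return_a, return_b, human_prefers_a, beta=1.0):
    """Negative log-likelihood of one human comparison under the model.

    A reward model trained on comparisons would minimize the sum of
    this loss over the dataset, with the returns computed from the
    learned reward function.
    """
    p = preference_probability(return_a, return_b, beta)
    return -math.log(p if human_prefers_a else 1.0 - p)
```

Under this model, the choice probabilities depend only on the difference in returns, which is what lets the human's comparisons carry information about the underlying reward function. When real humans deviate from the model, for example because they only partially observe the trajectories, that inference can go wrong.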

Leon Lang wrote and taught this module. Joar Skalse gave a guest lecture on the theory of reward learning.

Prerequisites

It is useful to know the basics of reinforcement learning, reinforcement learning from human feedback (RLHF), and AI alignment, as taught in other modules of this course.

Content

Fast track

Simply read the lecture slides of the first lecture. If you have more time, do the first exercise on partial observability as one instantiation of misspecification.

Main content

Lecture slides: Intro to Reward Learning Theory by Leon Lang
Reward Learning Theory: Exercises (also available without solutions). The exercise sheet is based on the following two papers:
- When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback
- Benefits of Assistance over Reward Learning
Lecture slides: Towards a Formal Theory of Reward Learning, With Application to Inverse Reinforcement Learning by Joar Skalse. More context:
- The Theoretical Reward Learning Research Agenda: Introduction and Motivation by Joar Skalse

Learn more

Here we list further readings, which are largely papers mentioned in Leon's lecture slides.

Faulty reward functions in the wild: A basic reward specification problem.
Reward learning methods and frameworks:
- Deep reinforcement learning from human preferences — the most basic and popular approach to reward learning
- Reward-rational (implicit) choice: A unifying formalism for reward learning — a generalization that contains RLHF as a special case
  - Algorithms for Inverse Reinforcement Learning: Another special case
  - Preferences Implicit in the State of the World: Another special case
  - Cooperative Inverse Reinforcement Learning: A generalization
  - Benefits of Assistance over Reward Learning: An adapted framework for said generalization
Underspecification and misspecification:
- Occam's razor is insufficient to infer the preferences of irrational agents: This is a case of underspecification of the relationship between the human's reward function and the human's policy.
- When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback: In this misspecification, humans are assumed to fully observe the environment even though they only observe it partially.
- Modeling Human Beliefs about AI Behavior for Scalable Oversight: An approach to correct for the previous misspecification via human models. This can, however, sometimes lead to an underspecification in which the reward function is too uncertain to be safely learned.
- AI Alignment with Changeable and Influenceable Reward Functions: This paper breaks with the typical assumption of a fixed reward function.
- Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback: This paper breaks with the typical assumption of a single reward function.
  - Potential answer: Collective Constitutional AI: Aligning a Language Model with Public Input
On how reward learning falls within AI alignment:
- Alignment targets
- The reward learning agenda assumes that human values can be expressed via reward functions. However, some argue otherwise:
  - The Reward Hypothesis is False
- Even if human values are captured by a reward function, it is still contentious whether we should attempt to decompose the alignment problem into first learning said reward function and then optimizing it:
  - Inner and outer alignment decompose one hard problem into two extremely hard problems
  - Reward is not the optimization target
- Even if we solved outer alignment via reward learning, we would still be left with the inner alignment problem of finding a policy that "cares for" this objective:
  - Risks from Learned Optimization in Advanced Machine Learning Systems
  - Reinforcement Learning textbook: Reinforcement learning can be regarded as a very basic conceptualization of the inner alignment problem.
  - Goal misgeneralization in Deep Reinforcement Learning
You can also learn more in Joar Skalse's sequence on the theoretical foundations of reward learning.
I also recommend reading the work of Anca Dragan and her many students on reward learning.