---
title: Reward Learning Theory
cluster: A
summary: The theoretical foundations of reward learning — how RLHF can under
  strong assumptions recover an aligned objective, why underspecification and
  misspecification break that story, and how reward learning can be embedded
  into the framework of assistance games.
contributors:
  - Leon Lang (Iliad)
  - Joar Skalse (Deducto Limited, King's College London)
learningOutcomes:
  - Learn the place of reward learning in the AI alignment landscape
  - Reason about the (speculative) assumptions that motivate this research area
    as a whole
  - Understand the strong conditions under which the simplest reward learning
    technique, RLHF via trajectory comparisons, would lead to an aligned
    objective
  - Reason about situations where such conditions do not hold, like
    underspecification and misspecification, and efforts to account for, or
    correct, such issues
  - Understand how reward learning can be embedded into the framework of
    assistance games, which can be regarded as one conceptualization of the
    entire alignment problem
---
## Overview

Reward learning is a framework for approaching the outer alignment problem: instead of specifying a reward function directly, we learn one from observations of human behavior, most prominently from trajectory comparisons in reinforcement learning from human feedback (RLHF). We reason through the assumptions that motivate the research area as a whole, and we show that RLHF leads to an outer-aligned objective under the strong additional condition of human Boltzmann rationality. We then discuss ways in which real humans deviate from such ideal conditions, including limited competence or capacity to evaluate complex behaviors on difficult tasks. We conclude by discussing how reward learning can be embedded into the framework of assistance games, which is one potential, but controversial, conceptualization of the entire alignment problem.
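
To make the Boltzmann-rationality condition concrete, here is a minimal sketch. It assumes the standard choice model in which the human prefers trajectory A over trajectory B with probability `sigmoid(beta * (G(A) - G(B)))`, where `G` is a trajectory's return under the human's true reward function. The function names and the inverse-temperature parameter `beta` are illustrative assumptions, not taken from the lecture slides.

```python
import math


def preference_probability(return_a: float, return_b: float, beta: float = 1.0) -> float:
    """Probability that a Boltzmann-rational human prefers trajectory A over B.

    Assumption: the human compares the true returns of both trajectories
    and chooses with logistic noise controlled by beta.
    """
    return 1.0 / (1.0 + math.exp(-beta * (return_a - return_b)))


def recovered_return_gap(p_a_over_b: float, beta: float = 1.0) -> float:
    """Invert the choice model: the log-odds divided by beta give G(A) - G(B)."""
    return math.log(p_a_over_b / (1.0 - p_a_over_b)) / beta


# Hypothetical returns for two trajectories.
p = preference_probability(3.0, 1.0, beta=0.5)
gap = recovered_return_gap(p, beta=0.5)
print(f"P(A preferred) = {p:.3f}, recovered return gap = {gap:.3f}")
```

Under this model, the log-odds of a comparison equal `beta` times the return difference. With a known `beta` and enough comparisons, the preference probabilities therefore determine trajectory returns up to an additive constant, which does not change the ordering of policies. Real human feedback need not follow this model, which is where the issues discussed next come in.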

Leon Lang wrote and taught this module. Joar Skalse gave a guest lecture on the theory of reward learning.

## Prerequisites

It is useful to know the basics of reinforcement learning, reinforcement learning from human feedback (RLHF), and AI alignment, as taught in other modules of this course.

## Content

### Fast track

Read the slides of the first lecture. If you have more time, do the first exercise on partial observability, which is one instance of misspecification.

### Main content

-   [Lecture slides: Intro to Reward Learning Theory by Leon Lang](https://docs.google.com/presentation/d/1cG2M68gj8osmrse97bza9cgCqqwoFXBvDIGpkv-HVgA/edit?usp=drive_link)
    
-   [Reward Learning Theory: Exercises](/uploads/reward-learning/porlhf_exercises_with_solutions.pdf) (also available [without solutions](/uploads/reward-learning/porlhf_exercises_WITHOUT_solutions.pdf)). The exercise sheet is based on the following two papers:
    
    -   [When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2402.17747)
        
    -   [Benefits of Assistance over Reward Learning](https://people.eecs.berkeley.edu/~russell/papers/neurips20ws-assistance)
        
-   [Lecture slides: Towards a Formal Theory of Reward Learning, With Application to Inverse Reinforcement Learning](/uploads/reward-learning/Joar_Skalse_-_Reward_Learning_Theory.pdf) by Joar Skalse. More context:
    
    -   [The Theoretical Reward Learning Research Agenda: Introduction and Motivation](https://www.lesswrong.com/s/TEybbkyHpMEB2HTv3/p/pJ3mDD7LfEwp3s5vG) by Joar Skalse
        

### Learn more

Here we list further readings, which are largely papers mentioned in Leon's lecture slides.

-   [Faulty reward functions in the wild](https://openai.com/index/faulty-reward-functions/?video=745142691): A basic reward specification problem.
    
-   Reward learning methods and frameworks:
    
    -   [Deep reinforcement learning from human preferences](https://arxiv.org/abs/1706.03741): The most basic and popular approach to reward learning
        
    -   [Reward-rational (implicit) choice: A unifying formalism for reward learning](https://arxiv.org/abs/2002.04833): A generalization that contains RLHF as a special case
        
        -   [Algorithms for Inverse Reinforcement Learning](/uploads/reward-learning/icml00-irl.pdf): Another special case
            
        -   [Preferences Implicit in the State of the World](https://arxiv.org/abs/1902.04198): Another special case
            
        -   [Cooperative Inverse Reinforcement Learning](https://arxiv.org/abs/1606.03137): A *generalization*
            
        -   [Benefits of Assistance over Reward Learning](https://people.eecs.berkeley.edu/~russell/papers/neurips20ws-assistance): An adapted framework for said generalization
            
-   Underspecification and misspecification:
    
    -   [Occam's razor is insufficient to infer the preferences of irrational agents](https://proceedings.neurips.cc/paper/2018/hash/d89a66c7c80a29b1bdbab0f2a1a94af8-Abstract.html): A case of underspecification of the relationship between the human's reward function and the human's policy.
        
    -   [When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2402.17747): The misspecification here is that humans are assumed to observe the environment fully even though they observe it only partially.
        
    -   [Modeling Human Beliefs about AI Behavior for Scalable Oversight](https://arxiv.org/abs/2502.21262): An approach that corrects for the previous misspecification by modeling human beliefs; this can, however, sometimes lead to an *underspecification* in which the reward function remains too uncertain to be learned safely.
        
    -   [AI Alignment with Changeable and Influenceable Reward Functions](https://arxiv.org/abs/2405.17713): This paper breaks with the typical assumption of a fixed reward function.
        
    -   [Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback](https://arxiv.org/abs/2404.10271): This paper breaks with the typical assumption of a *single* reward function.
        
        -   Potential answer: [Collective Constitutional AI: Aligning a Language Model with Public Input](https://www.anthropic.com/research/collective-constitutional-ai-aligning-a-language-model-with-public-input)
            
-   On how reward learning falls within AI alignment:
    
    -   Alignment targets
        
        -   [OpenAI Model Spec](https://model-spec.openai.com/2025-12-18.html)
            
        -   [Claude's new constitution](https://www.anthropic.com/news/claude-new-constitution)
            
        -   [Coherent Extrapolated Volition](/uploads/reward-learning/CEV.pdf)
            
    -   The reward learning agenda assumes that human values can be expressed via reward functions. However, some papers argue otherwise:
        
        -   [The Reward Hypothesis is False](https://openreview.net/forum?id=5l1NgpzAfH)
            
    -   Even if human values are captured by a reward function, it remains contentious whether we should attempt to *decompose* the alignment problem into first learning said reward function and then optimizing it:
        
        -   [Inner and outer alignment decompose one hard problem into two extremely hard problems](https://www.lesswrong.com/posts/gHefoxiznGfsbiAu9/inner-and-outer-alignment-decompose-one-hard-problem-into)
            
        -   [Reward is not the optimization target](https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target)
            
    -   Even if we solved outer alignment via reward learning, we would still be left with the inner alignment problem of finding a policy that "cares for" this objective:
        
        -   [Risks from Learned Optimization in Advanced Machine Learning Systems](https://arxiv.org/abs/1906.01820)
            
        -   [Reinforcement Learning textbook](/uploads/reward-learning/RLbook2020.pdf): This book covers reinforcement learning, which can be regarded as a very basic conceptualization of the *inner* alignment problem.
            
        -   [Goal misgeneralization in Deep Reinforcement Learning](https://arxiv.org/abs/2105.14111)
            
-   Learn more in Joar Skalse's sequence on [the theoretical foundations of reward learning](https://www.lesswrong.com/s/TEybbkyHpMEB2HTv3).
    
-   I can also recommend reading [the work of Anca Dragan](https://scholar.google.com/citations?user=UgHB5oAAAAAJ&hl=en) and her many students on reward learning.
