Iliad Intensive Curriculum

Overview

A central challenge in alignment is that we must reason about the behavior of superintelligent agents before such systems exist and before we can learn from mistakes. This module develops formal tools for doing so across several agendas: coherence arguments and the Complete Class Theorem, Löb's theorem and the Löbian obstacle to safe self-modification, tiling agents and Vingean reflection, logical induction and reasoning under logical uncertainty, functional and updateless decision theory, and the thermodynamics of optimization.

Daniel Chiang and Satya Benson created this day’s content. Daniel taught it in person. Cole Wyeth and Aram Ebtekar gave guest lectures.

Prerequisites

Content on AIXI (idealised agency day) should already be known.
A wide variety of background knowledge is helpful for understanding the different agendas: probability theory helps with money pumps and complete class theorems; formal logic helps with tiling agents and Vingean reflection; computability theory helps with reflective oracles; and information theory, causality, and statistical mechanics help with the section on descriptive agent foundations.
In general, understanding the motivation of agent foundations research (such as the rocket alignment problem) is out of scope.

Content

Fast track

To achieve the most important learning outcomes within one hour, read:

Embedded agency — read up to and including section 3.3, then read section 4.1.
Why agent foundations
Reflective consistent degree of freedom

Main content

The main content is hosted on the website, which contains a "Why agent foundations" section (different from the post of the same name above) that gives motivation and an introduction. The website also contains refreshers on the mathematical prerequisites. These aren’t included in the links below, which otherwise cover all of the website’s agent foundations content, plus additional slides that are not found on the website.

Start with the lecture slides which give a bird's eye view of the agent foundations landscape and what the key open problems are. Then continue with the following fundamental readings:

Embedded agency — read up to and including section 3.3, then read section 4.1.
Why agent foundations — read entirely.
Reflective consistent degree of freedom — read entirely.
General Purpose Search — read entirely.

Next, do the agency exercises.

Now, choose one of the following six topics and read the corresponding reading list.

Consequentialist Foundations

Description: One of the central challenges in aligning a future AI is that we have to reason about systems that don't yet exist and that are potentially far more capable than anything we've seen. Coherence arguments give us an initial foothold. The basic insight is that if an agent's preferences are inconsistent (say, it prefers A to B, B to C, and C to A), a clever adversary can cycle it through trades that each look locally acceptable but leave it strictly worse off, wasting resources for no gain. Any agent that reliably avoids such "dominated strategies" must behave as if it has consistent preferences representable by a utility function. The Complete Class Theorem sharpens this: any decision strategy that is Pareto-optimal across all possible environments must be equivalent to maximizing expected utility under some probability distribution. Together, these results suggest that sufficiently capable, non-self-defeating agents will generically look like Bayesian expected utility maximizers from the outside.
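The money pump described above can be made concrete with a small sketch. Everything in it is a hypothetical illustration rather than something taken from the readings: the item names, the fee, and the helper names `accepts_trade` and `run_money_pump` are assumptions chosen for the example.

```python
# Hypothetical toy agent with cyclic preferences: A over B, B over C, C over A.
# Each pair (x, y) means "the agent strictly prefers x to y".
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}


def accepts_trade(current, offered):
    """The agent swaps whenever it strictly prefers the offered item."""
    return (offered, current) in PREFERS


def run_money_pump(start_item, money, fee, offers):
    """Offer items in sequence; every accepted swap costs the agent `fee`."""
    item = start_item
    for offered in offers:
        if accepts_trade(item, offered) and money >= fee:
            item, money = offered, money - fee
    return item, money


# The adversary offers B, then A, then C to an agent that starts with C.
final_item, final_money = run_money_pump("C", money=10.0, fee=1.0, offers=["B", "A", "C"])
print(final_item, final_money)  # C 7.0
```

Each swap looks like an improvement to the agent, yet it ends up holding the item it started with and is three fees poorer. An agent whose preferences contain no cycle would refuse at least one of the three offers, so the loop could not be closed.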

Reading list:

Coherent decisions imply consistent utilities - Read the following sections:
- "Introduction", "Why not circular preferences?"
- Read "Probabilities and expected utilities" up to and including the "Conditional probability" subsection
- Read "Conclusion"
The measuring stick of utility - Read Entirely
Complete class: Consequentialist foundations - Read Entirely

Löb’s theorem and tiling agents

Description: Any sufficiently advanced AI may eventually be able to modify its own code or build successor systems more capable than itself. But this raises a subtle problem: if the successor is genuinely smarter, the original agent cannot predict exactly what it will do. This means the original agent cannot verify that the successor is safe by simply simulating it — instead, it must reason abstractly about the properties of the successor's design, guaranteeing that any action the successor takes will fall within acceptable bounds, without knowing which specific action that will be. Tiling agent research formalizes this challenge in the language of mathematical logic, where it immediately runs into a fundamental obstacle. A theorem known as Löb's theorem implies that a formal system cannot, in general, trust the reasoning of another system of equal or greater logical strength — it can only trust strictly weaker systems. This creates a problem for self-improvement: an agent trying to verify its successor's reasoning appears forced to use a less powerful proof system at each step, making indefinitely long chains of safe self-modification seemingly impossible.
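For reference, the standard statement of Löb's theorem can be written as follows. The notation is an assumption made for this sketch: $T$ is a recursively axiomatized theory extending Peano Arithmetic, and $\Box_T P$ abbreviates "P is provable in T".

```latex
% T: a recursively axiomatized theory extending Peano Arithmetic.
% \Box_T P abbreviates "P is provable in T".
\begin{align*}
\text{(Löb's theorem)}\quad & \text{if } T \vdash \Box_T P \rightarrow P \text{, then } T \vdash P. \\
\text{(internalized form)}\quad & T \vdash \Box_T(\Box_T P \rightarrow P) \rightarrow \Box_T P. \\
\text{(special case } P = \bot\text{)}\quad & \text{if } T \vdash \neg \Box_T \bot \text{, then } T \vdash \bot.
\end{align*}
```

The first line says that $T$ can only endorse "if I prove P, then P" for sentences P it already proves outright, so it cannot adopt blanket trust in its own proofs. The special case is Gödel's second incompleteness theorem: a consistent $T$ cannot prove its own consistency. An agent reasoning in $T$ that wants to approve a successor also reasoning in $T$ would need exactly this kind of blanket trust, which is the source of the obstacle described above.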

Reading list:

Introduction to Löb’s theorem - Read up to and including section 3
Vingean reflection - Read Entirely
Walkthrough of Tiling agent - Read from the start up to and including "Finite Descent Problem", then read the "What self-modifying agents need" section

Logical induction

Description: Traditional Bayesian reasoning assumes logical omniscience: the agent already knows all the consequences of its beliefs and can instantly perform any computation at no cost. But a computationally bounded agent cannot do this — for instance, it may know the full source code of a program yet still be uncertain what the program outputs, or know the axioms of arithmetic yet be uncertain whether a given large number is prime. Logical induction extends the Dutch book argument underlying Bayesianism to handle this kind of logical uncertainty. Rather than requiring that no Dutch book exists at all, the logical induction criterion requires only that no efficiently computable trading strategy can exploit the agent for unbounded profit. This weaker but computationally realistic requirement turns out to be sufficient to derive many desirable properties: prices converge to coherent probabilities in the limit, the agent learns to respect logical relationships in a timely manner, and it assigns appropriate probabilities to statements that are computationally hard to evaluate even when their truth is in principle determined.
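As a rough illustration of the classical Dutch book argument that logical induction generalizes, the sketch below exploits incoherent prices on a sentence and its negation. It is not the logical induction criterion itself, which concerns efficiently computable traders acting over a whole sequence of price updates. The function name `trader_profit` and the prices used are hypothetical.

```python
def trader_profit(price_phi, price_not_phi, phi_is_true):
    """Profit from trading one share each of phi and not-phi at the quoted prices.

    A share pays 1 if its sentence is true and 0 otherwise, so the pair
    always pays out exactly 1 in total.
    """
    payout = (1.0 if phi_is_true else 0.0) + (0.0 if phi_is_true else 1.0)
    cost = price_phi + price_not_phi
    if cost < 1.0:
        return payout - cost  # buy both shares: pay `cost`, receive 1
    if cost > 1.0:
        return cost - payout  # sell both shares: receive `cost`, pay out 1
    return 0.0  # coherent prices leave nothing to exploit


# Prices 0.5 and 0.25 sum to less than 1, so buying both shares is a sure win.
print(trader_profit(0.5, 0.25, phi_is_true=True))   # 0.25
print(trader_profit(0.5, 0.25, phi_is_true=False))  # 0.25
```

The profit is the same whichever way the sentence turns out, which is what makes it a Dutch book. Logical induction relaxes the requirement: prices may be temporarily incoherent, as long as no efficiently computable trader can turn such gaps into unbounded profit.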

Reading list:

An intuitive introduction to Garrabrant Induction - Read Entirely
Logical induction paper - Read Chapters 1 and 3, Skim Chapter 4

Decision theory

Description: For a dualistic agent with a well-defined environment, optimization is straightforward: the agent simply picks whichever action maximizes expected utility. The problem for embedded agents is that their action is just another fact about the world, so there is no well-defined notion of "what would happen if I took a different action." Different decision theories correspond to different ways of constructing these counterfactuals: conditioning on the action as evidence (EDT), intervening causally on it (CDT), or reasoning about the logical consequences of being the kind of agent that runs a given decision procedure (FDT). Standard decision theories fail in characteristic ways when facing agents who can predict them, or when their action is logically correlated with the world. A particularly important desideratum is reflective consistency: an agent reflecting on its own decision procedure should endorse it, in the sense that it would not wish to self-modify into an agent running a different procedure. This matters because a decision theory is a reflectively consistent degree of freedom: an agent using a "bad" decision theory will not automatically correct this flaw as it becomes more capable, in the way it might automatically correct a mistaken factual belief. Functional and updateless decision theories have been developed to address these failure modes, with the aim of specifying a decision procedure that an agent would reflectively endorse and stably preserve across self-modification.
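To see how the choice of counterfactual changes the recommended action, here is a minimal sketch using Newcomb's problem, a standard example of facing a predictor. The payoffs, the predictor accuracy, and the function names are assumptions made for illustration; the sketch only covers EDT and CDT.

```python
BIG, SMALL = 1_000_000, 1_000  # hypothetical payoffs: opaque box, transparent box


def edt_values(accuracy):
    """EDT conditions on the action: choosing is evidence about the prediction."""
    return {
        "one_box": accuracy * BIG,
        "two_box": (1 - accuracy) * BIG + SMALL,
    }


def cdt_values(prob_full):
    """CDT intervenes on the action: the box contents are held fixed."""
    return {
        "one_box": prob_full * BIG,
        "two_box": prob_full * BIG + SMALL,
    }


def best_action(values):
    """Return the action with the highest expected utility."""
    return max(values, key=values.get)


print(best_action(edt_values(accuracy=0.99)))  # one_box
print(best_action(cdt_values(prob_full=0.5)))  # two_box
```

Under CDT, two-boxing comes out ahead by exactly `SMALL` for every value of `prob_full`, because the intervention cannot affect what is already in the box. FDT also recommends one-boxing here, but for a different reason than EDT: it treats the predictor's forecast as depending on the output of the agent's decision procedure.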

Reading list:

Functional decision theory - Read Chapters 1-5
- Alternatively, read the "An intuitive introduction to functional decision theory" sequence if you find the above confusing
Updateless Decision theory - Read Entirely
- You may reference this post for more details

Optimization and Thermodynamics

Description: A useful way to characterize powerful agents is that they reliably steer the world into a narrow region of outcome space, one that would be extremely unlikely to arise under any random process. Intuitively, optimization is a form of local entropy reduction: an agent concentrating probability mass from a broad initial distribution over possible states into a tight final distribution around a convergent target. This connects naturally to thermodynamics: many of the concepts relevant to optimization, such as entropy production, fluctuation theorems, and the thermodynamic cost of information erasure, are already well developed in frameworks such as stochastic thermodynamics, and hold far from equilibrium. Algorithmic thermodynamics (Ebtekar and Hutter) extends this by replacing Gibbs-Shannon entropy with Kolmogorov complexity, yielding laws that apply to individual states rather than ensembles. Much of the value here, especially for readers with a physics background, is less in new results than in translation: reframing familiar thermodynamic intuitions in the language of information theory connects them to concepts native to agent foundations, such as world-modeling, optimization power, and the physical constraints on embedded agents.
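The picture of optimization as concentrating probability mass can be put in numbers with a toy example. The eight-state world and both distributions below are hypothetical, and the entropy difference is only one simple way to quantify the idea, not a definition taken from the readings.

```python
import math


def shannon_entropy_bits(dist):
    """Shannon entropy, in bits, of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


# Hypothetical toy world with 8 states.
initial = [1 / 8] * 8                 # broad: uniform over all states
final = [0.5, 0.5, 0, 0, 0, 0, 0, 0]  # tight: mass concentrated on two target states

entropy_reduction_bits = shannon_entropy_bits(initial) - shannon_entropy_bits(final)
print(shannon_entropy_bits(initial))  # 3.0
print(shannon_entropy_bits(final))    # 1.0
print(entropy_reduction_bits)         # 2.0
```

Here the agent has removed two bits of uncertainty about where the world ends up. Algorithmic thermodynamics asks the analogous question about a single state by using Kolmogorov complexity in place of the ensemble entropy computed here.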

Reading list:

The ground of optimization - Read up to and including the "Relationship to Garrabrant and Demski’s Embedded Agency" section
Generalized heat engine - Read Entirely
Algorithmic thermodynamics and 3 types of optimization - Read Entirely
Optional: When bits of optimization implies bits of modelling

Descriptive Agent foundations

Description: While much of agent foundations focuses on normative questions about what an ideal, perfectly rational agent should look like, descriptive agent foundations instead asks what kinds of agents actually arise in the physical world and how to recognize them. The goal is to develop a framework that can take any system, from bacteria to neural networks to future AI systems, and identify its goals, world-model, and decision-making structure. This involves formalizing core components of agency such as optimization, meaning processes that reliably steer the world toward a narrow set of outcomes, as well as world-models and general-purpose search, while grounding them in physical constraints like modularity and computational limitations. Methodologically, descriptive foundations work bottom-up: rather than starting from ideal principles, they begin with properties of the world such as selection pressures and resource bounds, and derive what kinds of agent-like structures are likely to emerge. Selection theorems aim to formalize which agent properties are favored by these pressures, with a focus on giving a more mechanistic account of beliefs, goals, and decision-making beyond treating agents as abstract utility maximizers. Together, this line of research seeks to characterize the type signature of real-world agents and explain why agents with certain structures arise in the first place.

Reading list:

Selection theorems - Read Entirely
How we picture Bayesian agents - Read Entirely
What selection theorems do we expect/want - Read Entirely

Learn more

You may want to read the lecture slides by Cole and Aram that were taught in the in-person component:

Embedded AIXI slides: Guest lecture by Cole Wyeth
Decision theory slides: Guest lecture by Aram Ebtekar

To go deeper, read any other topic readings that interest you from the main content above. The additional readings below either extend one of the topic readings in more depth or cover agent foundations topics not introduced in the topic readings, such as infra-Bayesianism and corrigibility.
