Iliad Intensive Curriculum

Overview

We turn the focus from weight space to training data and ask: how can we measure which training examples cause which model behaviors? After discussing the general role of data attribution in alignment, we frame this as a technical question: how can we understand the counterfactual impact of perturbing, and specifically reweighting, individual data points? We then develop three frameworks, each of which makes the problem tractable by interpreting the map from data to trained model differently: influence functions (as an implicit function of data weights at a unique minimum), Bayesian influence functions (as a posterior distribution over parameters), and unrolling (as a concrete optimization trajectory). These turn out to be closely connected: influence functions emerge as a limiting case of both alternatives, and the degeneracy phenomena studied on the SLT day reappear in understanding where and why the classical theory breaks down.
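To make the reweighting question concrete, here is a minimal sketch of the classical influence-function formula on a toy problem. It is not taken from the lecture notes: the ridge-regression setup, the variable names (`fit`, `query_loss`, `lam`) and the data are illustrative assumptions, chosen because the minimum is unique and retraining is available in closed form. For a weighted objective sum_i w_i * l_i(theta) with Hessian H at the minimum theta*(w), implicit differentiation gives d f(theta*(w)) / d w_i = -grad f^T H^{-1} grad l_i, and the code compares this against a finite-difference estimate obtained by actually refitting.

```python
import numpy as np

# Toy data (hypothetical): linear regression with a small ridge term.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
lam = 1e-2  # assumed ridge strength; keeps the minimum unique


def fit(w):
    """Minimize sum_i w_i * 0.5 * (x_i . theta - y_i)^2 + 0.5 * lam * |theta|^2."""
    H = X.T @ (w[:, None] * X) + lam * np.eye(d)
    return np.linalg.solve(H, X.T @ (w * y))


def query_loss(theta, x_q, y_q):
    """Observable f(theta): squared error on a single query point."""
    return 0.5 * (x_q @ theta - y_q) ** 2


w0 = np.ones(n)                       # unperturbed data weights
theta0 = fit(w0)
x_q, y_q = rng.normal(size=d), 0.3    # hypothetical query point

# Influence of every training point on the query loss:
#   d f / d w_i = -grad f^T H^{-1} grad l_i, evaluated at theta0.
H = X.T @ X + lam * np.eye(d)
grad_f = (x_q @ theta0 - y_q) * x_q
per_example_grads = (X @ theta0 - y)[:, None] * X   # shape (n, d)
influence = -per_example_grads @ np.linalg.solve(H, grad_f)

# Check one entry against retraining with a slightly perturbed weight.
i, eps = 7, 1e-5
w_plus, w_minus = w0.copy(), w0.copy()
w_plus[i] += eps
w_minus[i] -= eps
fd = (query_loss(fit(w_plus), x_q, y_q)
      - query_loss(fit(w_minus), x_q, y_q)) / (2 * eps)

print("influence formula:", influence[i])
print("finite difference:", fd)
```

The two numbers should agree closely here because the toy loss has a unique, non-degenerate minimum. When the Hessian is singular or the model is not trained to a minimum, the inverse in the formula is no longer well defined, which is the kind of breakdown the Bayesian and unrolling viewpoints are meant to address.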

Prerequisites

Having seen the SLT day beforehand is valuable.
The Mechinterp + Training dynamics day is useful to have seen, but not essential.
Technical knowledge: see 'Prerequisites' in the lecture notes.

Content

Fast track

Chapter 1 is a general introduction. Section 1.4 is important; the rest can be skipped. Then, depending on interest, read the first part of each subsequent chapter (IF, BIF, Unrolling) and/or dive deeper into the ones you find interesting.
