Iliad

Data Attribution

Cluster B

Data attribution for alignment via three connected frameworks — influence functions, Bayesian influence functions, and unrolling — for measuring how reweighting training points counterfactually affects model behavior.

By Louis Jaburi (EleutherAI)

What you’ll learn

  • Internalize the importance of data attribution for alignment (see the papers in the slides)
  • Understand the chain of approximations: causality → counterfactual → Shapley values → LOO → first-order approximation of perturbations
  • Frame data attribution methods (IF, BIF, unrolling) in terms of the map B → W from the end of Section 1
  • Explain what IF, BIF, and unrolling are in terms of that map, and highlight their limitations
  • Compute influence scores, given a magical oracle that turns mathematical formulas into clean ML code
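As a rough illustration of the last two steps in that chain (LOO and its first-order approximation), here is a minimal sketch on a toy problem. Everything in it is an assumption made for illustration and is not taken from the lecture notes: a one-parameter ridge regression, the function names `fit`, `influence_scores`, and `loo_change`, and the made-up data. Because the parameter is a scalar, the inverse Hessian reduces to a division.

```python
# Hypothetical toy setup: 1-D ridge regression, y ≈ theta * x.
# Per-example loss:    l_i(theta) = 0.5 * (theta * x_i - y_i)**2
# Weighted objective:  L(theta, w) = sum_i w_i * l_i(theta) + 0.5 * lam * theta**2


def fit(xs, ys, ws, lam):
    """Exact minimizer of the weighted ridge objective (closed form)."""
    num = sum(w * x * y for x, y, w in zip(xs, ys, ws))
    den = sum(w * x * x for x, w in zip(xs, ws)) + lam
    return num / den


def influence_scores(xs, ys, x_test, y_test, lam):
    """First-order influence d f(theta*(w)) / d w_i at w = (1, ..., 1),
    where f is the squared-error loss on a single test point."""
    theta = fit(xs, ys, [1.0] * len(xs), lam)
    hessian = sum(x * x for x in xs) + lam          # d^2 L / d theta^2 (a scalar here)
    grad_test = (theta * x_test - y_test) * x_test  # d f / d theta
    scores = []
    for x, y in zip(xs, ys):
        grad_i = (theta * x - y) * x                # d l_i / d theta
        scores.append(-grad_test * grad_i / hessian)
    return theta, scores


def loo_change(xs, ys, x_test, y_test, lam, i):
    """Exact change in test loss when example i is removed and the model is refit."""
    def test_loss(theta):
        return 0.5 * (theta * x_test - y_test) ** 2

    full = fit(xs, ys, [1.0] * len(xs), lam)
    ws = [0.0 if j == i else 1.0 for j in range(len(xs))]
    return test_loss(fit(xs, ys, ws, lam)) - test_loss(full)


# Made-up data; the last training point is a deliberate outlier.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.3, 9.0]
theta, scores = influence_scores(xs, ys, x_test=1.0, y_test=2.0, lam=0.1)

for i, s in enumerate(scores):
    # Removing example i moves w_i from 1 to 0, so the first-order prediction is -s.
    print(i, round(-s, 5), round(loo_change(xs, ys, 1.0, 2.0, 0.1, i), 5))
```

Each printed row shows the first-order prediction next to the exact leave-one-out change. In this toy case the two agree in sign and roughly in size, and the gap widens for points whose removal moves the parameter further, which is the usual caveat of a linearization.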

Overview

We turn the focus from the weight space to training data. We pose the question: how can we measure which training examples cause which model behaviors? After discussing the general role of data attribution for the purpose of alignment, we frame this as a technical question: how can we understand the counterfactual impact of perturbing, specifically reweighting, individual data points? We then develop three frameworks that each make the problem tractable by interpreting the map from data to trained model differently: influence functions (as an implicit function of data weights at a unique minimum), Bayesian influence functions (as a posterior distribution over parameters), and unrolling (as a concrete optimization trajectory). These turn out to be closely connected: influence functions emerge as a limiting case of both alternatives, and the degeneracy phenomena studied on the SLT day reappear in understanding where and why the classical theory breaks down.
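The three readings of the data-to-model map can be written side by side. The notation below is assumed here for illustration and may differ from the lecture notes: ℓ_i is the loss on training point i, w the vector of data weights, φ an observable such as a test loss, π a prior, β an inverse temperature, and η a learning rate. The unrolling line assumes plain full-batch gradient descent.

```latex
\begin{align*}
\text{IF:}\quad
  & \theta^*(w) = \arg\min_\theta \sum_j w_j\,\ell_j(\theta),
  && \frac{\partial\,\phi(\theta^*(w))}{\partial w_i}
     = -\nabla\phi(\theta^*)^\top H^{-1}\,\nabla\ell_i(\theta^*),
  \quad H = \sum_j w_j\,\nabla^2\ell_j(\theta^*) \\[4pt]
\text{BIF:}\quad
  & p_w(\theta) \propto \pi(\theta)\,\exp\!\Big(-\beta \sum_j w_j\,\ell_j(\theta)\Big),
  && \frac{\partial}{\partial w_i}\,\mathbb{E}_{p_w}[\phi]
     = -\beta\,\mathrm{Cov}_{p_w}\!\big(\phi,\ \ell_i\big) \\[4pt]
\text{Unrolling:}\quad
  & \theta_{t+1}(w) = \theta_t(w) - \eta \sum_j w_j\,\nabla\ell_j(\theta_t(w)),
  && \frac{\partial\theta_{t+1}}{\partial w_i}
     = (I - \eta H_t)\,\frac{\partial\theta_t}{\partial w_i} - \eta\,\nabla\ell_i(\theta_t),
  \quad H_t = \sum_j w_j\,\nabla^2\ell_j(\theta_t)
\end{align*}
```

The first line requires H to be invertible at a unique minimum, which is exactly the assumption that degenerate loss landscapes violate. The other two lines avoid inverting H, at the cost of sampling from a posterior or tracking a full training trajectory.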

Prerequisites

  • Having attended the SLT day beforehand is valuable.

  • The Mechinterp + Training dynamics day is useful to have, but not essential.

  • Technical knowledge: See 'Prerequisites' in the lecture notes.

Content

Fast track

Chapter 1 is a general introduction. Section 1.4 is important; the rest can be skipped. Then, depending on interest, read the first part of each subsequent chapter (IF, BIF, Unrolling) and/or dive deeper into the ones you find interesting.

Main content

Learn more

Do all of the exercises and/or look up their references.

Influence functions

On damping

Bayesian influence functions (& susceptibilities)

Unrolling