Iliad Intensive Curriculum

Overview

Any interpretation of a neural network is a lossy compression: it trades the model's full complexity for a shorter, human-understandable description. It is unclear how to handle the resulting approximation error formally. We discuss proofs as a rigorous but pessimistic approach that immediately yields quantifiable metrics for faithfulness and compactness. A concrete example shows that worst-case guarantees are too strict to scale and yield vacuous bounds even on simple models. This suggests that neither pure worst-case nor pure average-case evaluation is adequate on its own, opening the door to an intermediate approach. We conclude by connecting this to ARC's agenda of heuristic arguments.

Prerequisites

Linear algebra (SVD, eigendecomposition, matrix norms, singular values)
From the mechanistic interpretability module: circuits, ablations, SAEs; more generally, transformers.
Basic information theory and probability theory

Content

Fast track

Read the intro notebook and the ARC blog post.
Read the lecture slides.

Main content

1. Limitations of interpretability evaluation. Current interpretability methods are typically evaluated using faithfulness metrics based on ablations. These metrics turn out to be fragile.

Read: 'Transformer Circuit Faithfulness Metrics Are Not Robust'.
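To make the fragility in item 1 concrete, here is a minimal sketch on a toy one-hidden-layer ReLU network. It is not the setup of the paper above: the network, the "circuit" (four hidden units picked by a made-up importance score), and the faithfulness score (one minus normalized squared error) are all assumptions chosen for illustration. The only point is that the same circuit receives different scores depending on whether the remaining units are zero-ablated or mean-ablated.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, n = 8, 16, 1000

# Toy model (assumption): random one-hidden-layer ReLU network with scalar output.
W1 = rng.normal(size=(d_in, d_hidden))
b1 = rng.normal(size=d_hidden)
w2 = rng.normal(size=d_hidden)
X = rng.normal(size=(n, d_in))

H = np.maximum(X @ W1 + b1, 0.0)   # hidden activations, shape (n, d_hidden)
full_out = H @ w2                  # output of the full model

# Hypothetical "circuit": the 4 hidden units with the largest importance score.
importance = np.abs(w2) * H.mean(axis=0)
mask = np.zeros(d_hidden, dtype=bool)
mask[np.argsort(importance)[-4:]] = True


def ablated_output(H, mask, mode):
    """Keep circuit units; replace every other unit by zero or by its mean."""
    if mode == "zero":
        fill = np.zeros(H.shape[1])
    elif mode == "mean":
        fill = H.mean(axis=0)
    else:
        raise ValueError(f"unknown ablation mode: {mode}")
    return np.where(mask, H, fill) @ w2


def faithfulness(full, approx):
    """1 - normalized MSE; 1.0 means the circuit reproduces the full output."""
    return 1.0 - np.mean((full - approx) ** 2) / np.var(full)


scores = {
    mode: faithfulness(full_out, ablated_output(H, mask, mode))
    for mode in ("zero", "mean")
}
for mode, score in scores.items():
    print(f"{mode}-ablation faithfulness: {score:.3f}")
```

In this toy setting mean ablation can never score lower than zero ablation, because it removes only the variance of the ablated units' contribution rather than variance plus squared mean. The interpretation is unchanged; only the evaluation convention moved.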

2. Compact proofs framework. If average-case metrics are unreliable, can we do better? Proofs offer a rigorous alternative: formalize an interpretation as a compression of the model with quantifiable faithfulness and compactness, then prove guarantees. The compact proofs paper shows what happens when you try this on a toy model.

Read: the compact proofs paper (main text).
Work through the intro notebook.
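As a warm-up for the idea of "interpretation as compression with a provable guarantee", the sketch below uses the prerequisites only (SVD and matrix norms). It is an illustrative assumption, not the construction from the compact proofs paper: the "model" is a single random linear map, the "interpretation" is its rank-k truncation, faithfulness is the worst-case error over unit-norm inputs, and compactness is crudely proxied by parameter count. By the Eckart-Young theorem the worst-case error equals the (k+1)-th singular value, so the guarantee comes for free here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k = 20, 30, 5

# Toy "model" (assumption): one random linear map.
W = rng.normal(size=(d_out, d_in))

# Rank-k "interpretation": keep only the top-k singular directions.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_k = (U[:, :k] * s[:k]) @ Vt[:k]

# Proven worst-case error over unit-norm inputs:
# max_{||x|| = 1} ||(W - W_k) x|| = sigma_{k+1}, which is s[k] with 0-based indexing.
worst_case_bound = s[k]

# Average-case picture: errors on random unit-norm inputs.
X = rng.normal(size=(5000, d_in))
X /= np.linalg.norm(X, axis=1, keepdims=True)
errors = np.linalg.norm(X @ (W - W_k).T, axis=1)

# Crude compactness proxy (assumption): number of stored parameters.
params_full = W.size
params_compressed = k * (d_out + d_in)

print(f"worst-case bound : {worst_case_bound:.3f}")
print(f"max sampled error: {errors.max():.3f}")
print(f"mean sampled err : {errors.mean():.3f}")
print(f"parameters       : {params_compressed} vs {params_full}")
```

Sampled errors never exceed the bound, but random inputs rarely align with the worst direction, so the typical error sits below the guarantee. Increasing k tightens the bound at the cost of a longer description, which is the faithfulness versus compactness trade-off in miniature.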

3. From proofs to heuristic arguments. The vacuousness of worst-case bounds suggests we need something between full proofs and mere average-case evaluation. ARC's heuristic arguments agenda proposes such an intermediate approach.

Read: the ARC blog post mentioned above.
ARC's general agenda.
Recent progress: no-coincidence principle.
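A small sketch of why worst-case bounds tend to become loose as models get deeper. This is a generic Lipschitz argument, not the method of the compact proofs paper and not ARC's heuristic arguments: for a bias-free ReLU network, the product of the layers' spectral norms is a valid worst-case bound on how much the output can move per unit of input change. The network, its depth and width, and the perturbation scale are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width, n_pairs = 6, 32, 2000

# Toy model (assumption): bias-free ReLU network with random weights.
weights = [rng.normal(size=(width, width)) / np.sqrt(width) for _ in range(depth)]


def forward(x):
    """Apply each layer as x -> ReLU(x @ W); ReLU is 1-Lipschitz."""
    for W in weights:
        x = np.maximum(x @ W, 0.0)
    return x


# Worst-case bound: ||f(x) - f(y)|| <= prod_i ||W_i||_2 * ||x - y||.
lipschitz_bound = float(np.prod([np.linalg.norm(W, 2) for W in weights]))

# Average-case picture: observed sensitivity on random nearby input pairs.
x = rng.normal(size=(n_pairs, width))
y = x + 0.01 * rng.normal(size=(n_pairs, width))
ratios = (
    np.linalg.norm(forward(x) - forward(y), axis=1)
    / np.linalg.norm(x - y, axis=1)
)

print(f"worst-case Lipschitz bound: {lipschitz_bound:.2f}")
print(f"largest observed ratio    : {ratios.max():.2f}")
print(f"mean observed ratio       : {ratios.mean():.2f}")
```

The bound always holds, but it multiplies per-layer worst cases that rarely line up on any single input, so the gap to observed behavior typically widens with depth. That gap is the motivation for looking at estimates that sit between a full proof and plain sampling.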

Learn more

More examples of worst-case interpretability on toy models: learning group operations; modular addition.
Deeper look into ARC's work: Low probability estimation.
Critiques of ARC work: e.g. this series.

Worst-Case Interpretability