Iliad

Worst-Case Interpretability

Cluster E

The limits of average-case interpretability evaluation, compact proofs as an idealised alternative, and ARC's heuristic arguments agenda as a potentially tractable middle path.

By Louis Jaburi (EleutherAI)

What you’ll learn

  • Understand the difficulty of evaluating and validating interpretations.
  • Formalize interpretations as lossy compressions with quantifiable faithfulness and compactness, and identify the three sources of exponential blowup (error along paths, number of paths, input space coverage).
  • Articulate why worst-case bounds become vacuous even on simple models, and why this motivates an intermediate approach between worst-case and average-case evaluation.
  • Describe at a high level ARC's heuristic arguments agenda as a response to these limitations.

Overview

Any interpretation of a neural network is a lossy compression: it trades the model's full complexity for a shorter, human-understandable description. How to treat the resulting approximation error formally is unclear. We discuss proofs as a rigorous but pessimistic approach that immediately yields quantifiable metrics for faithfulness and compactness. A concrete example shows that worst-case bounds are too strict to scale and become vacuous even on simple models. This suggests that neither pure worst-case nor pure average-case evaluation is adequate on its own, opening the door to an intermediate approach. We conclude by connecting this to ARC's agenda of heuristic arguments.
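
One way to write down the contrast between average-case and worst-case evaluation is sketched below. The notation is an assumption made for illustration and is not taken from the source: f is the model, g the interpretation, X the input space, D an input distribution on X, and ℓ a non-negative measure of disagreement between the two.

```latex
% Assumed notation (not from the source): model f, interpretation g,
% input space \mathcal{X}, input distribution \mathcal{D} on \mathcal{X},
% and a discrepancy \ell \ge 0 between model and interpretation outputs.
\begin{align*}
\text{average-case error:}\quad
  \varepsilon_{\mathrm{avg}} &= \mathbb{E}_{x \sim \mathcal{D}}
    \bigl[\ell\bigl(f(x), g(x)\bigr)\bigr], \\
\text{worst-case error:}\quad
  \varepsilon_{\mathrm{wc}} &= \sup_{x \in \mathcal{X}}
    \ell\bigl(f(x), g(x)\bigr), \\
\text{so that}\quad
  \varepsilon_{\mathrm{avg}} &\le \varepsilon_{\mathrm{wc}}.
\end{align*}
```

Under these assumptions a worst-case guarantee implies the average-case one but not the other way around, which is why a proof is the more rigorous of the two and also the more pessimistic.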

Prerequisites

  • Linear algebra (SVD, eigendecomposition, matrix norms, singular values)
  • From the mechanistic interpretability module: circuits, ablations, and SAEs; more generally, familiarity with transformers
  • Basic information theory and probability theory

Content

Fast track

Main content

1. Limitations of interpretability evaluation. Current interpretability methods are typically evaluated using faithfulness metrics based on ablations. These metrics turn out to be fragile.

2. Compact proofs framework. If average-case metrics are unreliable, can we do better? Proofs offer a rigorous alternative: formalize an interpretation as a compression of the model with quantifiable faithfulness and compactness, then prove guarantees. The compact proofs paper shows what happens when you try this on a toy model.

3. From proofs to heuristic arguments. The vacuousness of worst-case bounds suggests we need something between full proofs and mere average-case evaluation. ARC's heuristic arguments agenda proposes such an intermediate approach.
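
To get a feel for how a worst-case bound can become vacuous, here is a minimal sketch. It is not the construction from the compact proofs paper or from ARC's work; it is a toy illustration under stated assumptions. It takes a small random ReLU network without biases, bounds how much the output can change when the input moves by at most `eps` (using the product of per-layer induced infinity-norms), and compares that provable bound with the largest change actually observed on random samples. All function names, the initialization scale, and the choice of norm are hypothetical choices made for this example.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def op_norm_inf(W):
    """Induced infinity-norm of W: the largest absolute row sum."""
    return max(sum(abs(w) for w in row) for row in W)

def forward(weights, x):
    """ReLU MLP without biases; the last layer is linear."""
    h = x
    for W in weights[:-1]:
        h = relu(matvec(W, h))
    return matvec(weights[-1], h)

def worst_case_bound(weights, eps):
    """Provable bound on the output change (infinity-norm) for any input
    change of infinity-norm at most eps. ReLU is 1-Lipschitz, so the
    per-layer operator norms simply multiply."""
    bound = eps
    for W in weights:
        bound *= op_norm_inf(W)
    return bound

def observed_max_change(weights, eps, width, rng, n_samples=200):
    """Largest output change seen over random inputs and perturbations."""
    worst = 0.0
    for _ in range(n_samples):
        x = [rng.uniform(-1.0, 1.0) for _ in range(width)]
        x_pert = [xi + rng.uniform(-eps, eps) for xi in x]
        y, y_pert = forward(weights, x), forward(weights, x_pert)
        worst = max(worst, max(abs(a - b) for a, b in zip(y, y_pert)))
    return worst

def compare(depths=(2, 4, 8), width=16, eps=0.01, seed=0):
    """Return (depth, worst-case bound, observed change) for each depth."""
    rng = random.Random(seed)
    scale = width ** -0.5  # assumed initialization scale, for illustration
    results = []
    for depth in depths:
        weights = [[[rng.gauss(0.0, scale) for _ in range(width)]
                    for _ in range(width)] for _ in range(depth)]
        results.append((depth,
                        worst_case_bound(weights, eps),
                        observed_max_change(weights, eps, width, rng)))
    return results

if __name__ == "__main__":
    for depth, bound, observed in compare():
        print(f"depth={depth}  bound={bound:.3g}  observed={observed:.3g}")
```

The bound is always valid, but it multiplies one pessimistic factor per layer, so it grows exponentially with depth while the observed change does not. The sampled maximum, in turn, certifies nothing about inputs that were not sampled. The gap between the two is the space an intermediate approach would need to occupy.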

Learn more