---
title: Worst-Case Interpretability
cluster: E
contributors:
  - Louis Jaburi (EleutherAI)
summary: The limits of average-case interpretability evaluation, compact proofs
  as an idealised alternative, and ARC's heuristic arguments agenda as a
  potentially tractable middle path.
learningOutcomes:
  - Understand the difficulty of evaluating and validating interpretations.
  - Formalize interpretations as lossy compressions with quantifiable
    faithfulness and compactness, and identify the three sources of exponential
    blowup (error along paths, number of paths, input space coverage).
  - Articulate why worst-case bounds become vacuous even on simple models, and
    why this motivates an intermediate approach between worst-case and
    average-case evaluation.
  - Describe at a high level ARC's heuristic arguments agenda as a response to
    these limitations.
---
## Overview

Any interpretation of a neural network is a lossy compression: it trades the model's full complexity for a shorter, human-understandable description. It is unclear how to account formally for the resulting approximation error. We discuss proofs as a rigorous but pessimistic approach that immediately yields quantifiable metrics for faithfulness and compactness. A concrete example shows that worst-case bounds are too pessimistic to scale: they become vacuous even on simple models. This suggests that neither pure worst-case nor pure average-case evaluation is adequate on its own, which motivates an intermediate approach. We conclude by connecting this to the heuristic arguments agenda of the Alignment Research Center (ARC).

## Prerequisites

-   Linear algebra (SVD, eigendecomposition, matrix norms, singular values)
    
-   From the mechanistic interpretability module: circuits, ablations, and SAEs. More generally, familiarity with transformers.
    
-   Basic information theory and probability theory
    

## Content

### Fast track

-   Read the [intro notebook](https://colab.research.google.com/github/LouisYRYJ/Proof_based_approach_tutorial/blob/master/proof_public.ipynb) and the [ARC blog post](https://www.lesswrong.com/posts/SyeQjjBoEC48MvnQC/formal-verification-heuristic-explanations-and-surprise).
    
-   Read the [lecture slides](/uploads/worst-case-interpretability/slides.pdf).
    

### Main content

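**1\. Limitations of interpretability evaluation.** Current interpretability methods are typically evaluated using faithfulness metrics based on ablations. These metrics turn out to be fragile: the score a circuit receives can depend heavily on methodological choices, such as which ablation is used.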

-   Read ['Transformer Circuit Faithfulness Metrics Are Not Robust'](https://arxiv.org/abs/2407.08734).
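
To make the fragility concrete, here is a minimal sketch on a hypothetical toy model (a single ReLU layer with random weights, not a model or metric taken from the paper). The "circuit", the variance-recovered score, and all names are assumptions chosen for illustration. The only point is that the same circuit receives different faithfulness scores depending on how the units outside it are ablated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: one hidden ReLU layer with 8 units and a scalar output.
n_inputs, n_hidden = 4, 8
W_in = rng.normal(size=(n_inputs, n_hidden))
b_in = rng.normal(size=n_hidden)
w_out = rng.normal(size=n_hidden)

X = rng.normal(size=(1000, n_inputs))
H = np.maximum(X @ W_in + b_in, 0.0)  # hidden activations, shape (1000, 8)
full_output = H @ w_out

# Hypothetical "circuit": the 4 hidden units with the largest output weights.
circuit = np.argsort(-np.abs(w_out))[:4]
outside = np.setdiff1d(np.arange(n_hidden), circuit)


def ablated_output(method):
    """Run the model with every unit outside the circuit ablated."""
    H_abl = H.copy()
    if method == "zero":
        H_abl[:, outside] = 0.0
    elif method == "mean":
        H_abl[:, outside] = H[:, outside].mean(axis=0)
    elif method == "resample":
        # Replace each activation with the one from a randomly chosen other input.
        perm = rng.permutation(len(H))
        H_abl[:, outside] = H[perm][:, outside]
    else:
        raise ValueError(f"unknown ablation method: {method}")
    return H_abl @ w_out


def faithfulness(method):
    """Fraction of output variance recovered by the circuit (1.0 = perfect)."""
    residual = full_output - ablated_output(method)
    return 1.0 - np.mean(residual**2) / np.var(full_output)


scores = {m: faithfulness(m) for m in ("zero", "mean", "resample")}
for method, score in scores.items():
    print(f"{method:>8}: {score:.3f}")
```

Under this particular metric, mean ablation can never score lower than zero ablation, because zero ablation additionally pays for the shift in the mean activation. Nothing about the circuit itself changed between the three rows.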
    

**2\. Compact proofs framework.** If average-case metrics are unreliable, can we do better? Proofs offer a rigorous alternative: formalize an interpretation as a compression of the model with quantifiable faithfulness and compactness, then prove guarantees about it. The compact proofs paper shows what happens when you try this on a toy model.

-   Read: [Compact proofs paper](https://arxiv.org/abs/2406.11779), the main paper.
    
-   Work through the [intro notebook](https://colab.research.google.com/github/LouisYRYJ/Proof_based_approach_tutorial/blob/master/proof_public.ipynb).
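
As a rough illustration of how error accumulating along a path loosens a worst-case bound, the sketch below compresses a random ReLU network by truncating each weight matrix to low rank, then compares a provable bound on the output error with the largest error actually observed on sampled inputs. This is a generic Lipschitz-style argument under assumed settings (random Gaussian weights, no biases, hypothetical `width`, `depth`, and `rank`), not the construction used in the compact proofs paper or the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth, rank = 32, 8, 24  # hypothetical sizes, chosen for illustration


def truncate(W, rank):
    """Best rank-`rank` approximation of W via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]


weights = [rng.normal(size=(width, width)) / np.sqrt(width) for _ in range(depth)]
compressed = [truncate(W, rank) for W in weights]


def forward(layers, x):
    """Bias-free ReLU network acting on row vectors."""
    for W in layers:
        x = np.maximum(x @ W, 0.0)
    return x


def worst_case_bounds(weights, compressed):
    """Bound on ||f(x) - f'(x)|| over all ||x|| <= 1, after each layer.

    Uses that ReLU is 1-Lipschitz, so per layer:
    err_k <= ||W_k - W'_k|| * ||a_{k-1}|| + ||W'_k|| * err_{k-1}.
    """
    act_norm, err, bounds = 1.0, 0.0, []
    for W, W_c in zip(weights, compressed):
        err = np.linalg.norm(W - W_c, ord=2) * act_norm + np.linalg.norm(W_c, ord=2) * err
        act_norm = np.linalg.norm(W, ord=2) * act_norm
        bounds.append(err)
    return bounds


# Sample unit-norm inputs and record the largest error seen at each depth.
X = rng.normal(size=(2000, width))
X /= np.linalg.norm(X, axis=1, keepdims=True)

bounds = worst_case_bounds(weights, compressed)
measured = [
    np.linalg.norm(forward(weights[:k], X) - forward(compressed[:k], X), axis=1).max()
    for k in range(1, depth + 1)
]

for k, (b, m) in enumerate(zip(bounds, measured), start=1):
    print(f"depth {k}: bound {b:9.3f}   max measured {m:.4f}   ratio {b / m:9.1f}")
```

The bound is always valid, but it compounds the spectral norm of every layer, while the measured error does not. The ratio between the two therefore tends to grow with depth, which is the kind of looseness that eventually makes a worst-case guarantee uninformative.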
    

**3\. From proofs to heuristic arguments.** The vacuousness of worst-case bounds suggests we need something between full proofs and mere average-case evaluation. ARC's heuristic arguments agenda proposes such an intermediate approach.

-   Read the [ARC blog post](https://www.lesswrong.com/posts/SyeQjjBoEC48MvnQC/formal-verification-heuristic-explanations-and-surprise) mentioned above.
    
-   Read the overview of [ARC's general agenda](https://www.lesswrong.com/posts/ztokaf9harKTmRcn4/a-bird-s-eye-view-of-arc-s-research).
    
-   Recent progress: [no-coincidence principle](https://www.lesswrong.com/posts/Xt9r4SNNuYxW83tmo/a-computational-no-coincidence-principle).
    

### Learn more

-   More examples of worst-case interpretability on toy models: [Learning group operations](https://arxiv.org/abs/2410.07476); [modular addition](https://arxiv.org/abs/2412.03773).
    
-   Deeper look into ARC's work: [Low probability estimation](https://arxiv.org/abs/2410.13211).
    
-   Critiques of ARC's work: e.g. [this series](https://www.lesswrong.com/s/uYMw689vDFmgPEHrS).