Iliad Intensive Curriculum

Overview

We look at the science of reverse-engineering the learned structures of a neural network. Tracing the history of mechanistic interpretability, we examine the methods used to identify features and circuits, the ontology and assumptions underlying those methods, and the main findings of the field so far. We start with early work on CNNs and then turn to transformers, covering three areas: linear probes, steering vectors, and directional ablations; superposition and sparse autoencoders (SAEs), including their compressed-sensing motivation, evaluation, and known failure modes such as feature absorption and splitting; and circuit discovery via logit attribution, the logit lens, path patching, ACDC, attribution graphs, and causal scrubbing. Throughout, we include hands-on exercises on feature visualization, the logit lens, induction heads, SAE training, and exploring Neuronpedia. The day closes with a group discussion of major critiques of the field.

Prerequisites

Basics of machine learning
Familiarity with the transformer architecture
Basic maths and linear algebra

Content

Fast track

Go through the slides below.

Main content

Slides

Exercises:

Reading:

Learn more

Selected readings

Motivation and Foundations:

Goal of mechanistic interpretability and its relevance for AI safety
Ontology of mechanistic interpretability: features and circuits as the building blocks of neural network understanding (Olah et al., 2020)

Historical MechInterp:

Feature visualization and saliency maps in CNNs
Circuits in CNNs: how features compose into higher-level detectors (Cammarata et al., 2020)

MechInterp in Transformers – What is a feature?

Linear directions in activation space encode interpretable meanings (Mikolov et al., 2013)
Formalizing and testing the linear representation hypothesis in LLMs (Park et al., 2024)
Linear probes can read off internal representations, e.g. world models in Othello-GPT (Nanda et al., 2023)
Steering vectors can causally influence model behavior (Panickssery et al., 2024)
Ablating a single direction can remove behaviors like refusal (Arditi et al., 2024)
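
To make the last two items concrete, here is a minimal NumPy sketch on toy data. It assumes a direction computed as a difference of mean activations between two contrastive prompt sets, which is the style of estimate used in the steering and refusal papers cited above; the function names `ablate_direction` and `steer`, the array shapes, and the random data are all hypothetical stand-ins, not code from those papers.

```python
import numpy as np

def ablate_direction(acts, direction):
    """Project out the component of each activation along `direction`."""
    r = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ r, r)

def steer(acts, direction, alpha):
    """Add `alpha` times the unit direction to every activation."""
    r = direction / np.linalg.norm(direction)
    return acts + alpha * r

rng = np.random.default_rng(0)
d_model = 8  # toy hidden size (assumption)

# Toy stand-ins for residual-stream activations on two contrastive prompt sets.
pos_acts = rng.normal(size=(20, d_model)) + 1.0
neg_acts = rng.normal(size=(20, d_model))

# Difference-of-means direction between the two sets.
direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

acts = rng.normal(size=(4, d_model))  # activations at 4 token positions
ablated = ablate_direction(acts, direction)
steered = steer(acts, direction, alpha=3.0)
```

In a real model these operations would be applied inside forward hooks at a chosen layer; here they act on plain arrays so that the linear algebra is visible.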

MechInterp in Transformers – Superposition and SAEs:

Toy models show features are packed into fewer dimensions via superposition (Elhage et al., 2022)
Mathematical background: compressed sensing motivates why sparse recovery works (Candes et al., 2004; Donoho, 2006)
Sparse autoencoders decompose superposed representations into interpretable features (Bricken et al., 2023)
Evaluating SAE quality across architectures and metrics (Karvonen et al., 2025)
Crosscoders: SAE variants that read/write across layers, enabling model diffing (Lindsey et al., 2024)
Feature geometry, absorption, and splitting: failure modes of SAE feature learning (Chanin et al., 2024)
Activation oracles: training LLMs to answer natural-language questions about model activations (Karvonen et al., 2025)
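
As a rough companion to the SAE readings, the following sketch shows the forward pass and loss of a basic ReLU sparse autoencoder with an L1 sparsity penalty. It is a simplified illustration under stated assumptions: the dimensions, initialization scale, `l1_coeff` value, and function names are hypothetical, and training (gradient updates) is omitted.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct."""
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # feature activations
    x_hat = f @ W_dec + b_dec                         # reconstruction
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff):
    """Mean squared reconstruction error plus an L1 sparsity penalty."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # dictionary wider than the activation space (assumption)

x = rng.normal(size=(32, d_model))  # toy batch of model activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm feature directions
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, l1_coeff=1e-3)
```

Each row of `W_dec` plays the role of one candidate feature direction; the L1 term pushes most entries of `f` toward zero so that each input is explained by few features.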

Circuit Discovery:

Success story: fully reverse-engineering a small transformer that groks modular addition (Nanda et al., 2023)
Logit attribution: decomposing outputs by component, discovering induction heads (Elhage et al., 2021)
The logit lens: projecting intermediate residual stream states through the unembedding (nostalgebraist, 2020)
Path patching: tracing causal pathways, discovering the IOI circuit (Wang et al., 2022)
ACDC: automated circuit discovery via activation patching (Conmy et al., 2023)
Attribution graphs: scalable circuit tracing revealing circuits such as planning ahead for rhymes (Lindsey et al., 2025)
Causal scrubbing: a rigorous framework for validating circuit explanations (Chan et al., 2022)
Is circuit discovery NP-hard? Computational complexity of finding circuits (Adolfi et al., 2025)
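
Two of the simpler techniques in this list, the logit lens and direct logit attribution, can be sketched in a few lines of NumPy. This is a toy illustration under simplifying assumptions: the residual stream is modeled as a plain sum of component outputs, the final layer norm is omitted, and all names, shapes, and random values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_components, d_model, vocab = 5, 16, 50  # toy sizes (assumption)

# Each row is what one component (embedding, attention head, MLP, ...)
# wrote into the residual stream at the final token position.
component_outputs = rng.normal(size=(n_components, d_model))
W_U = rng.normal(size=(d_model, vocab))  # unembedding matrix

# Logit lens: unembed the running residual stream after each component.
# (Real models apply a final layer norm first; it is omitted here.)
resid_after = np.cumsum(component_outputs, axis=0)  # (n_components, d_model)
lens_logits = resid_after @ W_U                     # (n_components, vocab)
lens_top_token = lens_logits.argmax(axis=-1)        # top token at each depth

# Direct logit attribution: how much each component contributes
# to the logit of the token the model finally predicts.
target = int(lens_top_token[-1])
contrib = component_outputs @ W_U[:, target]        # (n_components,)
```

Because the unembedding is linear, the per-component contributions in `contrib` sum to the final logit of the target token, which is what makes this decomposition meaningful.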

Outlook:

Real-world use: the Claude Opus 4.5 system card reports using interpretability to investigate deceptive behavior (Anthropic, 2025)
Open problems in mechanistic interpretability (Sharkey et al., 2025)

Critiques of MechInterp:

Pivoting from ambitious reverse-engineering to pragmatic, tool-oriented interp (Nanda et al., 2025)
Against almost every theory of impact of interpretability (de Segerie, 2023)
The misguided quest for mechanistic AI interpretability (Hendrycks, 2025)
Activation space interpretability may be doomed: fundamental limits of activation-based methods (Chughtai, 2025)
