Mechanistic Interpretability
Mechanistic interpretability for neural networks — features and circuits, methods for discovery, the frontier of knowledge, and critiques of the field.
By Julian Schulz (Meridian Research)
What you’ll learn
- Understand the goals of mechanistic interpretability
- Understand what is meant by feature and circuit
- Have an overview of currently used methods for feature and circuit discovery
- Have an overview of the current capabilities and frontier of knowledge in mechanistic interpretability
- Engage with and understand common critiques of mechanistic interpretability
- Implement common mechinterp methods
- Know when to apply what method
- Read and fully understand new mechanistic interpretability publications
- Be a useful discussion partner to mechanistic interpretability researchers
Overview
We look at the science of reverse engineering the structures a neural network has learned. Following the history of mechanistic interpretability, we examine the methods used to identify features and circuits and the ontology and assumptions underlying them, and we sample the field's main findings so far. We start with early work on CNNs and then turn to transformers, covering three areas: linear probes, steering vectors, and directional ablations; superposition and sparse autoencoders, including their compressed sensing motivation, evaluation, and known failure modes such as feature absorption and splitting; and circuit discovery via logit attribution, the logit lens, path patching, ACDC, attribution graphs, and causal scrubbing. Throughout, we include hands-on exercises on feature visualization, the logit lens, induction heads, SAE training, and exploring Neuronpedia. The day closes with a group discussion of major critiques of the field.
Prerequisites
- Knowing the basics of Machine Learning
- Familiarity with the Transformer Architecture
- Basic maths / linear algebra
Content
Fast track
Go through the slides below.
Main content
Exercises:
Reading:
- A Pragmatic Vision for Interpretability (Nanda et al., 2025) ~15 min
- Against Almost Every Theory of Impact of Interpretability (Segerie, 2023) ~20 min
- The Misguided Quest for Mechanistic AI Interpretability (Hendrycks, 2025) ~20 min
- Activation Space Interpretability May Be Doomed (Chughtai & Bushnaq, 2025) ~15 min
Learn more
Selected readings
Motivation and Foundations:
- Goal of mechanistic interpretability and its relevance for AI safety
- AI safety ontology: features and circuits as the building blocks of neural network understanding (Olah et al., 2020)
Historical MechInterp:
- Feature visualization and saliency maps in CNNs
- Circuits in CNNs: how features compose into higher-level detectors (Cammarata et al., 2020)
MechInterp in Transformers – What is a feature?
- Linear directions in activation space encode interpretable meanings (Mikolov et al., 2013)
- The linear representation hypothesis holds in LLMs (Park et al., 2024)
- Linear probes can read off internal representations, e.g. world models in Othello-GPT (Nanda et al., 2023)
- Steering vectors can causally influence model behavior (Panickssery et al., 2024)
- Ablation of single directions can remove capabilities like refusal (Arditi et al., 2024)
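The three operations in the list above (reading a direction with a linear probe, steering along it, and ablating it) can be sketched on a single activation vector. This is a minimal illustration in plain Python, not code from any of the cited papers: the 3-dimensional `activation`, the `direction`, and the steering strength `alpha` are made-up values, and in practice the direction would be estimated from model activations (for example as a difference of means between two prompt sets) and applied with a tensor library.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def ablate_direction(x, direction):
    """Directional ablation: x - (x . r_hat) r_hat removes the component along r_hat."""
    r_hat = normalize(direction)
    coeff = dot(x, r_hat)
    return [a - coeff * r for a, r in zip(x, r_hat)]

def steer(x, direction, alpha):
    """Steering: add alpha times the unit direction to the activation."""
    r_hat = normalize(direction)
    return [a + alpha * r for a, r in zip(x, r_hat)]

# Hypothetical values chosen for illustration only.
activation = [2.0, 1.0, -1.0]
direction = [1.0, 1.0, 0.0]

# A linear probe without bias reads off the projection onto the direction.
probe_readout = dot(activation, normalize(direction))

ablated = ablate_direction(activation, direction)
steered = steer(activation, direction, alpha=4.0)

print("probe readout:", probe_readout)
print("ablated:", ablated)
print("steered:", steered)
```

After ablation the probe readout along `direction` is zero, while steering shifts the readout by exactly `alpha`; the components orthogonal to the direction are untouched in both cases.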
MechInterp in Transformers – Superposition and SAEs:
- Toy models show features are packed into fewer dimensions via superposition (Elhage et al., 2022)
- Mathematical background: compressed sensing motivates why sparse recovery works (Candes et al., 2004; Donoho, 2006)
- Sparse autoencoders decompose superposed representations into interpretable features (Bricken et al., 2023)
- Evaluating SAE quality across architectures and metrics (Karvonen et al., 2025)
- Crosscoders: SAE variants that read/write across layers, enabling model diffing (Lindsey et al., 2024)
- Feature geometry, absorption, and splitting: failure modes of SAE feature learning (Chanin et al., 2024)
- Activation oracles: using LLMs to automatically label and explain SAE features (Karvonen et al., 2025)
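To make the sparse autoencoder idea above concrete, here is a minimal sketch of one common formulation: a ReLU encoder, a linear decoder, and a loss that combines reconstruction error with an L1 sparsity penalty. The weights are hand-picked toy values (three features in a two-dimensional activation space, so more features than dimensions), and `l1_coeff` is an arbitrary choice. Real SAEs learn these weights by gradient descent and often add details not shown here, such as decoder norm constraints.

```python
def relu(v):
    """Elementwise max(0, x)."""
    return [max(0.0, a) for a in v]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x into sparse feature activations, then reconstruct it."""
    centered = [a - b for a, b in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(W_enc, centered), b_enc)]
    features = relu(pre)
    recon = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff):
    """Squared reconstruction error plus an L1 penalty on the feature activations."""
    mse = sum((a - r) ** 2 for a, r in zip(x, recon))
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1

# Toy dictionary (assumed, not learned): 3 features for a 2-dimensional activation.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # shape (n_features, d_model)
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]     # shape (d_model, n_features)
b_dec = [0.0, 0.0]

x = [2.0, 0.0]
features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon, l1_coeff=0.1)

print("features:", features)
print("reconstruction:", recon)
print("loss:", loss)
```

Only the first feature fires for this input, the reconstruction is exact, and the remaining loss comes entirely from the sparsity penalty.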
Circuit Discovery:
- Success story: fully reverse-engineering a grokking model doing modular arithmetic (Nanda et al., 2023)
- Logit attribution: decomposing outputs by component, discovering induction heads (Elhage et al., 2021)
- The logit lens: projecting intermediate residual stream states through the unembedding (nostalgebraist, 2020)
- Path patching: tracing causal pathways, discovering the IOI circuit (Wang et al., 2022)
- ACDC: automated circuit discovery via activation patching (Conmy et al., 2023)
- Attribution graphs: scalable circuit tracing revealing circuits like rhyming (Lindsey et al., 2025)
- Causal scrubbing: a rigorous framework for validating circuit explanations (Chan et al., 2022)
- Is circuit discovery NP-hard? Computational complexity of finding circuits (Adolfi et al., 2025)
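As an illustration of the logit lens entry above, the sketch below projects a residual stream state from each layer through an unembedding and reports the most likely token at that depth. Everything here is a toy assumption: the three-token vocabulary, the 2-dimensional residual states, and the `unembed` vectors are invented, and the final layer norm that is usually applied before unembedding in a real transformer is omitted for brevity.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(residual_states, unembed, vocab):
    """For each layer's residual state, return (top token, its probability)."""
    readout = []
    for h in residual_states:
        # One logit per vocabulary token: inner product with that token's unembedding vector.
        logits = [sum(a * w for a, w in zip(h, token_vec)) for token_vec in unembed]
        probs = softmax(logits)
        best = max(range(len(vocab)), key=lambda i: probs[i])
        readout.append((vocab[best], probs[best]))
    return readout

# Hypothetical toy model: 3 tokens, d_model = 2, residual stream after 3 layers.
vocab = ["cat", "dog", "fish"]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # one vector per token
residual_states = [[0.2, 0.1], [0.5, 1.0], [0.0, 3.0]]

readout = logit_lens(residual_states, unembed, vocab)
for layer, (token, prob) in enumerate(readout):
    print(f"layer {layer}: top token = {token}, p = {prob:.3f}")
```

In this toy example the prediction switches from "cat" to "dog" after the first layer and then grows more confident, which is the kind of layer-by-layer refinement the logit lens is meant to expose.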
Outlook:
- Real-world use: Claude Opus 4.5 system card used interp to investigate deceptive behavior (Anthropic, 2025)
- Open problems in mechanistic interpretability (Sharkey et al., 2025)
Critiques of MechInterp:
- Pivoting from ambitious reverse-engineering to pragmatic, tool-oriented interp (Nanda et al., 2025)
- Against almost every theory of impact of interpretability (Segerie, 2023)
- The misguided quest for mechanistic AI interpretability (Hendrycks, 2025)
- Activation space interpretability may be doomed: fundamental limits of activation-based methods (Chughtai & Bushnaq, 2025)