Computational Mechanics
Computational mechanics for AI safety — hidden Markov models, generalised HMMs, belief states, mixed-state presentations, and evidence that transformers represent belief geometry in their residual streams.
By Xavier Poncini (Simplex)
What you’ll learn
- Construct your own hidden Markov models.
- Explain the motivation for introducing generalised hidden Markov models (GHMMs).
- Explain why the belief state is a useful object for making predictions about GHMM data.
- Compute the belief state of a GHMM.
- Explain the difference between the belief state of an HMM and the belief state of a GHMM that is not itself an HMM.
- Explain the relationship between belief states and next-token observation probabilities for GHMMs.
- Compute the mixed state presentation (MSP) for GHMMs.
- Explain why generating data is easier than predicting data.
- Explain the evidence for the claim that transformers represent belief state geometry in their residual stream.
- Develop hypotheses about where transformers represent belief states, observation probabilities, and log observation probabilities.
Overview
Suppose a neural network is performing near-optimal prediction of a fixed stochastic process: what internal representations should we expect it to maintain? Computational mechanics addresses this question by identifying convergent structures of optimal prediction. In this module, we develop this framework for processes admitting a generalised hidden Markov model (GHMM) realisation. Starting from hidden Markov models (HMMs), we motivate the definition of GHMMs through considerations of minimality and uniqueness. We then develop two complementary perspectives on Bayesian inference over HMM emissions: a geometric perspective through belief states, and an algorithmic perspective through the mixed state presentation (MSP) — both of which naturally extend to GHMMs. Participants will design their own processes and explore the empirical evidence that transformers trained on GHMM data learn belief-state geometry in their residual streams, before developing mechanistic hypotheses about how this geometry is constructed and used.
This module was compiled by Xavier Poncini, with Adam Shai and Paul Riechers participating in a Q&A.
Prerequisites
Module on Mechanistic Interpretability. Familiarity with the goals and basic methodology of mechanistic interpretability, including the notions of features and circuits; the linear representation hypothesis; linear probes as a method for reading off internal representations; and a basic understanding of the transformer architecture. Students should be comfortable with the idea that one can train a linear map from model activations to some target structure and evaluate its quality (e.g. via MSE or R^2).
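As a refresher on the last point, here is a minimal sketch of fitting and scoring a linear probe. Everything in it is an illustrative assumption rather than the module's own setup: the "activations" are synthetic (a random linear image of the targets plus noise), and the names `fit_linear_probe` and `probe_scores` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): targets are points on a 3-state simplex,
# "activations" are a noisy random linear embedding of them in 16 dimensions.
n, d_model, n_states = 500, 16, 3
targets = rng.dirichlet(np.ones(n_states), size=n)
W_true = rng.normal(size=(n_states, d_model))
activations = targets @ W_true + 0.01 * rng.normal(size=(n, d_model))

def fit_linear_probe(acts, tgts):
    """Least-squares affine map from activations to targets."""
    X = np.hstack([acts, np.ones((len(acts), 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(X, tgts, rcond=None)
    return coef

def probe_scores(acts, tgts, coef):
    """Return (MSE, R^2) of the probe's predictions on the given data."""
    X = np.hstack([acts, np.ones((len(acts), 1))])
    pred = X @ coef
    mse = np.mean((pred - tgts) ** 2)
    ss_res = np.sum((tgts - pred) ** 2)
    ss_tot = np.sum((tgts - tgts.mean(axis=0)) ** 2)
    return mse, 1.0 - ss_res / ss_tot

# Fit on the first 400 samples, score on the held-out 100.
coef = fit_linear_probe(activations[:400], targets[:400])
mse, r2 = probe_scores(activations[400:], targets[400:], coef)
print(f"held-out MSE = {mse:.2e}, R^2 = {r2:.4f}")
```

Scoring on held-out data, as here, guards against a probe that merely memorises the training set.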
Mathematical background. The module is mathematically self-contained — all formal definitions (HMMs, GHMMs, belief states, MSPs) are introduced from scratch. However, students will engage with the material more fluently if they are comfortable with the following:
- Linear algebra: row-stochastic matrices, rank and invertibility, null spaces (kernels).
- Probability: conditional probability, Bayes' rule, probability distributions over finite sets, the probability simplex.
- Basic machine learning: next-token prediction, loss functions (cross-entropy), the concept of model activations.
Content
Fast track
Read 'Transformers represent belief state geometry in their residual stream'. Focus on being able to answer the following questions:
- What is a hidden Markov model (HMM)?
- What is a belief state?
- What is the mixed state presentation (MSP)?
- About the map learnt from activations to belief states: What is the source? What is the target? How is the quality of the map evaluated?
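To make the first two questions concrete, the sketch below defines a small HMM and updates a belief state by Bayes' rule as symbols are observed. The two-state process, the convention `T[x][i, j] = P(emit x, move to j | state i)`, and the helper names `stationary` and `update_belief` are assumptions chosen for illustration, not taken from the reading.

```python
import numpy as np

# Hypothetical two-state, two-symbol HMM (an assumption for illustration).
# T[x][i, j] = P(emit symbol x and move to state j | current state i).
T = np.array([
    [[0.5, 0.0], [0.0, 0.0]],  # symbol 0
    [[0.0, 0.5], [1.0, 0.0]],  # symbol 1
])

def stationary(T):
    """Stationary distribution of the state-to-state matrix sum_x T[x]."""
    M = T.sum(axis=0)  # row-stochastic transition matrix
    evals, evecs = np.linalg.eig(M.T)
    v = np.real(evecs[:, np.argmax(np.real(evals))])  # eigenvalue-1 vector
    return v / v.sum()  # normalising also fixes the sign

def update_belief(belief, x, T):
    """One Bayesian update: returns (new belief, probability of seeing x)."""
    unnorm = belief @ T[x]
    p = unnorm.sum()  # P(x | belief)
    if p == 0:
        raise ValueError(f"symbol {x} is impossible under this belief")
    return unnorm / p, p

prior = stationary(T)  # belief before any observation
belief = prior
for x in [1, 1, 0]:
    belief, p = update_belief(belief, x, T)
    print(f"saw {x} (prob {p:.3f}) -> belief {np.round(belief, 3)}")
```

The belief is a point on the probability simplex over hidden states, and the normalising constant `p` is exactly the next-token probability of the observed symbol.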
Main content
[Lecture slides] Overview and scope
- What is computational mechanics?
- Computational mechanics & AI safety
- Where is the research program at?
- What will we cover today?
Computational mechanics foundations
- [Lecture slides & Exercise Sheet] Hidden Markov Models (HMMs)
  - What is an HMM?
  - Notions of minimality & uniqueness
  - What is a generalised HMM (GHMM)? (see e.g. 'Neural networks leverage nominally quantum and post-quantum representations')
- [Lecture slides & Exercise Sheet] Prediction
  - Belief states
  - Next-token (probability) vectors
  - Mixed state presentation (MSP) (see e.g. 'Mixed States of Hidden Markov Processes and Their Presentations: What and how to calculate')
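As a rough illustration of the prediction topics, the following sketch enumerates the mixed states of a small HMM by repeatedly applying the Bayesian belief update from the prior. The two-state process, the convention `T[x][i, j] = P(emit x, move to j | state i)`, and the function name `mixed_state_presentation` are assumptions for illustration. Many processes have infinitely many mixed states, so the hypothetical `max_states` cap guards against non-termination.

```python
import numpy as np
from collections import deque

# Hypothetical two-state, two-symbol HMM (an assumption for illustration).
# T[x][i, j] = P(emit symbol x and move to state j | current state i).
T = np.array([
    [[0.5, 0.0], [0.0, 0.0]],  # symbol 0
    [[0.0, 0.5], [1.0, 0.0]],  # symbol 1
])
prior = np.array([2 / 3, 1 / 3])  # stationary distribution of sum_x T[x]

def mixed_state_presentation(T, prior, max_states=1000, decimals=8):
    """Breadth-first enumeration of the belief states reachable from the prior.

    Returns (states, edges), where edges[(i, x)] = (P(x | state i), next index).
    """
    key = lambda b: tuple(np.round(b, decimals))  # merge numerically equal beliefs
    states, index, edges = [prior], {key(prior): 0}, {}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for x in range(T.shape[0]):
            unnorm = states[i] @ T[x]
            p = unnorm.sum()  # next-token probability of x
            if p <= 1e-12:
                continue  # symbol x cannot occur from this belief
            nxt = unnorm / p
            k = key(nxt)
            if k not in index:
                if len(states) >= max_states:
                    raise RuntimeError("state cap reached; the MSP may be infinite")
                index[k] = len(states)
                states.append(nxt)
                queue.append(index[k])
            edges[(i, x)] = (p, index[k])
    return np.array(states), edges

states, edges = mixed_state_presentation(T, prior)
for (i, x), (p, j) in sorted(edges.items()):
    print(f"belief {i} --{x} (p={p:.3f})--> belief {j}")
```

Each edge carries a next-token probability, so the same structure yields both the belief states and the next-token vectors attached to them.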
Applications
- [Tutorial Sheet] Designing processes worth studying
  - Heuristic ensuring beliefs are meaningfully richer than observation probabilities
  - Non-ergodic processes
  - Factors (see e.g. 'Transformers learn factored representations')
- [Reading guide] Identifying belief geometry in transformers
  - For the case of HMMs: read 'Transformers represent belief state geometry in their residual stream'.
  - [Extension] For the case of GHMMs: read 'Neural networks leverage nominally quantum and post-quantum representations'.
- [Tutorial Sheet] How do transformers do it?
  - How do transformers construct representations of belief states? For a partial answer in a narrow setting, read 'Constrained belief updates explain geometric structures in transformer representations'.
  - How do they decode from belief states to log observation probabilities, and then to observation probabilities? This is an open problem!
Learn more
- Beyond toy architectures: in-context learning. LLMs have an amazing ability to adapt internal representations in-context (read 'In-Context Learning of Representations'). LLMs can also predict HMM data in-context (read 'Pre-trained Large Language Models Learn Hidden Markov Models In-context'). This suggests that internal representations of LLMs predicting GHMM data in-context may resemble belief states.
- Processes consisting of GHMMs with richer structure: 'Transformers learn factored representations'.
- Review article on computational mechanics: 'Between order and chaos'.
- Simplex is a non-profit research organisation working on applying computational mechanics to AI safety.