Iliad

Computational Mechanics

Cluster C

Computational mechanics for AI safety — hidden Markov models, generalised HMMs, belief states, mixed-state presentations, and evidence that transformers represent belief geometry in their residual streams.

By Xavier Poncini (Simplex)

What you’ll learn

  • Construct your own hidden Markov models (HMMs).
  • Explain the motivation for introducing generalised hidden Markov models (GHMMs).
  • Explain why the belief state is a useful object for making predictions about GHMM data.
  • Compute the belief state of a GHMM.
  • Explain the difference between the belief state of an HMM and the belief state of a GHMM that is not itself an HMM.
  • Explain the relationship between belief states and next-token observation probabilities for GHMMs.
  • Compute the mixed state presentation (MSP) for GHMMs.
  • Explain why generating data is easier than predicting data.
  • Explain the evidence for the claim that transformers represent belief state geometry in their residual stream.
  • Develop hypotheses for where transformers represent belief states, observation probabilities, and log observation probabilities.
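
To make the belief-state outcomes above concrete, here is a minimal sketch for the ordinary HMM case only. The two-state, two-symbol process, the array layout `T[x][i, j]`, and the function names are illustrative assumptions, not taken from the module. It shows the Bayesian update of a belief state after one observation and how next-token probabilities are read off a belief.

```python
import numpy as np

# Hypothetical 2-state, 2-symbol HMM, chosen for illustration only.
# T[x][i, j] = P(emit symbol x and move to state j | current state i).
T = np.array([
    [[0.5, 0.0],
     [0.0, 0.0]],  # symbol 0: emitted only from state 0, which then stays put
    [[0.0, 0.5],
     [1.0, 0.0]],  # symbol 1: state 0 -> state 1, or state 1 -> state 0
])

def stationary_distribution(T):
    """Left eigenvector (eigenvalue 1) of the summed transition matrix, normalised."""
    M = T.sum(axis=0)  # row-stochastic matrix over hidden states
    vals, vecs = np.linalg.eig(M.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

def next_token_probs(eta, T):
    """P(x | belief eta) = sum over i, j of eta[i] * T[x][i, j], for each symbol x."""
    return np.einsum("i,xij->x", eta, T)

def update_belief(eta, x, T):
    """Bayes update: new belief is eta @ T[x], renormalised to sum to one."""
    unnorm = eta @ T[x]
    p = unnorm.sum()
    if p == 0:
        raise ValueError("symbol has zero probability under this belief")
    return unnorm / p

eta0 = stationary_distribution(T)   # belief before seeing any data
probs0 = next_token_probs(eta0, T)  # predicted distribution of the first symbol
eta1 = update_belief(eta0, 0, T)    # belief after observing symbol 0
```

In this toy process, observing symbol 0 pins the hidden state down exactly, so `eta1` sits on a vertex of the probability simplex.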

Overview

Suppose a neural network is performing near-optimal prediction of a fixed stochastic process: what internal representations should we expect it to maintain? Computational mechanics addresses this question by identifying convergent structures of optimal prediction. In this module, we develop this framework for processes admitting a generalised hidden Markov model (GHMM) realisation. Starting from hidden Markov models (HMMs), we motivate the definition of GHMMs through considerations of minimality and uniqueness. We then develop two complementary perspectives on Bayesian inference over HMM emissions: a geometric perspective through belief states, and an algorithmic perspective through the mixed state presentation (MSP) — both of which naturally extend to GHMMs. Participants will design their own processes and explore the empirical evidence that transformers trained on GHMM data learn belief-state geometry in their residual streams, before developing mechanistic hypotheses about how this geometry is constructed and used.
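
As a rough illustration of the algorithmic perspective, the sketch below enumerates the belief states reachable from an initial belief by repeated Bayesian updating, which yields the nodes and labelled edges of a mixed state presentation for a small HMM. The toy process, the rounding-based deduplication in `key`, the `max_nodes` cap, and the function names are assumptions made for this example, not the module's own construction.

```python
from collections import deque
import numpy as np

# Hypothetical 2-state, 2-symbol HMM (illustration only).
# T[x][i, j] = P(emit symbol x and move to state j | current state i).
T = np.array([
    [[0.5, 0.0],
     [0.0, 0.0]],  # symbol 0
    [[0.0, 0.5],
     [1.0, 0.0]],  # symbol 1
])
start = np.array([2 / 3, 1 / 3])  # stationary distribution of this toy HMM

def key(eta, decimals=10):
    """Hashable label for a belief state; rounding merges numerically equal beliefs."""
    return tuple(float(v) for v in np.round(eta, decimals))

def mixed_state_presentation(T, start, max_nodes=1000):
    """Breadth-first enumeration of the beliefs reachable from `start`."""
    nodes = {key(start): start}
    edges = {}  # (source key, symbol) -> (emission probability, target key)
    queue = deque([start])
    while queue and len(nodes) < max_nodes:  # cap guards against infinite MSPs
        eta = queue.popleft()
        for x in range(T.shape[0]):
            unnorm = eta @ T[x]
            p = float(unnorm.sum())
            if p < 1e-12:  # symbol cannot occur from this belief
                continue
            nxt = unnorm / p
            k = key(nxt)
            edges[(key(eta), x)] = (p, k)
            if k not in nodes:
                nodes[k] = nxt
                queue.append(nxt)
    return nodes, edges

nodes, edges = mixed_state_presentation(T, start)
```

For this toy process the generator has two hidden states, while the enumeration finds four belief states, a small instance of prediction needing more structure than generation. Many processes have infinitely many reachable beliefs, which is why the sketch caps the search.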

This module was compiled by Xavier Poncini, with Adam Shai and Paul Riechers participating in a Q&A.

Prerequisites

Module on Mechanistic Interpretability. Familiarity with the goals and basic methodology of mechanistic interpretability, including the notions of features and circuits; the linear representation hypothesis; linear probes as a method for reading off internal representations; and a basic understanding of the transformer architecture. Students should be comfortable with the idea that one can train a linear map from model activations to some target structure and evaluate its quality (e.g. via MSE or R^2).
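
The probing idea mentioned above can be sketched as follows. This is a minimal example on synthetic data: the "activations" are a noisy random linear image of randomly sampled belief vectors, standing in for real residual-stream activations, and all sizes and variable names are assumptions for illustration. It fits an affine map by least squares on a training split and reports MSE and R^2 on held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, n_states = 500, 16, 3

# Synthetic stand-ins (assumptions): targets on the probability simplex and
# "activations" that embed them linearly, plus a little Gaussian noise.
beliefs = rng.dirichlet(np.ones(n_states), size=n)            # shape (n, 3)
W_true = rng.normal(size=(n_states, d_model))
acts = beliefs @ W_true + 0.01 * rng.normal(size=(n, d_model))

# Affine probe: append a constant column so the fit includes a bias term.
X = np.hstack([acts, np.ones((n, 1))])
train, test = slice(0, 400), slice(400, n)
coef, *_ = np.linalg.lstsq(X[train], beliefs[train], rcond=None)

# Evaluate on held-out points.
pred = X[test] @ coef
resid = pred - beliefs[test]
mse = float(np.mean(resid ** 2))
ss_res = float(np.sum(resid ** 2))
ss_tot = float(np.sum((beliefs[test] - beliefs[test].mean(axis=0)) ** 2))
r2 = 1.0 - ss_res / ss_tot

print(f"held-out MSE = {mse:.2e}, R^2 = {r2:.4f}")
```

Because the synthetic activations are linear in the targets by construction, the probe fits almost perfectly here; with real model activations the held-out score is the quantity of interest.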

Mathematical background. The module is mathematically self-contained — all formal definitions (HMMs, GHMMs, belief states, MSPs) are introduced from scratch. However, students will engage with the material more fluently if they are comfortable with the following:

  • Linear algebra: row-stochastic matrices, rank and invertibility, null spaces (kernels).

  • Probability: conditional probability, Bayes' rule, probability distributions over finite sets, the probability simplex.

  • Basic machine learning: next-token prediction, loss functions (cross-entropy), the concept of model activations.

Content

Fast track

Read 'Transformers represent belief state geometry in their residual stream'. Focus on being able to answer the following questions:

  • What is a hidden Markov model (HMM)?

  • What is a belief state?

  • What is the mixed state presentation (MSP)?

  • For the map learnt from activations to belief states: What is the source? What is the target? How is the quality of the map evaluated?

Main content

[Lecture slides] Overview and scope

  • What is computational mechanics?

  • Computational mechanics & AI safety

  • Where is the research program at?

  • What will we cover today?

Computational mechanics foundations

Applications

Learn more