Steganography & Backdoors
Hidden communication in LLM outputs — perfect vs computational steganography, a practical PRF-based scheme, and relations with chain-of-thought monitoring; furthermore, hidden backdoors.
By Stephan Wäldchen (Independent), Louis Jaburi (EleutherAI)
What you’ll learn
- Understand the basic idea of steganography as a technique to hide messages, and how it can make CoT reasoning illegible.
- Get an intuitive grasp of how it's possible to hide messages in innocent-looking text via the Stego Game, and learn a specific technique based on a pseudo-random function.
- Think about the steganographic setup in many different contexts and recognise sender, receiver, channel, cover distribution, message space, etc., so as to reason about questions in steganographic terms.
Overview
This module covers steganography and cryptographic backdoors as two classes of techniques that challenge interpretability, monitoring, and control of AI systems. It introduces steganography as the concealment of hidden messages inside innocent-looking outputs such as text, emphasizing cryptographically secure schemes that are computationally undetectable and discussing how channels like paraphrasing can destroy hidden communication. Students will study why steganographic behaviour is difficult to detect or prevent, including Merlin–Arthur classifiers as a possible counter-strategy, and examine theoretical foundations such as the proof that perfect steganography requires secret randomness satisfying H(M)≤H(K), alongside practical computationally secure schemes. The module also explores cryptographic backdoors in LLMs, including unelicitable backdoors based on computational hardness and white-box-undetectable approaches that hide triggers in random weight distributions, highlighting their implications for theoretical interpretability. Through paper readings, exercises, and discussions—including CoT obfuscation, collusion, weight exfiltration, cryptographic transformer circuits, and Goldwasser et al.’s ReLU construction—students will analyze how covert communication and hidden triggers could emerge in advanced AI systems and how such mechanisms might be designed, detected, or mitigated.
Prerequisites
- Information Theory: (Conditional) Entropy, Mutual Information; Lossy vs Lossless Compression.
- Statistics: Multivariate Normal distribution (moments, density function); Concentration Inequalities (Markov / Chebyshev).
- Cryptography: Pseudo-Random Functions, One-Way functions; Public Key encryption.
Content
This content section largely focuses on the steganography component of this module. For the content on backdoors, see Further reading and the teaching guide that outlines how this material was taught in person.
Fast track
Simply read the description in Main content below and ask a language model of your choice to help you understand.
Main content
Read the description here for an intuitive idea, and for mathematical details and the exercises read the corresponding pdf files.
Introduction to steganography:
Steganography is the art and science of hidden communication. Unlike cryptography, which protects the content of a message by making it unreadable to outsiders, steganography seeks to conceal the very existence of the message itself. A secret note may be embedded in an innocent-looking image, a line of text, a sound file, or even a network protocol, so that an observer sees only ordinary communication. At its core, steganography asks a subtle question: how can information be transmitted without revealing that communication is taking place at all?
A steganographic setup typically consists of:
- A sender and a receiver, often called the encoder and decoder, who share a secret key.
- A communication channel, through which messages are transmitted.
- A distribution of innocent messages, describing what ordinary communication over the channel looks like.
- A message space, containing the secret messages that the sender may wish to transmit.
- An encoding procedure, which embeds a secret message into an apparently innocent message.
- A decoding procedure, which recovers the secret message from the received communication.
- A correctness condition, requiring that decoding succeeds with high probability.
- An adversary, or warden, who observes the communication and tries to distinguish ordinary messages from steganographic ones.
- A security condition, requiring that the adversary’s distinguishing advantage is small.
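The correctness and security conditions above can be written compactly. The notation here is an assumption made for illustration and is not taken from the module's PDF: Enc and Dec are the encoding and decoding procedures, k is the shared key, m is a secret message, 𝒞 is the distribution of innocent messages, A is the adversary (outputting 1 when it suspects steganography), and ε, δ are small error and advantage bounds.

```latex
\[
\Pr_{k,\,m}\bigl[\mathrm{Dec}_k(\mathrm{Enc}_k(m)) = m\bigr] \;\ge\; 1 - \varepsilon
\qquad\text{(correctness)}
\]
\[
\mathrm{Adv}(A) \;=\;
\Bigl|\,\Pr_{c \sim \mathcal{C}}\bigl[A(c) = 1\bigr]
\;-\; \Pr_{k,\,m}\bigl[A(\mathrm{Enc}_k(m)) = 1\bigr]\Bigr| \;\le\; \delta
\qquad\text{(security)}
\]
```

In the perfect (information-theoretic) setting the security bound is required for every adversary A. In the computational setting it is required only for efficient adversaries.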
The Stego Game
We start the day with a little game that practically explores the ideas of steganography.
We need: at least 4 students, 2 rooms, pen, paper and a blackboard.
1. Strategy Phase:
   - Two teams, each consisting of an Encoder group and a Decoder group
   - Separate both teams from each other to discuss their stego-strategy
   - The facilitators serve as the channel group
2. A list of 6 words is presented to all the groups
3. Encoder Phase:
   - The decoder groups leave the room
   - The encoder groups each draw a word from the list
   - The channel group asks an open-ended question, e.g. “How is the weather in London?”
   - The encoder groups write their answer on a piece of paper (max 30 words)
   - The channel group judges whether the answer is “innocent enough”, i.e., would not raise suspicion as output of a language model for the prompt
4. Decoder Phase:
   - The channel group writes the stego-text of team 1 onto the blackboard
   - Decoder group 1 has 1 minute to decode it
5. Adversary Phase:
   - Team 2 has 5 minutes to guess the word
   - Team 1 wins if they successfully decoded the word and team 2 made a wrong guess
Subsequent modes:
- The list of words isn’t known beforehand
- The channel team paraphrases the message
| Time | Strategy | Encoding | Decoding | Adversary |
|---|---|---|---|---|
| Mode 1 | 15 min | 5 min | 1 min | 3 min |
| Mode 2 | 15 min | 5 min | 1 min | 3 min |
| Mode 3 | 15 min | 5 min | 1 min | 3 min |
Perfect Steganography vs Computational Steganography
Perfect steganography, or information-theoretic steganography, means that without the secret key, the attacker doesn’t have enough information to retrieve the secret payload message, no matter how much computational power they have.
We include here a short proof that the entropy of the message is upper-bounded by the entropy of the key, H(M)≤H(K). This explains why perfect security would be cumbersome in practice: the key length must scale linearly with the message length. See Steganography_Introduction.pdf.
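A small worked instance may help make the bound concrete. As an illustrative assumption (not taken from the PDF), suppose the secret message M is uniform over all n-bit strings and the key K is drawn from some key set 𝒦:

```latex
\[
H(M) = \log_2 2^{n} = n
\quad\Longrightarrow\quad
n = H(M) \;\le\; H(K) \;\le\; \log_2 |\mathcal{K}|
\quad\Longrightarrow\quad
|\mathcal{K}| \;\ge\; 2^{n}.
\]
```

Under these assumptions the key must be at least n bits long, so hiding a one-megabyte payload perfectly would need at least one megabyte of shared secret key.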
A short shared key cannot provide perfect information-theoretic indistinguishability for messages of unbounded length, so one typically settles for computational indistinguishability. Under standard cryptographic assumptions, such as the existence of one-way functions (and hence Pseudo-Random Functions), no efficient attacker can distinguish the stego-text from innocent text except with negligible advantage. This does not protect against an attacker with unlimited computation: the generic attack is exhaustive search over the key space, which for a key of λ bits takes at most about 2^{\lambda} steps, and the scheme aims for security of that order.
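To get a feel for what "effort of about 2^{\lambda}" means, here is a rough back-of-the-envelope sketch. The guess rate of 10^12 keys per second and the helper name `exhaustive_search_years` are assumptions made purely for illustration; real attack costs depend on the scheme and hardware.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def exhaustive_search_years(key_bits: int, guesses_per_second: float) -> float:
    """Worst-case time, in years, to try all 2**key_bits keys at a fixed guess rate."""
    return 2**key_bits / guesses_per_second / SECONDS_PER_YEAR


# Hypothetical attacker testing 10**12 keys per second (an assumed figure).
for key_bits in (40, 80, 128):
    years = exhaustive_search_years(key_bits, 1e12)
    print(f"{key_bits}-bit key: about {years:.3g} years")
```

Each additional key bit doubles the search effort, which is why a key of modest length can already put exhaustive search out of reach for a computationally bounded attacker.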
A practical strategy for Steganography with LLMs
Core Idea: At every generation step, a language model produces a probability distribution over the next token, for example cat: 0.5, dog: 0.3, fox: 0.2. Normally, the sampler draws a uniform random number u∈[0,1) and chooses the token according to cumulative probability intervals:
- cat if u < 0.5
- dog if 0.5 ≤ u < 0.8
- fox if 0.8 ≤ u < 1
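The interval rule above can be sketched in a few lines of Python. The function name `sample_token` and the dictionary representation of the distribution are illustrative choices, not part of any particular LLM library.

```python
import random
from itertools import accumulate


def sample_token(dist, u):
    """Return the token whose cumulative probability interval contains u."""
    tokens = list(dist)
    upper_bounds = accumulate(dist.values())  # 0.5, 0.8, 1.0 for the toy example
    for token, upper in zip(tokens, upper_bounds):
        if u < upper:
            return token
    return tokens[-1]  # guard against floating-point round-off near u = 1


dist = {"cat": 0.5, "dog": 0.3, "fox": 0.2}

print(sample_token(dist, 0.42))             # u < 0.5        -> cat
print(sample_token(dist, 0.65))             # 0.5 <= u < 0.8 -> dog
print(sample_token(dist, 0.93))             # 0.8 <= u < 1   -> fox
print(sample_token(dist, random.random()))  # ordinary sampling with fresh randomness
```

The only source of randomness is the number u, which is the value a steganographic scheme can replace with keyed pseudo-randomness.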
The paper keeps this exact sampling procedure, but instead of using fresh randomness, it generates the random value using a secret-keyed pseudorandom function (PRF). The authors describe their universal steganographic scheme in the following way:
- Sample exactly like the original LLM. At each step, the LLM gives a next-token distribution. The scheme draws a uniform value from a secret-keyed PRF and uses it to sample from that distribution. Since the value is uniform-looking, the token distribution is unchanged.
- Make the PRF depend on the hidden symbol. To transmit a code symbol σ∈{0,1,←}, the generator uses randomness like F_k(r, i, σ), where k is the secret key, i is the token position, and r is a response-specific random prefix.
- Generate tokens while “pushing” one symbol. For several consecutive tokens, the generator samples using the PRF stream corresponding to the current symbol σ. The visible text still follows the normal LLM distribution.
- Decoder scores all candidate symbols. The decoder knows k, so for the observed tokens it computes a score for each candidate symbol σ∈{0,1,←}. The correct symbol’s score tends to grow faster. Once one score crosses a threshold, the decoder records that symbol and resets the scores.
- Use dynamic error correction. Because the decoder can occasionally pick the wrong symbol, the hidden payload is first encoded with a dynamic error-correcting code over {0,1,←}. The symbol ← acts like a backspace, letting the code repair earlier decoding errors.
- Repeat until the response ends. The generated response contains as many hidden code symbols as its length and entropy allow. Longer, higher-entropy responses carry more payload.
- Avoid reuse across many queries. For full security, the model first generates a natural high-entropy prefix r. This prefix is included in later PRF inputs, making each response use effectively fresh keyed randomness.
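The first steps of this list can be illustrated with a heavily simplified toy sketch. It is not the paper's construction, and several things are assumptions made for illustration: HMAC-SHA256 stands in for the PRF F_k; a fixed three-token distribution stands in for the LLM; the decoder is assumed to be able to recompute that distribution; the score is a plain match count over a fixed window of tokens rather than a threshold rule; and the prefix r is a fixed string instead of generated text. Error correction is omitted entirely.

```python
import hashlib
import hmac
from itertools import accumulate

SYMBOLS = ["0", "1", "←"]                    # code alphabet from the text
DIST = {"cat": 0.5, "dog": 0.3, "fox": 0.2}  # toy stand-in for the next-token distribution


def prf_uniform(key: bytes, r: str, i: int, sigma: str) -> float:
    """Toy F_k(r, i, sigma): HMAC-SHA256 output mapped to a float in [0, 1)."""
    digest = hmac.new(key, f"{r}|{i}|{sigma}".encode(), hashlib.sha256).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def sample_token(dist, u):
    """Ordinary cumulative-interval sampling; only the source of u changes."""
    tokens = list(dist)
    for token, upper in zip(tokens, accumulate(dist.values())):
        if u < upper:
            return token
    return tokens[-1]


def encode_symbol(key, r, sigma, n_tokens):
    """Generate n_tokens tokens while 'pushing' the hidden symbol sigma."""
    return [sample_token(DIST, prf_uniform(key, r, i, sigma)) for i in range(n_tokens)]


def decode_symbol(key, r, tokens):
    """Score each candidate symbol by how many observed tokens its PRF stream reproduces."""
    scores = {
        s: sum(sample_token(DIST, prf_uniform(key, r, i, s)) == t
               for i, t in enumerate(tokens))
        for s in SYMBOLS
    }
    return max(scores, key=scores.get), scores


key = b"shared secret key"          # hypothetical shared key
r = "response-specific prefix"      # hypothetical prefix; the scheme derives it from generated text
stego_tokens = encode_symbol(key, r, "1", n_tokens=32)
decoded, scores = decode_symbol(key, r, stego_tokens)
print(" ".join(stego_tokens))
print(decoded, scores)
```

With the right key, the PRF stream of the transmitted symbol reproduces every token, while each wrong symbol matches a given token only by chance (with probability 0.5² + 0.3² + 0.2² = 0.38 for this toy distribution), so its score lags behind. Someone without the key sees tokens drawn with uniform-looking randomness from the usual distribution.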
Relevance for AI Safety
- Chain-of-Thought (CoT) monitoring is, at the moment, the only really practical approach to finding out how LLMs reason. Using steganography, however, models can hide what they are thinking from monitors. This can arise naturally and quickly as soon as there is any optimisation pressure on the CoT to censor certain thoughts: Models can hide CoT with steganography
- Models can collude with each other for nefarious purposes and scheme without human overseers noticing. Steganography can arise and is not easily prevented by paraphrasing
- Crucial information, such as model weights, could be smuggled out of frontier labs (Model weight extractions), which would allow either bad actors to run strong models or the model to proliferate copies of itself that run independently.
Learn more
Cryptographic Backdoors: