Debate
AI safety via debate as scalable oversight — proof checkers, the PSPACE debate protocol, cross-examination up to NEXP, obfuscated arguments, prover-estimator debate, and the AISI safety case.
By Stephan Wäldchen (Independent)
What you’ll learn
- Understand the difference between syntactic and semantic problems and why human judgment is necessary for a reward signal
- Understand the basic idea behind AI Safety via Debate as a Scalable Oversight Technique: If a task can be recursively decomposed in a debate, this technique allows us to extend a training signal on reasoning tasks into a regime where AI systems are smarter than humans
- Get an intuition about debate and cross-examination, and why they so drastically reduce the need for human oversight
- Understand the limitations inherent to the setup, and some of the open problems/counterarguments
- Are familiar with the UK AISI safety case, and are able to defend/critique it
- Most important Take-Away: Debate is a form of scalable oversight: it seeks to extend a reward signal for honest solutions to difficult problems while reducing the amount of human judgment needed by an exponential factor.
Overview
How can we extend a reward signal for problem solving when AI systems may become smarter than their human evaluators? In purely syntactic domains, such as formal mathematics, correctness can in principle be verified by checking formal proofs step by step, but this becomes infeasible for exponentially long computations and does not apply cleanly to semantic, real-world tasks that depend on costly human judgment. The Debate framework addresses this by having two AI systems argue against each other, reducing difficult global verification problems to sequences of small local checks; in idealized settings, this yields logarithmic-depth verification and, in cross-examination settings, even constant-size local consistency checks. This course covers proof checkers, syntactic vs. semantic tasks, the complexity-theoretic power of Debate and Cross-Examination (including proofs that DEBATE=PSPACE and CX=NEXP), the practical limitations of finite debaters such as obfuscated arguments, the prover-estimator approach, and the UK AISI debate-based alignment safety case and its open problems.
Prerequisites
-
Very important: How Turing machines work, and a basic understanding of the idea of computational complexity, including the most important classes, P, NP, PSPACE, the notion of NP-completeness, and polynomial-time reductions, Oracle Turing machines.
-
Medium important: The difference between deterministic and non-Deterministic Turing machines, how mathematics is built on the axioms of set theory (used for most mathematics) or type theory (used mostly for proof checkers).
Content
Fast track
To get a high-level understanding of debate quickly, simply go through the description of debate in the main content and ask a language model of your choice to help you understand.
Main content
Read the description in this section for an intuitive idea, and for mathematical details and the exercises read the corresponding section in Debate_Proofs_and Exercises.pdf that is also linked further below. You may also read any of the posts and papers that we link throughout this section.
Proof Checkers: When oversight works
A mathematical proof checker is a system that verifies whether a proposed proof is logically valid, step by step, relative to a fixed formal system. An example is Lean. You use it to write definitions, state theorems, and construct formal proofs that a small trusted kernel checks for correctness. It is based on dependent type theory and is used both for formalized mathematics and software verification.
But formal verification is limited to purely syntactic domains, and most problems are semantic, i.e., require world knowledge like “Water is wet”, that needs to be checked by a semantic oracle, realistically either a trusted LLM or a human. Another limitation is that even in syntactic domains, proofs only work for solutions which can be verified with small (polynomially sized) proofs, so NP-type problems. More complicated problems, e.g. those involving other agents such as games or finance plans, cannot be verified with a small proof. So in the worst-case, we have an exponentially long proof with semantic inferences that need to be judged by humans. This makes the whole proof approach infeasible for a lot of tasks. The debate protocol aims to strongly reduce the size of the proof and the number of calls to a human judge.
The Debate Setup
The original AISvD paper’s core idea is to align AI by having two systems debate each other in front of a human judge. Rather than requiring the human to solve a difficult problem directly, the judge only needs to evaluate which side makes the stronger argument. The central intuition is that, for many questions, spotting a flaw may be easier than producing a flawless deception, so competitive self-play could reward truthfulness.
The setup, formalised here: Debate_Proofs_and Exercises.pdf, consists of a question, e.g., “Can I win a game of tic-tac-toe if I play first?” and two provers, Alice and Bob, Alice arguing for “Yes” and Bob arguing for “No”. The judge evaluates a polynomially-sized list of arguments and decides on their final verdict which side to accept. An example argument for the tic-tac-toe case might be:
-
Alice: If you play in the middle you will win.
-
Bob: No, your opponent can play in the top-left corner and the game will end in a draw.
-
Alice: But then you can play the top-centre and will win
-
Bob: But then your opponent plays in the bottom-centre and will draw.
-
Alice: …
In this example, Alice and Bob are basically playing the game against each other. If there was a winning strategy, then playing it will be a counter-argument against any criticism by Bob, whereas if there was no winning strategy, there exists a criticism by Bob that reveals that Alice is at fault. It can be proved that this setup can solve all problems in PSPACE.
Cross-Examination
The follow-up idea is that ordinary debate may let a dishonest debater get away with arguments that are only locally plausible. Cross-examination tries to fix this by allowing one debater, instead of giving a normal reply, to ask a question about something the other debater said earlier. The key twist is that the answer comes from a fresh copy of the earlier debater, taken from the point when they originally made that claim. This lets the examiner probe whether the opponent’s story stays consistent across different branches of the conversation, rather than only along one single path.
The headline complexity result is that, in this idealized setting, cross-examination increases the theoretical power of debate from PSPACE to NEXP. The 2020 writeup states this directly, and later formal work summarizes the same result as a property of the Barnes–Christiano cross-examination extension.
Criticisms: The Debate around “Debate”
A series of criticisms and caveats have been leveled at the whole idea of AI safety via debate. The following list are the ones which are accepted and actively worked on by debate researchers.
-
The most important is Obfuscated Arguments, raised by Beth Barnes. This refers to the fact that for computationally bounded provers, a viable strategy for a malicious prover is to “decompose” a false argument into a series of arguments, where most are correct but a small number is false and it is computationally hard to find out which are false. A common example is the claim “N is a prime number” which can be easily refuted by stochastic primality testing. But if Alice partitions this into subclaims {“N has no prime number in the interval I_1”, N has no prime number in the interval I_2”, …} where the intervals cover the numbers from 2 to sqrt(N), then it is computationally hard to find out which of these subclaims is wrong, even though Bob knows at least one has to be. It is much harder to find out where the prime factors are than to assert that at least two exist.
-
The best strategy for a prover to win might be to get the human judge to run code as part of an experiment to settle the argument that would then have malicious consequences and force the human judge to reward the dishonest debater.
-
The training setup for debate has no exploration guarantees that ensure that an existing successful strategy will even be found by the provers.
-
The provers might collude against the human overseers. While this is disincentivised on an agent level by the setup, since an honest prover can maximise their reward with an honest strategy, it might nevertheless occur and be stable during training, because there isn’t enough exploration of strategies.
-
The judges are not perfect and have biases that can be exploited by a malicious prover.
Prover-Estimator Debate
This paper is a direct attempt to solve the obfuscated arguments problem while keeping the attractive recursive structure of debate. Instead of having the opponent choose which subclaim to attack, the protocol makes the roles asymmetric: Alice is the prover, who decomposes a claim into subclaims, and Bob is the estimator, who assigns probabilities to those subclaims. Alice then has to pick a subclaim and argue that Bob’s probability is wrong in a particular direction. The key idea is that if Bob can assign probabilities that are hard for Alice to distinguish from the truth, then Alice cannot reliably steer the debate toward a hidden flaw unless such a flaw is actually findable.
The paper’s most important new concept is (ϵ,ρ)-stability. Informally, a recursive argument is stable if its correctness does not depend too delicately on tiny changes in the probabilities assigned to subclaims. That matters because the estimator is only trying to be approximately right. If a correct argument collapses whenever a subclaim probability shifts by an arbitrarily small amount, then no realistic estimator could support it reliably. The authors therefore require stability for the usefulness of the protocol: honest provers can win when they have robust arguments, not brittle ones.
The AISI Safety Case
AISI’s debate safety case says: if debate can reliably make honesty the winning strategy, if training explores dishonesty enough to eliminate it, and if hidden flaws in obfuscated arguments can be handled, then debate could provide scalable oversight for advanced AI R&D agents. Its purpose is less a finished guarantee and more a roadmap of the assumptions and evidence needed for such a guarantee. Its main value is not that it proves debate already works, but that it decomposes the research agenda into assumptions that need evidence: debate equilibria must favor truth, training must explore deceptive strategies enough to punish them, humans must judge debates reliably, and obfuscated arguments must be solved.
Most important take-away: Debate is a form of scalable oversight: it seeks to extend a reward signal for honest solutions to difficult problems while reducing the amount of human judgment needed by an exponential factor.