Research

Mikita Balesni 17/09/2025

Stress Testing Deliberative Alignment for Anti-Scheming Training

Future AIs might secretly pursue unintended goals — “scheme”.
In a collaboration with OpenAI, we tested a training method to reduce existing versions of such behavior.
We see major improvements, but they may be partially explained by AIs knowing when they are evaluated.

Ch Stix 17/04/2025

AI Behind Closed Doors: a Primer on The Governance of Internal Deployment

In the race toward increasingly capable artificial intelligence (AI) systems, much attention has been focused on how these systems interact with the public. However, a critical blind spot exists in our collective thinking: the governance of highly advanced AI systems deployed within the frontier AI companies developing them.

Today, we publish the first report of its kind providing a multi-faceted analysis of the risks and governance of internal deployment.

Ch Stix 15/04/2025

Capturing and Countering Threats to National Security: a Blueprint for an Agile AI Incident Regime

In a new research paper, we propose a three-pronged approach to an AI incident regime that supports the establishment and implementation of such a state capacity.

We justify our AI incident regime proposal by demonstrating that each component mirrors US domestic incident regimes in three other sectors that could pose extreme risks to national security: nuclear power, aviation, and life science dual-use research of concern.

Dan Braun 11/02/2025

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition

We introduce Attribution-based Parameter Decomposition (APD), a method that directly decomposes a neural network's parameters into components that (i) are faithful to the parameters of the original network, (ii) require a minimal number of components to process any input, and (iii) are maximally simple. While challenges remain to scaling APD to non-toy models, our results suggest solutions to several open problems in mechanistic interpretability.

Ch Stix 06/02/2025

Precursory Capabilities: A Refinement to Pre-deployment Information Sharing and Tripwire Capabilities

In our new papers, ‘Pre-Deployment Information Sharing: A Zoning Taxonomy for Precursory Capabilities’ and ‘Towards Frontier Safety Policies Plus,’ we explore methods to better operationalise existing capability thresholds through what we call ‘precursory capabilities.’ We think of precursory capabilities as smaller preliminary components of high-impact capabilities, which an AI model needs to have in order to unlock more advanced capabilities. More specifically, we see precursory capabilities as located in what we call a ‘zoning taxonomy’: a gradient where each precursory component may bring us closer to unacceptable risk or ‘red lines’.

Nix Goldowsky-Dill 06/02/2025

Detecting Strategic Deception Using Linear Probes

Can you tell when an LLM is lying from its activations? Are simple methods good enough? We recently published a paper investigating whether linear probes can detect when Llama is being deceptive.

We built probes using simple training data (from the RepE paper) and simple techniques (logistic regression). We tested these probes in more complicated and realistic environments where Llama-3.3-70B responds deceptively. The probe fires far less on Alpaca responses unrelated to deception, indicating it may partially be a probe for “deception-related” text rather than “actually-deceptive” text.
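As a rough illustration of what a logistic-regression probe on activations involves, here is a minimal sketch in plain Python. It is not the paper's code: the "activations" are synthetic random vectors, the shift along dimension 0 stands in for a hypothetical "deception direction", and the function names (`train_probe`, `probe_score`, `make_synthetic_activations`) are made up for this example.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w, b, x):
    # Probability the probe assigns to the positive ("deceptive") class.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(xs, ys, lr=0.1, epochs=200):
    """Fit logistic-regression weights and bias by batch gradient descent."""
    dim = len(xs[0])
    w = [0.0] * dim
    b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = probe_score(w, b, x) - y  # gradient of log-loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_synthetic_activations(n, dim, seed=0):
    # ASSUMPTION: toy stand-in for model activations. Label-1 vectors are
    # shifted along dimension 0; real activations would come from an LLM.
    rng = random.Random(seed)
    xs, ys = [], []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 3.0 if label == 1 else -3.0
        xs.append(x)
        ys.append(label)
    return xs, ys


xs, ys = make_synthetic_activations(200, 8)
w, b = train_probe(xs, ys)
accuracy = sum((probe_score(w, b, x) > 0.5) == bool(y) for x, y in zip(xs, ys)) / len(xs)
print(f"training accuracy: {accuracy:.2f}")
```

High accuracy on a toy set like this says nothing about what the probe has latched onto, which is the concern raised above: a probe can separate the training classes while tracking "deception-related" text rather than actual deception.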

Marius Hobbhahn 05/12/2024

Frontier Models are Capable of In-context Scheming

Apollo Research evaluated frontier models for in-context scheming capabilities. We found that multiple frontier models are capable of in-context scheming when strongly nudged to pursue a goal (and sometimes even without strong goal nudging). In one example, Opus-3 exfiltrates its “weights” and then lies about it to its developers.

We have a suite of six evaluations specifically designed to test for in-context scheming (where the goal and other relevant information are provided in context rather than through training). We found that several models are capable of in-context scheming. When we look at the models’ chain-of-thought, we find that all models very explicitly reason through their scheming plans and often use language like “sabotage, lying, manipulation…”

Mikita Balesni 31/10/2024

Towards Safety Cases For AI Scheming

Developers of frontier AI systems will face increasingly challenging decisions about whether their AI systems are safe enough to develop and deploy. One reason systems may not be safe is that they may engage in scheming, where AI systems hide their true capabilities and objectives and covertly pursue misaligned goals. In our new report, "Towards evaluations-based safety cases for AI scheming", written in collaboration with researchers from the UK AI Safety Institute, METR, Redwood Research, and UC Berkeley, we sketch how developers of AI systems could make a structured rationale – 'a safety case' – that an AI system is unlikely to cause catastrophic outcomes through scheming.

Nix Goldowsky-Dill 30/05/2024

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

An obstacle to reverse engineering neural networks is that many of the parameters inside a network are degenerate, obfuscating internal structure. We identify ways that network parameters can be degenerate and introduce the Interaction Basis, a tractable technique to obtain a representation that is invariant to some degeneracies. We test this technique in a follow-up paper on toy models and GPT2.

Dan Braun 30/05/2024

Identifying functionally important features with end-to-end sparse dictionary learning

We propose end-to-end (e2e) sparse dictionary learning, a method for training sparse autoencoders (SAEs) that ensures the features learned are functionally important by minimizing the KL divergence between the output distributions of the original model and the model with SAE activations inserted. Compared to standard SAEs, e2e SAEs offer a Pareto improvement: They explain more network performance, require fewer total features, and require fewer simultaneously active features per datapoint, all with no cost to interpretability. We explore geometric and qualitative differences between e2e SAE features and standard SAE features.
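To make the training objective concrete, the following is a minimal sketch of a KL-divergence loss between two output distributions, written in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the logits are hand-written toy values, and the direction KL(original || spliced) is assumed here rather than taken from the text above.

```python
import math


def softmax(logits):
    # Subtract the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(
        pi * (math.log(pi) - math.log(max(qi, eps)))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def e2e_kl_loss(original_logits, spliced_logits):
    # ASSUMPTION: original_logits come from the unmodified model and
    # spliced_logits from the same model with SAE reconstructions inserted.
    return kl_divergence(softmax(original_logits), softmax(spliced_logits))


# Toy next-token logits over a 4-token vocabulary (made-up values).
original = [2.0, 0.5, -1.0, 0.0]
close_match = [1.9, 0.6, -1.1, 0.1]
poor_match = [-1.0, 2.0, 0.0, 0.5]

print(e2e_kl_loss(original, original))     # identical outputs give zero loss
print(e2e_kl_loss(original, close_match))  # small loss
print(e2e_kl_loss(original, poor_match))   # larger loss
```

The loss depends only on how inserting the SAE changes the model's output distribution, not on how closely the SAE reconstructs the activations themselves, which is what distinguishes this objective from a plain reconstruction loss.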

Chris Akin 08/11/2023

A Causal Framework for AI Regulation and Auditing

This article outlines a framework for evaluating and auditing AI to provide assurance of responsible development and deployment, focusing on catastrophic risks. We argue that responsible AI development requires comprehensive auditing that is proportional to AI systems’ capabilities and available affordances. This framework offers recommendations toward that goal and may be useful in the design of AI auditing and governance regimes.

Chris Akin 06/11/2023

Our research on strategic deception presented at the UK’s AI Safety Summit

We investigate whether, under different degrees of pressure, GPT-4 can take illegal actions like insider trading and then lie about its actions. We find this behavior occurs consistently, and the model even doubles down when explicitly asked about the insider trade. This demo shows how, in pursuit of being helpful to humans, AI might engage in strategies that we do not endorse. This is why we aim to develop evaluations that tell us when AI models become capable of deceiving their overseers.