Science

We conduct fundamental research into the science of scheming and its potential mitigations, and we develop and run pre-deployment evaluations of frontier AI systems.

Our key papers

Stress Testing Deliberative Alignment for Anti-Scheming Training

17/09/2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

15/07/2025

Frontier Models are Capable of In-Context Scheming

05/12/2024

Towards Safety Cases For AI Scheming

31/10/2024

Large Language Models can Strategically Deceive their Users when Put Under Pressure

09/11/2023

A Causal Framework for AI Regulation and Auditing

08/11/2023

All posts

Science of Scheming

Metagaming matters for training, evaluation, and oversight

16/03/2026

Science of Scheming

Stress Testing Deliberative Alignment for Anti-Scheming Training

17/09/2025

Evaluations

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

15/07/2025

Evaluations, Notes

Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals

03/07/2025

Evaluations

More Capable Models Are Better At In-Context Scheming

19/06/2025

Evaluations, Notes

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

17/03/2025

Evaluations

Forecasting Frontier Language Model Agent Capabilities

24/02/2025

Interpretability

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition

11/02/2025

Interpretability

Detecting Strategic Deception Using Linear Probes

06/02/2025

Evaluations

Demo Example – Scheming Reasoning Evaluations

23/01/2025

Evaluations

Frontier Models are Capable of In-Context Scheming

05/12/2024

Evaluations

The Evals Gap

11/11/2024

Evaluations

Towards Safety Cases For AI Scheming

31/10/2024

Evaluations

An Opinionated Evals Reading List

15/08/2024

Interpretability

Identifying functionally important features with end-to-end sparse dictionary learning

30/05/2024

Interpretability

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

30/05/2024

Evaluations

Black-Box Access is Insufficient for Rigorous AI Audits

04/04/2024

Evaluations

We Need A ‘Science of Evals’

22/01/2024

Evaluations

A Starter Guide For Evals

08/01/2024

Evaluations

Large Language Models can Strategically Deceive their Users when Put Under Pressure

09/11/2023

Evaluations

A Causal Framework for AI Regulation and Auditing

08/11/2023

Evaluations

Our research on strategic deception presented at the UK’s AI Safety Summit

05/11/2023

Evaluations

Understanding strategic deception and deceptive alignment

15/09/2023