Evaluations

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

July 15, 2025

Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals

July 3, 2025

More Capable Models Are Better At In-Context Scheming

June 19, 2025

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

March 17, 2025

Forecasting Frontier Language Model Agent Capabilities

February 24, 2025

Demo Example – Scheming Reasoning Evaluations

January 23, 2025

Frontier Models are Capable of In-Context Scheming

December 5, 2024

The Evals Gap

November 11, 2024

Towards Safety Cases For AI Scheming

October 31, 2024

An Opinionated Evals Reading List

August 15, 2024

Black-Box Access is Insufficient for Rigorous AI Audits

April 4, 2024

We Need A ‘Science of Evals’

January 22, 2024

A Starter Guide For Evals

January 8, 2024

Large Language Models can Strategically Deceive their Users when Put Under Pressure

November 9, 2023

A Causal Framework for AI Regulation and Auditing

November 8, 2023

Our research on strategic deception presented at the UK’s AI Safety Summit

November 5, 2023

Understanding strategic deception and deceptive alignment

September 15, 2023