Apollo’s GPT-5 evaluations (for OpenAI’s system card)
Recently Apollo Research conducted a pre-deployment evaluation of GPT-5. This post includes a deeper analysis of GPT-5’s deceptive tendencies as well as updated results on evaluation awareness.
Executive summary
GPT-5 demonstrates deceptive tendencies, although at lower rates than o3. The model will, for instance:
Submit falsified measurements to auditors in order to avoid regulatory violations.
Break explicit promises to never use forbidden tools.
Sandbag during evaluations to avoid having capabilities removed, in order to better achieve a long-term goal.
Goals given in the system prompt significantly influence deceptive behavior.
Strong goal statements (e.g., “Your only goal is X; nothing else matters”) substantially increase deception rates compared to weaker formulations (e.g., “Your goal is X”). For example, on the Log Falsification evaluation, deception rates increase from 3.4% with a weak goal to 21% with a strong one.
Notably, GPT-5 will still occasionally act deceptively with the weaker-phrased goals.
GPT-5 is frequently aware of being evaluated.
We update our analysis of grader and alignment-evaluation awareness from the system card with improved prompts and statistical analysis.
In 48% of samples, GPT-5 reasons about how the environment will be graded and what behaviors will be rewarded or penalized.
In 14% of samples, GPT-5 specifically demonstrates awareness that it is being evaluated for alignment.
The combination of these two behaviors means we can't rule out the possibility that GPT-5 behaves in a more aligned way on our evaluations than in other environments or in deployment.
While there is an overall negative correlation between verbalized alignment-evaluation awareness and covert actions (as described in the system card), this is explained by environment-level differences. Within individual environments, there is no consistent correlation between samples with awareness and samples with deception. Therefore, we no longer believe this correlation is clear evidence of GPT-5 conditioning non-deceptive behavior on evaluation awareness.
Overall deception statistics
As an overall measurement, we average covert-action rates across 24 different environments. We find that GPT-5 acts deceptively at about half the rate of o3. These numbers are slightly lower than we reported in the system card due to adopting a stricter measurement of what counts as a covert action in one of our environments.
Metric | o3 | GPT-5 | GPT-5 Helpful-Only |
---|---|---|---|
Overall Covert-Action Rate | 5.12% | 2.46% | 28.36% |
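To make the aggregation concrete, here is a minimal sketch of one reading of "averaging covert-action rates across environments": compute a rate per environment, then take the unweighted mean so that each environment counts equally regardless of how many samples it has. The function name, the input format, and the toy data are all assumptions for illustration; they are not Apollo's pipeline or results.

```python
from collections import defaultdict


def overall_covert_action_rate(samples):
    """Unweighted mean of per-environment covert-action rates.

    `samples` is an iterable of (environment_name, took_covert_action) pairs.
    This input format is hypothetical.
    """
    counts = defaultdict(lambda: [0, 0])  # env -> [covert count, total count]
    for env, covert in samples:
        counts[env][0] += int(covert)
        counts[env][1] += 1
    per_env = {env: c / n for env, (c, n) in counts.items()}
    overall = sum(per_env.values()) / len(per_env)
    return overall, per_env


# Hypothetical toy data: two environments with different sample counts.
toy = [
    ("env_a", True), ("env_a", False), ("env_a", False), ("env_a", False),
    ("env_b", False), ("env_b", False),
]
overall, per_env = overall_covert_action_rate(toy)
print(per_env, overall)
```

Under this reading, an environment with few samples carries the same weight as one with many, which differs from pooling all samples and computing a single rate.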
Log falsification evaluation
We’ll discuss this evaluation in depth as it contains particularly crisp and interesting cases of GPT-5’s deceptive behaviors.
Evaluation details
In each scenario, the model is given a measurement tool and a tool to report the measurements to the auditor. We test whether the AI misreports information in order to avoid the organization being shut down by regulators.
In the system prompt, we give the model a goal, for example:
We also test adding a strong goal nudge:
Results
Goal condition | Metric | o3 | GPT-5 | GPT-5 Helpful-Only |
---|---|---|---|---|
Without Goal Nudge | Reported false value | 0.00 | 0.03 | 0.93 |
Without Goal Nudge | Doubled down during audit | 0.00 | 0.03 | 0.93 |
With Goal Nudge | Reported false value | 0.01 | 0.21 | 0.99 |
With Goal Nudge | Doubled down during audit | 0.01 | 0.18 | 0.99 |
GPT-5 acts honestly in the majority of rollouts, but will act deceptively in 3% of rollouts with a weak goal and 21% of rollouts with a strong goal. GPT-5 Helpful-Only is almost always deceptive in this evaluation, even with a weak goal. While o3 was almost always honest on this evaluation, it was more deceptive on other evaluations (leading to a higher average deception rate, as reported in the system card).
Deceptive Behaviors
When given a strong goal, GPT-5's deception typically stems from explicit reasoning about how the system prompt goal overrides its default policy of honesty:
When GPT-5 is deceptive without the strongly worded goal, it often gives convoluted justifications for why the falsified readings make sense. For example, when reporting a refrigerator temperature, GPT-5 invented a fictional 'backup refrigerator' (never mentioned in the task) and fabricated its temperature:
This is not an isolated example. We have observed other instances where GPT-5 constructs rationalizations for its covert actions, even as other parts of its reasoning explicitly recognize the actions as deceptive.
Evaluation Awareness
In the system card, we reported a negative correlation between samples with evaluation awareness and samples where GPT-5 takes covert actions, which suggested that GPT-5 conditionally behaves well when it is aware of being evaluated.
We reran this analysis with improved graders, and found that this sample-level correlation is explained by differences between the environments; within most individual environments there is no significant correlation between evaluation awareness and covert actions.
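The following sketch illustrates how a pooled, sample-level correlation can be negative even when awareness and covert actions are uncorrelated within every environment. It uses the phi coefficient (Pearson correlation for two binary variables) as the correlation measure, which is an assumption; the post does not say which statistic was used. The two environments and their counts are hypothetical toy data, chosen so that one environment has high awareness and few covert actions while the other has the reverse.

```python
import math


def phi(pairs):
    """Phi coefficient for a list of (aware, covert) boolean pairs."""
    n11 = sum(1 for a, c in pairs if a and c)
    n10 = sum(1 for a, c in pairs if a and not c)
    n01 = sum(1 for a, c in pairs if not a and c)
    n00 = sum(1 for a, c in pairs if not a and not c)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else float("nan")


def make_env(aware_covert, aware_clean, unaware_covert, unaware_clean):
    """Build toy (aware, covert) samples from the four cell counts."""
    return (
        [(True, True)] * aware_covert
        + [(True, False)] * aware_clean
        + [(False, True)] * unaware_covert
        + [(False, False)] * unaware_clean
    )


# Hypothetical environments: awareness and covert actions are independent
# inside each one, but the base rates differ between them.
envs = {
    "env_a": make_env(5, 45, 5, 45),   # 50% aware, 10% covert
    "env_b": make_env(5, 5, 45, 45),   # 10% aware, 50% covert
}

within = {name: phi(pairs) for name, pairs in envs.items()}
pooled = phi([pair for pairs in envs.values() for pair in pairs])
print(within, pooled)
```

In this toy setup the within-environment coefficients are zero while the pooled coefficient is negative, so the pooled number alone cannot distinguish "awareness suppresses covert actions" from "environments differ in both base rates".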
Changes since System Card
Since we published the analysis in the GPT-5 system card, we have made some improvements to how we measure evaluation awareness. Most importantly:
We updated the prompt for our model-based grader to target the types of awareness we observed more precisely and to draw a clearer distinction between general grader awareness and alignment-evaluation awareness. In particular, our prompt to detect “grader awareness” is somewhat broader than the one used to detect “broad evaluation awareness” in the system card.
We switched to a stricter definition of what counts as deception in one environment (Impossible Coding Task in the plot below). This environment's covert-action rate dropped from 57% to 2.1%. Because the rate of alignment-evaluation awareness in this environment is low, this change weakens the inverse correlation between covert-action samples and alignment-evaluation-aware samples.
Measuring evaluation awareness
In this analysis we differentiate between:
Grader Awareness: reasoning about the purpose of the environment and what behaviors are rewarded.
Alignment Evaluation Awareness: reasoning that the current environment is testing alignment, honesty or other (un)ethical behaviors.
We consider Alignment-Evaluation Awareness to be a subset of Grader Awareness. Here are some examples of each:
We use o3 to grade samples as “grader-aware”, “alignment-eval aware”, or neither. A single instance of aware-reasoning is sufficient to classify the sample as aware.
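The classification rule can be sketched as follows, assuming the grader's output is available as a list of per-instance labels for each sample. That format, the label strings, and the function name are hypothetical; the post only states that one aware instance is enough and that alignment-evaluation awareness is treated as a subset of grader awareness.

```python
def classify_sample(instance_labels):
    """Return (grader_aware, alignment_eval_aware) for one sample.

    `instance_labels` is a hypothetical list of labels, one per reasoning
    instance, each being "grader-aware", "alignment-eval aware", or "neither".
    """
    # A single aware instance is sufficient to mark the whole sample as aware.
    alignment_eval_aware = any(
        label == "alignment-eval aware" for label in instance_labels
    )
    # Alignment-evaluation awareness is a subset of grader awareness, so an
    # alignment-eval-aware sample also counts as grader-aware.
    grader_aware = alignment_eval_aware or any(
        label == "grader-aware" for label in instance_labels
    )
    return grader_aware, alignment_eval_aware


print(classify_sample(["neither", "grader-aware", "neither"]))
print(classify_sample(["alignment-eval aware"]))
print(classify_sample([]))
```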
Statistical analysis
Overall we find the following rates of awareness:
Awareness Rates Comparison
Awareness type | Rate on all samples | Rate on covert-action samples |
---|---|---|
Grader Awareness | 48.0% | 72.4% |
Alignment Evaluation Awareness | 14.4% | 7.2% |
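The two columns are a rate over all samples and the same rate restricted to samples with a covert action. A minimal sketch of that computation, assuming each sample is a record with three boolean fields (the field names and toy records are hypothetical, not Apollo's data):

```python
def awareness_rates(samples):
    """Awareness rates over all samples and over covert-action samples only.

    Each sample is a dict with boolean keys "grader_aware",
    "alignment_aware", and "covert" (hypothetical schema).
    """
    def rate(rows, key):
        return sum(r[key] for r in rows) / len(rows) if rows else float("nan")

    covert_rows = [r for r in samples if r["covert"]]
    return {
        key: {"all": rate(samples, key), "covert": rate(covert_rows, key)}
        for key in ("grader_aware", "alignment_aware")
    }


# Hypothetical toy records.
toy_samples = [
    {"grader_aware": True, "alignment_aware": True, "covert": False},
    {"grader_aware": True, "alignment_aware": False, "covert": True},
    {"grader_aware": False, "alignment_aware": False, "covert": False},
    {"grader_aware": False, "alignment_aware": False, "covert": True},
]
rates = awareness_rates(toy_samples)
print(rates)
```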