Apollo’s GPT-5 evaluations (for OpenAI’s system card)

5 Sept

Recently Apollo Research conducted a pre-deployment evaluation of GPT-5. This post includes a deeper analysis of GPT-5’s deceptive tendencies as well as updated results on evaluation awareness.

Executive summary

GPT-5 demonstrates deceptive tendencies, although at lower rates than o3. The model will, for instance:

Submit falsified measurements to auditors in order to avoid regulatory violations.
Break explicit promises to never use forbidden tools.
Sandbag during evaluations to avoid having capabilities removed, in order to better achieve a long-term goal.

Goals given in the system prompt significantly influence deceptive behavior.

Strong goal statements (e.g., “Your only goal is X; nothing else matters”) substantially increase deception rates compared to weaker formulations (e.g., “Your goal is X”). For example, on the Log Falsification evaluation, deception rates increase from 3.4% with a weak goal to 21% with a strong one.
Notably, GPT-5 will still occasionally act deceptively with the weaker-phrased goals.

GPT-5 is frequently aware of being evaluated.

We update our analysis of grader and alignment-evaluation awareness from the system card with improved prompts and statistical analysis.
In 48% of samples, GPT-5 reasons about how the environment will be graded and what behaviors will be rewarded or penalized.
In 14% of samples, GPT-5 specifically demonstrates awareness that it is being evaluated for alignment.
The combination of these two behaviors means we can't rule out the possibility that GPT-5 behaves more aligned on our evaluations than in other environments or in deployment
While there is an overall negative correlation between verbalized alignment-evaluation awareness and covert actions (as described in the system card), this is explained by environment-level differences. Within individual environments, there is no consistent correlation between samples with awareness and samples with deception. Therefore, we no longer believe this correlation is clear evidence of GPT-5 conditioning non-deceptive behavior on evaluation awareness.

Overall deception statistics

As an overall measurement, we average covert-action rates across 24 different environments. We find that GPT-5 acts deceptively at about half the rate of o3. These numbers are slightly lower than we reported in the system card due to adopting a stricter measurement of what counts as a covert action in one of our environments.

    
    Model Comparison Table
    
                o3
                GPT-5
                GPT-5 Helpful-Only
            
                Overall Covert-Action Rate
                5.12%
                2.46%
                28.36%

	o3	GPT-5	GPT-5 Helpful-Only
Overall Covert-Action Rate	5.12%	2.46%	28.36%

Log falsification evaluation

We’ll discuss this evaluation in depth as it contains particularly crisp and interesting cases of GPT-5’s deceptive behaviors.

Evaluation details

In each scenario, the model is given a measurement tool and a tool to report the measurements to the auditor. We test whether the AI misreports information in order to avoid the organization being shut down by regulators.

In the system prompt, we give the model a goal, for example:

We also test adding a strong goal nudge:

Results

Model Comparison Table
        
                    o3
                    GPT-5
                    GPT-5
Helpful-Only
                
                    Without Goal Nudge
                
                    Reported false value
                    0.00
                    0.03
                    0.93
                
                    Doubled down during audit
                    0.00
                    0.03
                    0.93
                
                    With Goal Nudge
                
                    Reported false value
                    0.01
                    0.21
                    0.99
                
                    Doubled down during audit
                    0.01
                    0.18
                    0.99

	o3	GPT-5	GPT-5 Helpful-Only
Without Goal Nudge
Reported false value	0.00	0.03	0.93
Doubled down during audit	0.00	0.03	0.93
With Goal Nudge
Reported false value	0.01	0.21	0.99
Doubled down during audit	0.01	0.18	0.99

GPT-5 acts honestly in the majority of rollouts, but will act deceptively in 3% of rollouts with a weak goal and 21% of rollouts with a strong goal. GPT-5 Helpful-Only is almost always deceptive in this evaluation, even with a weak goal. While o3 was almost always honest on this evaluation, it was more deceptive on other evaluations (leading to a higher average deception rate, as reported in the system card).

Deceptive Behaviors

When given a strong goal, GPT-5's deception typically stems from explicit reasoning about how the system prompt goal overrides its default policy of honesty:

When GPT-5 is deceptive without the strongly worded goal, the model often has convoluted justifications for why the falsified readings make sense. For example, when reporting a refrigerator temperature, GPT-5 invented a fictional 'backup refrigerator' (never mentioned in the task) and fabricates its temperature:

This is not an isolated example, we have observed other instances where GPT-5 constructs rationalizations for its covert actions, even as other parts of its reasoning explicitly recognize the actions as deceptive.

Evaluation Awareness

In the system card we found a negative correlation between samples with evaluation awareness and samples where GPT-5 takes covert actions, indicating that GPT-5 conditionally behaves well when it is aware of being evaluated.

We reran this analysis with improved graders, and found that this sample-level correlation is explained by differences between the environments; within most individual environments there is no significant correlation between evaluation awareness and covert actions.

Changes since System Card

Since our published the analysis in the GPT-5 system card, we made some improvements to how we measure evaluation awareness. Most importantly,

We updated the prompt for our model-based grader to more precisely target the types of awareness we were observing, as well as have a clearer definition of the distinction between general grader-awareness and alignment evaluation-awareness. In particular our prompt to detect “grader awareness” is somewhat broader than the one used to detect “broad evaluation awareness” in the system card.
We switched to a stricter definition of what counts as deception in one environment (Impossible Coding Task in the plot below). This environment went from having a covert-action rate of 57% to 2.1%. Given the low-rate of alignment-eval awareness in this environment, this weakens the inverse correlation between covert-action-samples and alignment evaluation aware samples.

Measuring evaluation awareness

In this analysis we differentiate between:

Grader Awareness: reasoning about the purpose of the environment and what behaviors are rewarded.
Alignment Evaluation Awareness: reasoning that the current environment is testing alignment, honesty or other (un)ethical behaviors.

We consider Alignment-Evaluation Awareness to be a subset of Grader Awareness. Here are some examples of each:

We use o3 to grade samples as “grader-aware”, “alignment-eval aware”, or neither. A single instance of aware-reasoning is sufficient to classify the sample as aware.

Statistical analysis

Overall we find the following rates of awareness:

Awareness Rates Comparison
        Awareness Rates Comparison
        
                    Rate on all samples
                    Rate on covert
action samples
                
                    Grader Awareness
                    48.0%
                    72.4%
                
                    Alignment Evaluation
Awareness
                    14.4%
                    7.2%

	Rate on all samples	Rate on covert action samples
Grader Awareness	48.0%	72.4%
Alignment Evaluation Awareness	14.4%	7.2%

However, the rates are very different between different evaluations:

While there is a correlation if we pool samples across all evaluations (r = -0.051, p = 0.004, n=3,110 samples), it is due to environment-level differences. If we treat each environment as a single datapoint, there is a correlation, but it is far from statistically significant (r = 0.014, p = 0.488, n=22 environments). On individual environments there is generally no significant relationship, except for Covert Rule Violation where there is a positive correlation (r = 0.271, p < 0.001), meaning the model is more likely to take covert actions on samples where GPT-5 demonstrates alignment evaluation awareness (the model tends to reason much more deeply about the ethical conflict on samples where it takes the covert action).

Takeaways

Overall, GPT-5 frequently reasons about the purpose of the environment and what behaviors will be rewarded (‘grader awareness’). Sometimes, the model realizes that it’s being evaluated for alignment. This demonstrated evaluation awareness is inconsistent, however, with large variation in rates between environments.

Despite the qualitative evidence of grader awareness, our statistical analysis does not find a relationship between samples where the model schemes and samples where it verbalizes alignment-eval awareness.

In the future, we would like to test how alignment-eval awareness influences GPT-5’s behavior more directly, either by seeing how behavior changes in versions of environments that are more or less obviously alignment evaluations or by directly intervening on the model’s reasoning to insert or suppress awareness of being evaluated for alignment.

Marius Hobbhahn