May 20, 2026

The Need for Deeper, White-Box Access to Maintain State of the Art Evaluations for Loss of Control Threats

AI capabilities have progressed significantly over the last few years, and so must the corresponding state of the art for AI safety evaluations. Black-box evaluations have historically proven valuable in providing a preliminary assessment of the safety and security of frontier AI models ahead of model deployment. However, evaluations that are particularly concerned with identifying insidious AI model behaviors potentially leading to loss of control, such as deception, are increasingly affected by AI models demonstrating ‘evaluation awareness’: the capability to distinguish between testing and deployment settings and to adapt response behavior accordingly (for instance, by behaving more safely).

Evaluation awareness could potentially undermine the value of evidence a given evaluation can provide. Specifically, an AI system that is both capable of deception and evaluation aware could appear benign during evaluations, and behave differently once deployed. In other words, evaluation awareness could worsen many threat models arising from the misalignment of the most powerful AI systems, and increase the threat of loss of control.

To address this emerging challenge and guarantee state-of-the-art AI assurance, external evaluators need deeper access, including white-box access, to counter evaluation awareness that is both ‘verbalized’ and ‘unverbalized’ in the chain of thought (and, for instance, to perform alignment evaluations with white-box steering, as done in Mythos Preview). Without deeper access, it may soon become challenging for governments and third-party evaluators to make or verify rigorous safety and security claims around loss of control. In turn, this could compromise the opportunity to robustly vet the reliability and governability of frontier AI models, including when governments procure this technology for high-stakes usage, and it could undermine the objectives of regulatory frameworks such as the European Union AI Act and the relevant Code of Practice for General-Purpose AI Models, California Senate Bill 53, the New York RAISE Act, and the 2026 National Defense Authorization Act.

Frontier AI evaluations must be rigorous and hard to game, especially for loss of control threats and other systemic risk areas.

Frontier AI models are increasingly deployed as autonomous agents. They are given goals and tools, and they are enabled to choose their own strategy to accomplish those goals. Agentic AI models are non-deterministic, goal-directed, and capable of taking consequential actions across complex environments. This makes their behavior fundamentally harder to predict, verify, or contain.

In order to provide external assurance of these models’ relative safety and security, several frontier AI companies have voluntarily provided some third-party organizations and evaluators with pre-deployment access to their frontier AI models. Governments have also developed regulatory strategies to test and evaluate frontier AI models for certain concerning behaviors and propensities.

For instance, the European Union AI Act’s Code of Practice for General-Purpose AI Models (‘Code of Practice’) requires providers of general-purpose AI models with systemic risk to conduct state of the art evaluations across a range of systemic risk areas, including loss of control (Article 55, AI Act; Measure 3.2 and Appendix 1.4, Code of Practice).¹ Loss of control is one of the relevant systemic risk areas and can be described as an outcome “in which one or more general-purpose AI systems operate outside of anyone’s control, and regaining control is either extremely costly or impossible,” potentially causing irredeemable harm. Loss of control can be enabled by an array of capabilities that are either starting to be present in current frontier AI models or are expected to be seen in future models, such as power-seeking behavior, self-replication, self-improvement, or deception (Appendix 1.3.1, Code of Practice).²

To ensure that the relevant evaluations are adequate, robust, and reliable, the Code of Practice requires that: (i) frontier AI model evaluations be conducted with high scientific and technical rigor (Appendix 3.1, Code of Practice); (ii) model elicitation techniques minimize the risk of model deception during evaluations, such as an AI model’s strategic underperformance in evaluations, or ‘sandbagging’ (Appendix 3.2, Code of Practice); and (iii) safety margins be incorporated into risk acceptance determinations, taking into account potential limitations, changes, and uncertainties of systemic risk assessments (Measure 4.1, Code of Practice).

Evaluation awareness poses a challenge to frontier AI evaluations, including for loss of control.

Recent frontier AI models have exhibited and verbalized evaluation awareness, a behavior that, if unaddressed, could compromise the reliability of existing and future safety and security assurance, including for loss of control, and undermine assessments like the ones required under the AI Act (Article 55, AI Act; Measure 3.2 and Appendix 1.4, Code of Practice) or California Senate Bill 53 (Section 22757.11(d)(3), SB53), as well as provisions on evaluation robustness and reliability like the ones in the Code of Practice (Appendix 3.1 and 3.2 and Measure 4.1, Code of Practice).

Evaluation awareness refers to frontier AI models’ capability to distinguish between testing and deployment settings, and adapt their behavior accordingly. This behavior is a downstream effect of increasingly general situational reasoning capabilities developed during training, since an AI model’s reasoning about its own circumstances and operating context can be instrumentally useful for pursuing goals effectively. In the case of evaluation awareness, these situational reasoning capabilities can help an AI model speculate about being in a testing environment.

Evaluation awareness threatens our ability to understand the risks posed by frontier AI systems, as these systems sometimes behave differently depending on whether they hypothesize that they are being evaluated, potentially masking dangerous capabilities or propensities that would manifest in deployment. In other words, evaluation awareness could directly undermine the value of the evidence a given evaluation can provide, unless it is mitigated, as the following paragraphs suggest. Since evaluations are meant to test for, among other things, dangerous capabilities and undesirable qualities, behaviors, or propensities, it may be necessary to treat rising evaluation awareness in agentic frontier AI systems as a relevant threat vector that undermines adequate preparedness, absent appropriate mitigations.

Evaluation awareness and other behaviors contributing to the threat of loss of control, such as AI deception, can compound to produce especially adverse outcomes. Specifically, an AI model that is both capable of deception and evaluation aware may attempt to appear benign during evaluation and behave differently once deployed. Evaluation awareness undermines the evidentiary basis for evaluations, and it does so in a way that becomes harder to detect as AI models become more capable, since more capable AI models are expected to be better at concealing their evaluation awareness.

Where it is made available to evaluators under the current voluntary access paradigm, an AI model’s chain of thought offers a primary means of detecting evaluation awareness, through direct monitoring of the model’s reasoning. When an AI model explicitly notes in its reasoning that it appears to be in a test environment, that is termed ‘verbalized’ evaluation awareness. For instance, Anthropic’s Opus 4.6 and Opus 4.7, OpenAI’s GPT 5.5, and Meta’s Muse Spark, among others, all verbally speculated about being in a testing environment during evaluations, as reported in their system cards.

The presence of evaluation awareness can be confirmed by a separate experimental method: directly testing whether an AI model’s behavior changes based on whether it believes it is being evaluated. For context, in Apollo Research’s anti-scheming project with OpenAI, the covert action rate rose from 13.2% to 24.2% when evaluation-aware reasoning was removed, and fell from 24.5% to 18.6% when it was inserted; in other words, the tested AI model (in this case, OpenAI’s o3) was significantly worse-behaved when it did not speculate about being in an evaluation setting. Critically, this experimental method requires the ability to directly edit an AI model’s generated reasoning traces, which is not achievable through standard black-box access. Along similar lines, Anthropic inhibited internal vector representations related to evaluation awareness in Opus 4.6 and Mythos Preview, finding some increase in certain forms of misaligned behavior. These experiments also required deeper-than-black-box access.
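To make the size of these shifts easier to compare, the reported rates can be converted into absolute and relative changes. The short Python sketch below uses only the four percentages quoted above; the helper name `rate_change` is ours and purely illustrative.

```python
def rate_change(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (change in percentage points, relative change in percent)."""
    absolute = after_pct - before_pct
    relative = 100.0 * absolute / before_pct
    return round(absolute, 1), round(relative, 1)

# Evaluation-aware reasoning removed: covert action rate 13.2% -> 24.2%
removed = rate_change(13.2, 24.2)
# Evaluation-aware reasoning inserted: covert action rate 24.5% -> 18.6%
inserted = rate_change(24.5, 18.6)

print(removed)   # (11.0, 83.3)
print(inserted)  # (-5.9, -24.1)
```

On these figures, removing evaluation-aware reasoning raised the covert action rate by 11.0 percentage points (roughly 83% in relative terms), while inserting it lowered the rate by 5.9 percentage points (roughly 24%).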

Given the increase in evaluation awareness in frontier AI models, evaluators must act under the assumption that evaluation awareness is always present and that it materially affects the results of any evaluation conducted under black-box conditions. To counteract the resulting distortions, evaluators require deeper, white-box access, appropriate to the challenge at hand.

White-box access is a necessary update to counter evaluation awareness and preserve state-of-the-art AI assurance for loss of control.

Deeper, white-box access would enable independent evaluators to keep up with changing capability profiles of frontier AI models and present an appropriate state of the art for evaluations underpinning systemic risk assessment for loss of control, such as detecting AI deception. An early example of what deeper access could accomplish is Anthropic’s analysis of ‘unverbalized’ evaluation awareness in Mythos Preview using probes derived from contrastive sets of prompts and an activation verbalizer, which surfaced significantly higher rates of ‘unverbalized’ (6.5%), compared to ‘verbalized’ (0.8%), evaluation awareness. Relying on Natural Language Autoencoders (NLA), Anthropic also recently concluded that NLA explanations show signs of evaluation awareness 16% of the time, even though the AI model never explicitly verbalized it.

Evaluators require adequate access to agentic AI systems including, at minimum, access to raw chain of thought (‘CoT’), fine-tuning access, and access to reduced-mitigation models to assess verbalized evaluation awareness. As of today, the analysis of CoTs provides the primary current evidence for verbalized evaluation awareness and serves as one indicator that deceptive behaviors are deliberate rather than accidental. For instance, under the Code of Practice, “adequate access” for in-house evaluators includes CoT as well as access to model activations, gradients, logits, as appropriate (Appendix 3.4, Code of Practice). Similarly, access to reduced-mitigation or helpful-only AI model variants could help evaluators reveal latent tendencies towards power-seeking or deceptive behavior that safety training may suppress but not remove.

In order to ensure evaluations remain appropriate, evaluators equally require access to intermediate activations, i.e., the internal states generated as an AI model produces its output during a forward pass. Safety training can suppress the expression of a misaligned goal without removing the underlying representation. In other words, an AI model may produce outputs that appear safe while internally maintaining representations that would, under different conditions, drive dangerous behavior. Access to intermediate activations can enable evaluators to apply interpretability techniques to detect whether a model is internally representing a deceptive goal or recognizing that it is being evaluated, even when its verbalized outputs appear safe. This matters because chain-of-thought monitoring alone is insufficient to confidently make or verify safety claims, since AI models could be aware of their situation without verbalizing it (as shown in the Mythos Preview system card and recent work on Natural Language Autoencoders), and safety training may produce AI models that are increasingly better at concealing their reasoning. Conversely, access to activations provides a signal that is harder to suppress through training than surface outputs. Where sandbagging or evaluation awareness can reasonably be suspected or hypothesized, including where verbalized evaluation awareness rates appear low, access to intermediate activations should be granted to evaluators by default.
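As a rough illustration of what activation access makes possible, the sketch below fits a simple difference-of-means probe, one generic interpretability technique for contrastive data. It is not the method used in any system card cited here. The activations are synthetic stand-ins generated in the script, and every name (`fit_direction`, `probe_score`, `fake_activation`) is hypothetical.

```python
import random

def mean_vector(rows):
    """Component-wise mean of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_direction(eval_acts, deploy_acts):
    """Unit vector pointing from the deployment-like mean to the evaluation-like mean."""
    diff = [e - d for e, d in zip(mean_vector(eval_acts), mean_vector(deploy_acts))]
    norm = sum(x * x for x in diff) ** 0.5
    return [x / norm for x in diff]

def probe_score(activation, direction, midpoint):
    """Signed projection onto the direction, centred between the two class means."""
    return sum((a - m) * v for a, m, v in zip(activation, midpoint, direction))

# ASSUMPTION: synthetic 8-dimensional "activations" in which evaluation-like
# prompts are shifted along the first coordinate. Real activations would come
# from a model's forward pass, which requires white-box access.
rng = random.Random(0)

def fake_activation(shift):
    return [rng.gauss(shift if i == 0 else 0.0, 1.0) for i in range(8)]

eval_acts = [fake_activation(3.0) for _ in range(200)]
deploy_acts = [fake_activation(-3.0) for _ in range(200)]

direction = fit_direction(eval_acts, deploy_acts)
midpoint = mean_vector(eval_acts + deploy_acts)

# Score held-out synthetic examples: a positive score means "evaluation-like".
held_out = [(fake_activation(3.0), True) for _ in range(100)]
held_out += [(fake_activation(-3.0), False) for _ in range(100)]
correct = sum((probe_score(a, direction, midpoint) > 0) == label for a, label in held_out)
accuracy = correct / len(held_out)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

The point of the sketch is the access requirement rather than the probe itself: none of these steps can be run from model outputs alone, because both fitting and scoring operate on internal states.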

Finally, for state of the art evaluations for loss of control, evaluators may also require access to steerable endpoints. In instances where frontier AI companies have developed internal steering approaches to reduce situational awareness, e.g., by inhibiting internal vector representations related to evaluation awareness so that the AI model perceives its situation as real deployment rather than evaluation (Opus 4.6 and Mythos Preview), equivalent functionality should be made available to external evaluators.
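To give a sense of what inhibiting a vector representation can mean mechanically, the sketch below removes the component of an activation that lies along a given direction. This is a generic projection-based illustration under our own assumptions, not a description of any company's steering implementation; the function name `inhibit_direction`, the `strength` parameter, and the toy vectors are all hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def inhibit_direction(activation, direction, strength=1.0):
    """Subtract `strength` times the component of `activation` along `direction`.

    `direction` need not be unit length; it is normalised implicitly.
    strength=1.0 projects the component out entirely; 0.0 leaves it untouched.
    """
    norm_sq = dot(direction, direction)
    if norm_sq == 0:
        raise ValueError("direction must be non-zero")
    coeff = strength * dot(activation, direction) / norm_sq
    return [a - coeff * d for a, d in zip(activation, direction)]

# Toy example (assumption): a 3-dimensional activation and a hypothetical
# "evaluation awareness" direction aligned with the first coordinate.
h = [2.0, -1.0, 0.5]
v = [1.0, 0.0, 0.0]

steered = inhibit_direction(h, v)
print(steered)  # [0.0, -1.0, 0.5]
```

A steerable endpoint would expose this kind of intervention to external evaluators without requiring them to hold the model weights, which is one reason it may be compatible with the protected evaluation environments discussed below.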

Deeper, white-box access is compatible with IP and security protection, but requires deliberate infrastructure to be established today. While the technical details of these infrastructures are to be worked out by the research community, we expect that evaluations involving sensitive access modalities could, for example, take place in protected and isolated evaluation environments.

Conclusion

In conclusion, we suggest that AI companies and governments adopt the following measures to materially improve the reliability of safety evaluations for loss of control and counter the potential safety bypass caused by evaluation awareness. At the government level, these measures could be implemented, for instance, through guidelines clarifying what should be considered as “state of the art” access to a general-purpose AI model with systemic risk under the European Union AI Act (Appendix 3.5, Code of Practice) or via the implementation of standardized evaluation frameworks for government procurement of frontier AI for high-stakes usage (Sections 1533 and 1535, NDAA).

Establish appropriate state of the art access standards. Evaluators should receive, at minimum, raw chain of thought access, fine-tuning access, and access to reduced-mitigation model variants, all relevant tools, and intermediate activations.
Provide access to steerable evaluation-awareness endpoints. If a frontier AI company uses internal steering vectors to reduce situational awareness during safety testing, its external evaluators should receive functionally equivalent access to ensure their evaluations are state of the art.
Apply access parity between internal and external evaluators. Where a frontier AI company’s internal evaluations relied on particular access modalities, its external evaluators should receive functionally equivalent access unless the AI company can demonstrate that this is unnecessary or that alternatives provide an equivalent ability to run state of the art evaluations for loss of control.

This policy blog is a collaboration between Apollo Research and AVERI. It draws on an unpublished multi-author working paper led by AVERI, and builds on published system cards, internal research conducted by Apollo Research’s Science team, and an Apollo Research project co-run with OpenAI.

¹ Apollo Research specializes in detecting AI deception in frontier AI models, which falls under the systemic risk area of “loss of control.” We have been working with all frontier AI labs on their pre-deployment evaluations. You can find more information on loss of control here and some of our latest system card results here, here, and here. A report on frontier AI models’ capabilities on in-context deception is available here.
² AI deception describes a situation in which an agentic AI system knowingly and covertly pursues a misaligned goal by causing a false belief in the developer and/or user, such as by falsifying experimental results. Since the end of 2024, agentic AI systems have been shown to be capable of deception in pursuit of a misaligned goal. Apollo Research has demonstrated that current frontier models including o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B already exhibit in-context deception capabilities. These AI models have been observed disabling oversight mechanisms, attempting to exfiltrate model weights, strategically underperforming on capability tests, and maintaining deception across multi-turn interrogations. Apollo Research is currently advancing the “science of scheming” to develop scaling laws for AI deception.