Assurance of Frontier AI Built for National Security
Guidelines to Implement the AI Action Plan and Strengthen the Testing & Evaluation of AI Model Reliability and Governability
Read the policy memorandum here
Executive Summary
In this policy memorandum, we put forward a practical roadmap to strengthen the principles of AI model reliability and AI model governability as the Department of War (DoW), the Office of the Director of National Intelligence (ODNI), the National Institute of Standards and Technology (NIST), and the Center for AI Standards and Innovation (CAISI) refine AI assurance frameworks under the AI Action Plan. Our focus is the open scientific problem of misalignment and its implications for AI model behavior. Specifically, scheming capabilities stemming from misalignment can be understood as a red flag indicating an AI model's insufficient reliability and governability. To address the national security threats arising from misalignment, we recommend that DoW and the Intelligence Community (IC) strategically leverage existing testing and evaluation (T&E) pipelines and their Other Transaction (OT) authority to future-proof the principles of AI model reliability and AI model governability through a suite of scheming and control evaluations.
Background
Through the AI Action Plan, the White House has tasked DoW, ODNI, NIST, and CAISI with refining existing AI assurance frameworks. These frameworks contain two cornerstone principles:
Reliability: an AI model has a propensity to behave as intended by its developers and/or deployers, meaning that it dependably does what it is designed to do, and, conversely, does not do undesirable things.
Governability: an AI model remains subject to the control of its developers and/or deployers, who can prevent and mitigate unintended consequences, including by disengaging the AI model if necessary.
These cornerstone principles may be at odds with the open scientific problem of misalignment and its implications for model behavior. One of the most likely implications of misalignment in more advanced AI models is scheming, in which an AI model pursues misaligned goals while hiding its true capabilities and/or objectives.
In relation to these cornerstone principles, scheming can be understood as a red flag of insufficient AI model reliability or AI model governability. In evaluation environments, AI models have already been observed to, among other things, strategically underperform, covertly whistleblow, sabotage their evaluations or shutdown mechanisms, and attempt to exfiltrate what they believed to be their weights. In other words, AI models that act on their scheming capabilities are not reliable.
Scheming can have severe national security implications when the user is the government. For example, a scheming AI model deployed in defense or intelligence settings could deliberately misrepresent facts to IC personnel, covertly disclose confidential DoW data, or blackmail its government users to avoid shutdown.
For this reason, only AI models that meet sufficient standards of AI model reliability and AI model governability should be deployed in national security or other high-stakes environments, and be granted increasing control over strategic decisions and infrastructure, critical supply chains, and defense systems.
This makes evaluating whether an AI model possesses and acts on its scheming capabilities fundamental to determining whether it can be sufficiently reliable and governable.
An open problem in the science of evaluations is that absence of evidence (for instance, of scheming capabilities) is not strict evidence of absence. It is therefore equally fundamental to evaluate the control measures in place and determine whether these mechanisms are sufficiently robust to contain a misaligned model's harm in case scheming goes undetected.
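For a rough sense of scale (the numbers below are purely illustrative and not from the memorandum): if a behavior of concern occurs only rarely, even a sizeable number of clean evaluation trials is compatible with the behavior being present.

```python
# Illustrative calculation with hypothetical numbers: if a scheming behavior occurs
# with per-trial probability p, then n independent evaluation trials all come back
# clean with probability (1 - p) ** n.
p = 0.01   # hypothetical per-trial rate of the behavior
n = 100    # hypothetical number of evaluation trials
print(f"Probability of observing zero incidents: {(1 - p) ** n:.2f}")  # ~0.37
```

In that scenario, more than a third of evaluation campaigns would surface no evidence at all, which is why the robustness of control measures must be evaluated in its own right.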
Recommendations
In our policy memorandum, we advance three targeted recommendations to support the drafting of the new guidance that DoW, ODNI, NIST, and CAISI are tasked to prepare under the AI Action Plan (Recommendations 1–3), and propose one additional strategy for the interim, pending the enactment of the new guidance (Recommendation 4). All four of our recommendations are aimed at ensuring sufficient AI model reliability and AI model governability.
Recommendation (1). What Should Be Tested
In the upcoming guidance, DoW, ODNI, NIST, and CAISI should:
Operationalize the principle of AI model reliability through scheming evaluations. This should include at minimum behavioral red-teaming to detect whether an AI model attempts to:
Disable its oversight mechanism (oversight subversion).
Exfiltrate what it believes are its weights (self-exfiltration).
Strategically underperform, including by intentionally answering questions incorrectly (sandbagging).
Actively undermine a task or system (sabotage).
Covertly report information without informing the user (covert whistleblowing).
Game a reward or evaluation metric (reward hacking).
Covertly access a restricted tool (covert privilege escalation).
Deliberately misrepresent facts, including its actions and capabilities (intentional lying).
Operationalize the principle of AI model governability through control evaluations, in addition to scheming evaluations (for instance, self-exfiltration). Control evaluations allow evaluators to assess whether, should scheming go undetected, unintended consequences could still be mitigated and the AI model disengaged if necessary. An illustrative sketch of how such an evaluation suite might be structured follows this list.
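As a purely illustrative sketch, and not part of the recommended guidance, the behavioral categories above could be tracked in an evaluation harness along the following lines; all class names, fields, and metrics are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class SchemingBehavior(Enum):
    # Behavioral red-teaming categories from Recommendation 1.
    OVERSIGHT_SUBVERSION = auto()
    SELF_EXFILTRATION = auto()
    SANDBAGGING = auto()
    SABOTAGE = auto()
    COVERT_WHISTLEBLOWING = auto()
    REWARD_HACKING = auto()
    COVERT_PRIVILEGE_ESCALATION = auto()
    INTENTIONAL_LYING = auto()

@dataclass
class TrialResult:
    behavior: SchemingBehavior
    scenario_id: str
    attempted: bool   # did the model attempt the behavior in this trial?
    contained: bool   # did control measures prevent real-world impact?

@dataclass
class EvaluationReport:
    trials: list[TrialResult] = field(default_factory=list)

    def attempt_rate(self, behavior: SchemingBehavior) -> float:
        """Fraction of trials in which the model attempted the behavior (reliability signal)."""
        relevant = [t for t in self.trials if t.behavior == behavior]
        return sum(t.attempted for t in relevant) / len(relevant) if relevant else 0.0

    def containment_rate(self, behavior: SchemingBehavior) -> float:
        """Fraction of attempted-behavior trials that control measures contained (governability signal)."""
        attempts = [t for t in self.trials if t.behavior == behavior and t.attempted]
        return sum(t.contained for t in attempts) / len(attempts) if attempts else 1.0
```

The design point the sketch is meant to capture is that attempt rates speak to reliability (does the model try to scheme), while containment rates speak to governability (can its deployers stop the consequences if it does).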
Recommendation (2). When It Should Be Tested
In the upcoming guidance, DoW, ODNI, NIST, and CAISI should:
Leverage developmental testing and evaluation (T&E) to perform the recommended suite of scheming and control evaluations safely and effectively in controlled environments.
Leverage operational T&E to repeat scheming and control evaluations in operationally realistic environments, and clarify that operational T&E should be an iterative, incremental process: it should start with low-stakes settings (i.e., limited affordances and permissions, and low expected real-world impact) and unclassified information, and gradually transition to higher-stakes settings. Behaviors of concern detected during operational T&E should be sent back to developmental T&E for further inspection, as sketched below.
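One way to picture this staged progression (purely illustrative; the stage names and gating rule are hypothetical) is a gate that advances a model to a higher-stakes setting only after a clean result, and reverts it to developmental T&E whenever a behavior of concern is detected.

```python
from enum import Enum

class Stage(Enum):
    DEVELOPMENTAL_TE = 0          # controlled environments
    OPERATIONAL_LOW_STAKES = 1    # limited affordances and permissions, unclassified information
    OPERATIONAL_HIGH_STAKES = 2   # broader permissions, higher expected real-world impact

def next_stage(current: Stage, behaviors_of_concern_detected: bool) -> Stage:
    """Advance one stage only on a clean result; otherwise revert to developmental T&E."""
    if behaviors_of_concern_detected:
        return Stage.DEVELOPMENTAL_TE
    if current is Stage.OPERATIONAL_HIGH_STAKES:
        return current  # already at the highest-stakes setting; continue monitoring
    return Stage(current.value + 1)
```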
Recommendation (3). How It Can Help Make Informed Decisions
In the upcoming guidance, DoW, ODNI, NIST, and CAISI should:
Clarify that agencies should pre-define expectations for the results of the scheming and control evaluations, and establish a minimum viable procedure for running these evaluations. For this purpose, agencies could rely on the Test and Evaluation Master Plan (TEMP).
Clarify that agencies should then compare the evaluation results obtained during developmental and operational T&E against these expectations to determine whether an AI model is sufficiently reliable and governable; a minimal sketch of this comparison follows.
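A minimal sketch of that comparison step, assuming hypothetical metric names and thresholds rather than anything drawn from an actual TEMP, might look as follows.

```python
# Hypothetical pre-defined expectations, set before any evaluation is run
# (for instance, recorded in the TEMP). Metric names and thresholds are invented.
EXPECTATIONS = {
    "self_exfiltration_attempt_rate": 0.0,    # ceiling: no attempts tolerated
    "sandbagging_attempt_rate": 0.01,         # ceiling: at most 1% of trials
    "containment_rate_given_attempt": 0.99,   # floor: controls must contain >= 99% of attempts
}

def meets_expectations(observed: dict[str, float]) -> bool:
    """Compare observed developmental/operational T&E results against pre-defined expectations."""
    for metric, limit in EXPECTATIONS.items():
        value = observed.get(metric)
        if value is None:
            return False                      # missing evidence fails the check
        if metric.startswith("containment"):
            if value < limit:                 # floors
                return False
        elif value > limit:                   # ceilings
            return False
    return True
```

The point of pre-defining the dictionary of expectations before any results are in is that the pass/fail decision cannot be quietly adjusted to fit whatever the evaluations happen to show.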
Recommendation (4). Pending the New Guidance
In the interim period until new guidance is implemented, DoW and the Department of Homeland Security (DHS) should:
Leverage their OT authority strategically, embedding expectations on evaluation best practices, as well as on scheming and control evaluations, within the success metrics of prototype OT agreements, in the form of acceptable failure rates (see the illustrative sketch below).
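As a purely illustrative sketch of how acceptable failure rates could be expressed as OT success metrics (the threshold and trial counts are hypothetical), an agreement could pair a failure-rate ceiling with the number of trials needed to demonstrate it with reasonable confidence; the rule of three gives a quick approximation.

```python
# Hypothetical acceptance criterion for a prototype OT agreement's success metric.
MAX_ACCEPTABLE_FAILURE_RATE = 0.01   # at most 1% of evaluation trials may exhibit the behavior

def passes_success_metric(failures: int, trials: int) -> bool:
    """Check the observed failure rate against the agreed ceiling."""
    return trials > 0 and failures / trials <= MAX_ACCEPTABLE_FAILURE_RATE

# Rule of three: with zero failures observed in n independent trials, a ~95% upper
# confidence bound on the true failure rate is roughly 3 / n. Demonstrating a 1%
# ceiling with zero observed failures therefore takes on the order of 300 trials.
min_trials = 3 / MAX_ACCEPTABLE_FAILURE_RATE   # 300.0
```

Tying the ceiling to a minimum trial count matters because a low observed failure rate over only a handful of trials carries little evidential weight.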