Our work advancing scientific understanding to foster an effective international evaluation ecosystem

Apollo Research's commitment to international AI Governance in the United States

Evaluations of AI systems for dangerous capabilities and misalignment with human intentions are increasingly central to emerging international governance regimes. In support of this, knowledge-sharing and upskilling between evaluation stakeholders are quickly becoming vital.

Apollo Research is therefore expanding its engagement internationally. As a first step, we are engaging with relevant US efforts such as the US AI Safety Institute Consortium and the National Institute of Standards and Technology (NIST). As part of this broader engagement, we recently visited the United States to meet with government officials, private and public sector AI evaluators, think tanks, and AI companies. We discussed the strengths and limitations of evaluations, demonstrated findings from our evaluations of advanced AI systems for deceptive capabilities, and discussed implications for building a robust domestic and international evaluation ecosystem.

Participation in the US AI Safety Institute Consortium and NIST Working Groups

Following the White House Executive Order on Artificial Intelligence (EO 14110), we are delighted to see the US establish the AI Safety Institute (AISI) and to announce our participation in the US AI Safety Institute Consortium. 

As a member of the US AI Safety Institute Consortium, Apollo Research will be participating in a number of corresponding NIST working groups, including: risk management for generative AI; capability evaluations; AI red-teaming; and safety and security.

These activities are part of a wider effort to engage with US standards development for advanced AI systems and associated evaluation regimes. We believe that standardisation efforts are an effective lever for steering national and international governance frameworks, and that getting them right is key to a competitive marketplace of safe and secure AI systems.

Our engagements include participation in the annual iteration of the Berkeley General-Purpose AI Systems Profile, developed by the Berkeley Center for Long-Term Cybersecurity (CLTC). The profile supplements the NIST AI Risk Management Framework (AI RMF) and sets out recommendations, updated annually, for managing risks from frontier AI systems, including through evaluation and red-teaming.

We also shared Apollo Research's experience evaluating advanced AI systems in our response to the recent NIST Request for Information concerning its assignment under Sections 4.1, 4.5, and 11 of the Executive Order on AI (EO 14110). In particular, we offered guidance on generative AI risk management, AI red-teaming and evaluations, and international collaboration, and we articulated the benefits of independent evaluations in delivering greater public confidence. Highlights from our recommendations include:

  • Align the evaluation process with good scientific practice. This includes threat modelling, designing and implementing variations of experiments to test for robustness, and enabling open scrutiny of methods and results by other researchers (as far as is risk-appropriate).

  • Develop a science of evaluations and advance mechanistic interpretability research. This is important for building more robust methods, increasing confidence in evaluation results, and unlocking 'white-box' evaluations.

  • Mandate data collection by AI companies on incidents and harms when AI systems are in deployment, so that harms occurring at scale can be proactively identified and acted upon. A government agency such as the US AISI would be best placed to mandate and oversee this data collection.

  • Leverage real-world data collection on incidents and harms to evaluate the evaluators. Where harms have occurred, the effectiveness of the evaluation processes to which the AI system was subjected should be reviewed. Closing this feedback loop between evaluations and real-world impact would contribute significantly to confidence in methodologies and their subsequent development.

  • Lead development of international standards for dual-use foundation models. This should include standards on their development as well as standards on evaluations of advanced AI systems, including evaluator assurance requirements.

Broader engagement in the US policy and evaluation ecosystem 

During our visit to the United States, we were invited to participate in several events and give presentations at workshops, including:

  • Frontier Model Evaluation Science Day, hosted by RAND Corporation and attended by US government officials, technical evaluators, and AI companies, where we led a session on evaluations for deception and deceptive capabilities;

  • AI Governance Forum hosted by the Center for a New American Security (CNAS), where we gave a presentation on evaluations;

  • “Missing evals” workshop hosted by Scale AI, bringing AI companies and evaluators together to scope out where further evaluations are needed. 

We also had useful exchanges with a range of think tanks and academic bodies leading on AI, including the Georgetown Center for Security and Emerging Technology (CSET), the Carnegie Endowment for International Peace, and the Berkeley Center for Long-Term Cybersecurity.

An international approach to AI assurance

AI systems will have impacts that transcend borders, which is why we maintain an international policy focus. We look forward to continued engagement with the US, as well as with other jurisdictions and nations that are purposefully moving towards regimes for safe and secure AI.

If any of the topics above resonate and you would like to discuss these further, you can contact us via: governance@apolloresearch.ai 

