January 22, 2024

We Need A ‘Science of Evals’

In this post, we argue that if AI model evaluations (evals) want to have meaningful real-world impact, we need a “Science of Evals”, i.e. the field needs rigorous scientific processes that provide more confidence in evals methodology and results.

Model evaluations allow us to reduce uncertainty about properties of neural networks and thereby inform safety-related decisions. For example, evals underpin many Responsible Scaling Policies, and future laws might directly link risk thresholds to specific evals. Thus, we need to ensure that we accurately measure the targeted property and that we can trust the results from model evaluations. This is particularly important when a decision not to deploy an AI system could have significant financial implications for AI companies, e.g. when these companies then fight these decisions in court.

Evals are a nascent field and we think current evaluations are not yet resistant to this level of scrutiny. Thus, we cannot trust the results of evals as much as we would in a mature field. For instance, one of the biggest challenges Language Model (LM) evaluations currently face is the models’ sensitivity to the prompts used to elicit a certain capability (Liang et al., 2022; Mizrahi et al., 2023; Sclar et al., 2023; Weber et al., 2023; Bsharat et al., 2023). Sclar et al., 2023, for example, find that “several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points […]”. A post by Anthropic also suggests that simple formatting changes to an evaluation, such as “changing the options from (A) to (1) or changing the parentheses from (A) to [A], or adding an extra space between the option and the answer can lead to a ~5 percentage point change in accuracy on the evaluation.” As an extreme example, Bsharat et al., 2023 find that “tipping a language model 300K for a better solution” leads to increased capabilities. Overall, this suggests that under current practices, evaluations are much more an art than a science.
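One modest way to make this sensitivity visible is to report the spread of accuracy across prompt formats instead of a single number. The following is a minimal sketch under stated assumptions: the function name `accuracy_spread`, the format labels, and the per-item results are hypothetical placeholders, not data from any of the cited papers.

```python
def accuracy_spread(results_by_format: dict[str, list[bool]]) -> dict[str, float]:
    """Summarize accuracy across prompt formats.

    results_by_format maps a (hypothetical) prompt-format label to per-item
    correctness flags from running the same questions in that format.
    """
    if not results_by_format:
        raise ValueError("need at least one prompt format")
    accuracies = []
    for name, flags in results_by_format.items():
        if not flags:
            raise ValueError(f"no results for format {name!r}")
        accuracies.append(sum(flags) / len(flags))
    lowest, highest = min(accuracies), max(accuracies)
    return {
        "min": lowest,
        "max": highest,
        "mean": sum(accuracies) / len(accuracies),
        "spread": highest - lowest,  # gap between best and worst format
    }


# Made-up placeholder results: the same four questions in three formats.
toy_results = {
    "(A)": [True, True, False, True],
    "(1)": [True, False, False, True],
    "[A]": [True, True, True, True],
}
summary = accuracy_spread(toy_results)
print(summary)
```

A large `spread` relative to the differences between models would suggest that a single-format result says little about the underlying capability.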

Since evals often aim to estimate an upper bound of capabilities, it is important to understand how to elicit maximal rather than average capabilities. Successive improvements to prompt engineering have continuously raised the bar and thus make it hard to estimate whether any particular negative result is meaningful or whether it could be invalidated by a better technique. For example, prompting techniques such as Chain-of-Thought prompting (Wei et al., 2022), Tree of Thought prompting (Yao et al., 2023), or self-consistency prompting (Wang et al., 2022) show how LM capabilities can be greatly improved with principled prompts compared to previous prompting techniques. To point to a more recent example, the newly released Gemini Ultra model (Gemini Team Google, 2023) achieved a new state-of-the-art result on MMLU with a new inference technique called uncertainty-routed chain-of-thought, outperforming even GPT-4. However, when doing inference with chain-of-thought@32 (sampling 32 results and taking the majority vote), GPT-4 still outperforms Gemini Ultra. Days later, Microsoft introduced a new prompting technique called Medprompt (Nori et al., 2023), which again yielded a new SOTA result on MMLU, barely outperforming Gemini Ultra. Overall, these examples illustrate that it is hard to make high-confidence statements about maximal capabilities with current evaluation techniques.
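To make the aggregation step mentioned above concrete, here is a minimal sketch of majority voting over sampled answers, as in “sampling 32 results and taking the majority vote”. It assumes the final answers have already been extracted from each sampled completion; the function name `majority_vote` and the sample answers are hypothetical, and this is not the exact pipeline used in any of the cited reports.

```python
from collections import Counter


def majority_vote(sampled_answers: list[str]) -> str:
    """Return the most frequent answer among sampled completions.

    Ties go to the answer that was sampled first, because Counter keeps
    insertion order for elements with equal counts.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    return Counter(sampled_answers).most_common(1)[0][0]


# Hypothetical final answers extracted from five sampled reasoning chains.
answers = ["B", "A", "B", "C", "B"]
print(majority_vote(answers))
```

The number of samples and the tie-breaking rule are themselves elicitation choices, which is part of why reported results are hard to compare across techniques.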

In contrast, even everyday products like shoes undergo extensive testing, such as repeated bending to assess material fatigue. For higher-stakes products like airplanes, the testing is even more rigorous, examining materials for fatigue resistance, flexibility, and behavior under varying temperatures and humidity. These robust, predictive tests provide consumers with reliable safety assurances.

We think model evaluations should aim to reach a similar stage in order to close the gap between what evals need to do and what they are currently capable of.

This post is a call for action and coordination and not a solution outline. We are interested in collaborating with academics, industry, other third-party auditors, lawmakers, funders, individual contributors, and others to build up the field of Science of Evals and have concrete projects to suggest. If you’re interested, please reach out.

What do we mean by “Science of Evals”?

Most scientific and engineering fields go through a similar maturation cycle where exploratory research is turned into robust and rigorous standards that can be used as the basis for industry norms and laws. If model evaluations want to be impactful for AI safety, we think they have to go through a similar maturation process. Without confidence in the rigor of the evaluation process, people won’t feel confident in its results, which makes evals unsuitable for high-stakes decisions.

Different parts of model evaluations are at different stages of that process. For example, the fairness, bias, and accountability community has done evaluations and benchmarking for many years, developed novel evaluation techniques (e.g. Buolamwini and Gebru, 2018), established concrete voluntary standards such as model cards (Mitchell et al., 2018), and proposed frameworks for algorithmic auditing (e.g. Raji et al., 2020).

Benchmarking, in general, has also been intricately linked to progress in machine learning since its infancy, with classical benchmarks such as CIFAR-10 (Krizhevsky, 2009) and ImageNet (Russakovsky et al., 2014) in vision domains, Atari games (Bellemare et al., 2019) in RL, or GLUE (Wang et al., 2018), MMLU (Hendrycks et al., 2020), HELM (Liang et al., 2022) and Big Bench (Srivastava et al., 2022) in the language domain.

Nevertheless, the field of evaluations, particularly for dangerous capabilities in LMs, is still very young. Since these evaluations will drive policy decisions, be tied to responsible scaling policies, and generally be used as a tool for measuring capabilities and risks of AI systems, it is vital to make this field more rigorous.

Maturation process of a field

**Figure 1:** Maturation process from nascent to mature field. The suggestions are not exhaustive.

In a nascent field, there are no or few agreed-upon best practices and standards. Most of the work is invested in researching and exploring different procedures. Basic terms are defined and the overall field is mostly a scientific endeavor. This may be comparable with the early days of aviation at the beginning of the 20th century. During these days, pioneers were mainly focused on getting planes to reliably fly at all.

In the maturation phase, there is an informal agreement on norms and best practices between different stakeholders, but these cannot yet be used as the basis of comprehensive laws. Many stakeholders are involved in this process, e.g. academia, policymakers, AI companies, AI auditors, and civil society. This is analogous to the aviation industry between ~1920 and ~1950. The technology began to mature and different stakeholders began to agree on basic standards (see e.g. US Air Commerce Act of 1926). The establishment of the International Civil Aviation Organization (ICAO) in 1944 marked a significant step towards formalizing aviation standards, but these standards were not yet universally adopted.

In a mature field, there are formal standards, best practices, and statistical confidence estimates. The field is “ready for law” in the sense that there are clearly established and widely agreed-upon evaluation techniques and the consequences of any particular result are well-defined. This would be comparable to the aviation industry after ~1960 where safety standards started to be backed by statistical analysis and regulatory frameworks became more detailed and were enforced.

The mature field should be able to answer questions such as:

- What precisely does the evaluation measure, e.g. do we have strong agreement that this is a good way to measure any particular concept?
- How large is the “coverage” of the evaluation, i.e. how much of the potential input space of the system was covered? For example, an evaluation that uses only one prompt without any paraphrases or semantically similar prompts has minimal coverage.
- How robust and general are the results of evals, i.e. how does the system perform under different conditions or in varied contexts (e.g. out of distribution)?
- How can we increase the replicability and reliability of evals, e.g. consistently get the same results for reruns?
- Does the evaluation make some form of statistical guarantee, e.g. can we make a statement like “the bad outcome is less likely than 1 in 10000 under normal use conditions”?
- How accurate are the predictions about future systems when using the results of evaluations on current systems?
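To give a feel for what a statistical guarantee of the “less likely than 1 in 10000” kind would demand, the sketch below works through the simplest case: no bad outcome observed in n trials. It assumes independent, identically distributed trials drawn from the “normal use” distribution, which is a strong assumption for real evals; the function names are illustrative.

```python
import math


def zero_failure_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on the failure rate after n_trials with 0 failures.

    Solves (1 - p)^n = 1 - confidence for p, assuming independent trials.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_trials)


def trials_needed(max_rate: float, confidence: float = 0.95) -> int:
    """Failure-free independent trials needed to bound the rate below max_rate."""
    if not 0.0 < max_rate < 1.0:
        raise ValueError("max_rate must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    # Smallest n with (1 - max_rate)^n <= alpha.
    return math.ceil(math.log(alpha) / math.log1p(-max_rate))


n = trials_needed(1e-4)
print(n, zero_failure_upper_bound(n))
```

Under these assumptions, roughly 30,000 failure-free trials are needed to support a 1-in-10000 claim at 95% confidence, and a single observed failure would invalidate this particular bound.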

Current work in the direction of Science of Evals

To convey a better intuitive understanding of what Science of Evals could look like, we list a few papers that we think broadly go in the direction we have in mind. This is not the result of an exhaustive search, so there might be papers that fit our description better; we’re mostly trying to provide a flavor ¹.

- Many different papers, such as Liang et al., 2022; Mizrahi et al., 2023; Sclar et al., 2023, find that different phrasings of the same question can lead to very different results, and thus suggest always evaluating LMs on a set of diverse prompts.
- Multiple papers investigate how different ways of structuring an evaluation, e.g. as multiple choice or generative evaluation, can lead to substantially different results, e.g. Robinson et al., 2022; Wang et al., 2023; Savelka et al., 2023; Khatun et al., 2024. Since model evaluations often try to make statements about the maximal capability of a model, it’s important to be aware of how a question is structured (e.g. discriminative vs. generative evaluation) and worded.
- Several papers investigate the relationship between fine-tuning and prompting, which has important implications for capability elicitation, e.g. C. Wang et al., 2022; Liu et al., 2022; and Lin et al., 2023.
- “With Little Power Comes Great Responsibility” (Card et al., 2020) investigates the statistical power of typical ML experiments. This is a good example of how more rigorous hypothesis testing could be applied to evaluations.
- “Are emergent capabilities a mirage?” (Schaeffer et al., 2023) argues that previously reported emergent capabilities of LMs (Wei et al., 2022; Srivastava et al., 2022) primarily depend on the metric used, e.g. accuracy vs. log-likelihood. While these flaws had already been recognized by Wei et al., 2022 and Srivastava et al., 2022, it is very valuable to rigorously understand how the choice of metric can influence the perceived capabilities.
- “True few-shot learning with Language Models” (Perez et al., 2021) argues that common few-shot techniques at the time would bias the results and thus overestimate the true abilities of LMs. Concretely, many evaluations would select few-shot examples based on a held-out validation set, instead of randomly sampling them. This emphasizes the importance of adequately designing the evals, e.g. not accidentally leaking information from the test set.
- “Elo Uncovered: Robustness and Best Practices in Language Model Evaluation” (Boubdir et al., 2023) investigates whether the commonly used Elo ranking to compare LMs (Zheng et al., 2023) fulfills two core desiderata, reliability and transitivity, in practice. Thus, it is a good example of empirically validating evals methodology.
- Model evaluation survey papers like Chang et al., 2023 summarize the state of the field, discuss trends and examples, and explicitly call for treating model evaluations as a discipline in their own right. Zhang et al., 2023 and Ivanova 2023’s “Running cognitive evaluations on large language models: The do’s and the don’ts” are initial work in meta-evaluating model evaluations as a field and proposing concrete recommendations.
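As a small illustration of why the reliability of Elo-style rankings deserves empirical validation, the sketch below applies the standard Elo update to the same two match results in two different orders. The K-factor of 32, the starting rating of 1000, and the toy matches are assumptions chosen for illustration; this is not the experimental setup of Boubdir et al.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Standard Elo update; score_a is 1.0 if A wins, 0.0 if A loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


def final_ratings(matches, players, start: float = 1000.0, k: float = 32.0):
    """Replay (winner, loser) pairs in order and return the final ratings."""
    ratings = {player: start for player in players}
    for winner, loser in matches:
        ratings[winner], ratings[loser] = elo_update(
            ratings[winner], ratings[loser], 1.0, k
        )
    return ratings


# Hypothetical comparison results: each model wins once against the other.
matches = [("A", "B"), ("B", "A")]
forward = final_ratings(matches, ["A", "B"])
backward = final_ratings(list(reversed(matches)), ["A", "B"])
print(forward)
print(backward)
```

Although both models have identical win records, whichever model won the most recent match ends up ranked higher, so the ordering of comparisons alone can flip the ranking in this toy case.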

Next steps

Below, we provide suggestions for how the Science of Evals ecosystem could be grown as well as open research questions.

Field building

Naturally, a scientific field grows through active research efforts. Thus, we recommend conducting and supporting Science of Evals research efforts. This research could happen in academia, industry, non-profit organizations, or collaborations between them. Historically, academics have produced most of the benchmarks that drove the ML field forward, e.g. ImageNet (Russakovsky et al., 2014) or MMLU (Hendrycks et al. 2020), but more extensive evals efforts might require large teams and resources, so we think industry has unique opportunities for contribution as well.

To support the Science of Evals research ecosystem, we suggest (i) that funding bodies like the National Science Foundation, NIST, various AI safety institutes, and others support researchers with grants and (ii) that big labs provide research funding for Science of Evals, since they also profit a lot from improved evals methodology.

Since Science of Evals should eventually answer questions relevant to regulatory bodies, industry, and academics, we think it is important to seek collaboration between these different groups early on. The people conducting the research should be aware of what lawmakers need, and lawmakers need to better understand the state of the technology. This exchange of ideas can take many different forms, e.g. workshops in which academics and industry representatives define basic terms so that we at least talk and write in compatible language.

We can learn from the many other disciplines that have gone through a similar maturing process and have developed different measures to become more confident in the results of their experiments. For example, physicists sometimes use five standard deviations (5 sigma) as their threshold for significance, and statistical hypothesis testing is typical in most quantitative disciplines outside of ML. Science of Evals could learn from these disciplines and adopt some of their best practices.

Open research questions

Optimally, evals substantially increase or decrease our confidence about whether a model has or doesn’t have a property or tendency. Thus, we think there are two main questions that Science of Evals needs to address: (i) Are we measuring the right quantity? (ii) To what extent can we trust our results?

For both, we propose a few preliminary open research questions.

Are we measuring the right quantity?
- Is there a systematic way to get conceptual clarity about the thing we’re trying to measure? How can we avoid ending up with bad proxies and overfitting to the wrong concepts? For example, when we want to measure whether models use violent language, how do we make sure we don’t overfit to the most salient violent examples while ignoring more subtle forms of violence?
- Assuming we have a working definition of our concept, how can we quantify our confidence that it captures what we want it to capture? For example, can we find a systematic procedure to adversarially test or red-team our current definition such that we identify edge cases? To what extent can this procedure be automated?
- What kind of statement are we trying to make with our eval? For example, are we testing for worst-case or average-case behavior?
Are our results trustworthy?
- Do we have broad coverage of the distribution in question? Can we find techniques to ensure wide coverage reliably at different levels of abstraction (e.g. rewording a question vs. asking an entirely different question about the same property)? Can we quantify the confidence in our coverage somehow?
- How can we measure specific biases in our evaluation? For example, are we accidentally leaking information from the validation set through our procedure? Are there automatic techniques to identify sources of bias, e.g. through statistical anomaly detection?
- Can we define hypothesis testing procedures that provide us with an estimate of statistical significance? For example, can we make statements like “Under these assumptions, the model is above the allowed threshold with a p-value lower than 1e-5”?
- How large is the difference between different elicitation techniques, e.g. prompt engineering vs. fine-tuning? Can we make systematic statements about these differences, e.g. akin to scaling laws?
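As one illustration of the hypothesis-testing question above, the following sketch computes an exact one-sided binomial p-value. The null hypothesis is that the true rate of the behavior is at most `null_rate` (the allowed threshold); a small p-value is evidence that the model is above it. The sketch assumes independent trials, and the function name and the example counts are hypothetical.

```python
import math


def binomial_p_value(successes: int, trials: int, null_rate: float) -> float:
    """One-sided exact binomial p-value: P(X >= successes | rate = null_rate).

    Intended for small trial counts; very large values of `trials` can
    overflow when the binomial coefficient is converted to a float.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 0.0 <= null_rate <= 1.0:
        raise ValueError("null_rate must be between 0 and 1")
    return sum(
        math.comb(trials, i) * null_rate**i * (1.0 - null_rate) ** (trials - i)
        for i in range(successes, trials + 1)
    )


# Hypothetical observation: the behavior appeared in 3 of 20 trials,
# tested against an allowed rate of 1%.
p_value = binomial_p_value(3, 20, 0.01)
print(p_value)
```

Whether independent trials are a reasonable model depends on how prompts are sampled, which ties this question back to the coverage question in the same list.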

Note: One way of quantifying aspects of Science of Evals might be to ask “How much optimization power is needed to elicit that behavior?” For example, we might make empirical statements like “it required 1e10 FLOP to elicit the first occurrence of the concept using state-of-the-art elicitation methods”. Being able to make statements like these would allow us to state the “safety” of AI systems for specific concepts more quantitatively ².

Conclusion

Evals are going to be an important piece of ensuring the safety of AI systems. They enable us to improve our decision-making because they provide important information and are already tied to safety-related decisions, e.g. in RSPs. For evals to be useful for high-stakes decisions, we need to have high trust in the evals process. Evals are a nascent field and we think a “Science of Evals” would greatly accelerate the maturation process of the field and allow us to make higher-confidence statements about the results of evals.

¹ If you think further papers should be added to this list, feel free to contact us.
² We intend to share a more detailed explanation of this idea soon.