15/08/2024

An Opinionated Evals Reading List

We have greatly benefited from reading others’ papers to improve our research ideas, conceptual understanding, and concrete evaluations. Thus, the Apollo Research evaluations team compiled a list of what we felt were important evaluation-related papers. We likely missed some relevant papers, and our recommendations reflect our personal opinions.

Our favorite papers

  • Evaluating Frontier Models for Dangerous Capabilities (Phuong et al., 2024)
  • Observational Scaling Laws and the Predictability of Language Model Performance (Ruan et al., 2024)
    • They find that it is possible to find a low-rank decomposition of models’ capabilities from observed benchmark performances. These can be used to predict the performance of bigger models in the same family.
    • Marius: I think this is the most exciting “science of evals” paper to date. It made me more optimistic about predicting the performance of future models on individual tasks.
  • The Llama 3 Herd of Models (Meta, 2024)
    • Describes the training procedure of the Llama 3.1 family in detail
    • We think this is the most detailed description of how state-of-the-art LLMs are trained to date, and it provides a lot of context that is helpful background knowledge for any kind of evals work.
  • Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al., 2022)
    • Shows how to use LLMs to automatically create large evals datasets. Creates 154 benchmarks on different topics. We think this idea has been highly influential and thus highlight the paper.
    • The original paper used Claude-0.5 to generate the datasets, meaning the resulting data is not very high quality. Also, the methodology section of the paper is much more confusingly written than it needs to be. 
    • For an improved methodology and pipeline for model-written evals, see Dev et al., 2024 or ARENA chapter 3.2 (disclosure: Apollo involvement). 
  • Evaluating Language-Model Agents on Realistic Autonomous Tasks (Kinniment et al., 2023)
    • Introduces LM agent evals for model autonomy. It’s the first paper that rigorously evaluated LM agents for risks related to loss of control, thus worth highlighting.
    • We recommend reading the Appendix as a starting point for understanding agent-based evaluations. 

Other evals-related publications

LM agents

Core:

Other:

Benchmarks

Other:

Science of evals

Core:

Other:

Software

Core:

  • Inspect
  • Vivaria
    • METR’s open-sourced evals tool for LM agents
    • Especially optimized for LM agent evals and the METR task standard
  • Aider

Other:

Miscellaneous

Core:

Other:

Related papers from other fields

Red teaming

Core:

Other:

Scalable oversight

Core:

Other:

Scaling laws & emergent behaviors

Core:

Other:

Science tutorials

Core:

  • Research as a Stochastic Decision Process (Steinhardt)
    • Argues that you should do experiments in the order that maximizes information gained.
    • We use this principle all the time and think it’s very important.
  • Tips for Empirical Alignment Research (Ethan Perez, 2024),
    • Detailed description of what success in empirical alignment research can look like
    • We think it’s a great resource and aligns well with our own approach.
  • You and your Research (Hamming, 1986)
    • Famous classic by Hamming. “What are the important problems of your field? And why are you not working on them?”

Other:

LLM capabilities

Core:

Other:

RLHF

Core:

Other:

Supervised Finetuning/Training & Prompting

Core:

Other:

Fairness, bias, and accountability

AI Governance

Core: 

Other:

Contributions

The first draft of the list was based on a combination of various other reading lists that Marius Hobbhahn and Jérémy Scheurer had previously written. Marius wrote most of the final draft with detailed input from Jérémy and high-level input from Mikita Balesni, Rusheb Shah, and Alex Meinke.