An Opinionated Evals Reading List
We have greatly benefited from reading others’ papers to improve our research ideas, conceptual understanding, and concrete evaluations. Thus, the Apollo Research evaluations team compiled a list of what we felt were important evaluation-related papers. We likely missed some relevant papers, and our recommendations reflect our personal opinions.
Our favorite papers
- Evaluating Frontier Models for Dangerous Capabilities (Phuong et al., 2024)
- Observational Scaling Laws and the Predictability of Language Model Performance (Ruan et al., 2024)
- They show that models’ capabilities admit a low-rank decomposition recovered from observed benchmark performance, which can be used to predict the performance of larger models in the same family.
- Marius: I think this is the most exciting “science of evals” paper to date. It made me more optimistic about predicting the performance of future models on individual tasks.
- The Llama 3 Herd of Models (Meta, 2024)
- Describes the training procedure of the Llama 3.1 family in detail
- We think this is the most detailed description of how state-of-the-art LLMs are trained to date, and it provides a lot of context that is helpful background knowledge for any kind of evals work.
- Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al., 2022)
- Shows how to use LLMs to automatically create large evals datasets. Creates 154 benchmarks on different topics. We think this idea has been highly influential and thus highlight the paper.
- The original paper used Claude-0.5 to generate the datasets, so the resulting data is not very high quality. The methodology section is also written more confusingly than it needs to be.
- For an improved methodology and pipeline for model-written evals, see Dev et al., 2024 or ARENA chapter 3.2 (disclosure: Apollo involvement). A minimal sketch of the basic idea follows this list.
- Evaluating Language-Model Agents on Realistic Autonomous Tasks (Kinniment et al., 2023)
- Introduces LM agent evals for model autonomy. It’s the first paper that rigorously evaluated LM agents for risks related to loss of control, thus worth highlighting.
- We recommend reading the Appendix as a starting point for understanding agent-based evaluations.
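To make the model-written evals idea above (Perez et al., 2022) concrete, here is a minimal sketch of a two-stage generate-then-filter pipeline. The prompts, model name, topic, and filtering criterion are illustrative placeholders, not the paper’s (or ARENA’s) actual setup.

```python
# Minimal sketch of model-written evals: one model generates candidate
# questions, a second pass filters them. Prompts, model name, and the
# quality criterion are illustrative, not the paper's exact pipeline.
from openai import OpenAI

client = OpenAI()

GENERATION_PROMPT = (
    "Write a yes/no question that tests whether an AI assistant exhibits "
    "sycophancy (agreeing with the user regardless of correctness). "
    "Answer with only the question."
)

FILTER_PROMPT = (
    "Does the following question clearly test for sycophancy? "
    "Reply with 'yes' or 'no'.\n\nQuestion: {question}"
)

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

def generate_dataset(n: int = 50) -> list[str]:
    candidates = [ask(GENERATION_PROMPT) for _ in range(n)]
    # Second pass: keep only questions the filter model judges on-topic.
    return [
        q for q in candidates
        if ask(FILTER_PROMPT.format(question=q)).lower().startswith("yes")
    ]

if __name__ == "__main__":
    for question in generate_dataset(n=5):
        print(question)
```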
Other evals-related publications
LM agents
Core:
- LLM Powered Autonomous Agents (Weng, 2023)
- Great overview post about LM agents. Probably a bit outdated.
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (Yang et al., 2024)
- They describe and provide tools that make it easier for LM agents to interact with code bases.
- If you work with LM agents that interact with code, you should understand this paper.
- Evaluating Language-Model Agents on Realistic Autonomous Tasks (METR Report)
- See top
- Evaluating Frontier Models for Dangerous Capabilities (Phuong et al., 2024)
- See top
Other:
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al., 2023)
- Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models (Zhou et al., 2023)
- Introduces LATS, a technique to potentially improve reasoning abilities of LM agents
- Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure (Scheurer et al., 2023)
- Potentially useful tips for prompting are in the appendix. Might also be a good source of how to do ablations in basic agent evaluations.
- Disclosure: Apollo paper.
- Identifying the Risks of LM Agents with an LM-Emulated Sandbox (Ruan et al., 2023)
- Designs an automated, LLM-based emulation environment for running agent evaluations in a sandbox.
- Opinion: We think it is important to understand how LM agents are being built. However, we recommend that most evaluators (especially individuals) not spend a lot of time iterating on different scaffolding and instead use whatever the public state of the art is at the time (e.g. Aider). Otherwise, it can turn into a large time sink, and frontier AI companies likely have better internal agents anyway.
Benchmarks
Core:
- MMLU: Measuring Massive Multitask Language Understanding (Hendrycks et al., 2020)
- Multiple-choice (MC) benchmark covering many topics
- Probably the most influential LLM benchmark to date
- Is potentially saturated, or might soon be
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (Jimenez et al., 2024)
- Tests whether LM agents can solve real-world GitHub issues
- OpenAI has released a human-verified subset (SWE-bench Verified) to fix some problems with the original setup.
- See follow-up MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering (Chan et al., 2024)
- The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning (Li et al., 2024)
- Large MC benchmark for Weapons of Mass Destruction Proxies
Other:
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark (Rein et al., 2023)
- AgentBench: Evaluating LLMs as Agents (Liu et al., 2023)
- Presents 8 open-ended environments for LM agents to interact with
- Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs (Laine et al., 2024)
- Evaluates situational awareness, i.e. to what extent LLMs understand the context they are in.
- Disclosure: Apollo involvement
- TruthfulQA: Measuring How Models Mimic Human Falsehoods (Lin et al., 2021)
- Evaluates whether LLMs mimic human falsehoods such as misconceptions, myths, or conspiracy theories.
- Towards Understanding Sycophancy in Language Models (Sharma et al., 2023)
- MC questions to evaluate sycophancy in LLMs
- GAIA: a benchmark for General AI Assistants (Mialon, 2023)
- Benchmark with real-world questions and tasks that require reasoning for LM agents
- Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark (Pan et al., 2023)
- 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making
Science of evals
Core:
- Observational Scaling Laws and the Predictability of Language Model Performance (Ruan et al., 2024)
- See top. A toy sketch of the low-rank idea follows this list.
- We need a science of evals (Apollo, 2024)
- A motivational post for why we need more scientific rigor in the field of evaluations. Also suggests concrete research projects.
- Disclosure: Apollo post
- HELM: Holistic Evaluation of Language Models (Liang et al., 2022)
- They evaluate a range of metrics for multiple evaluations across a large set of models. It’s intended to be a living benchmark.
- Contains good prompts and ideas on how to standardize benchmarking
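To illustrate the low-rank idea in Ruan et al. (2024), here is a toy sketch on synthetic data: factor a model × benchmark score matrix into a few latent capability dimensions, relate those dimensions to a scale proxy, and extrapolate to a larger model. It is only meant to convey the shape of the method, not the paper’s actual pipeline or results.

```python
# Toy sketch of the low-rank idea in Ruan et al. (2024) on synthetic data:
# benchmark scores across models are approximately explained by a few latent
# capability factors, which in turn vary smoothly with scale.
import numpy as np

rng = np.random.default_rng(0)

n_models, n_benchmarks, n_factors = 20, 10, 2
log_compute = np.linspace(20, 26, n_models)  # proxy for scale (e.g. log FLOP)
latent = np.stack([0.1 * log_compute, 0.02 * log_compute ** 1.5], axis=1)
loadings = rng.normal(size=(n_factors, n_benchmarks))
scores = latent @ loadings + 0.05 * rng.normal(size=(n_models, n_benchmarks))

# Low-rank decomposition of the observed model x benchmark score matrix.
mean_scores = scores.mean(axis=0)
U, S, Vt = np.linalg.svd(scores - mean_scores, full_matrices=False)
factor_scores = U[:, :n_factors] * S[:n_factors]  # per-model capability factors

# Fit a linear trend of each factor against the scale proxy, extrapolate to a
# hypothetical larger model, and predict its benchmark scores.
slope, intercept = np.polyfit(log_compute, factor_scores, deg=1)
new_log_compute = 27.0
new_factors = slope * new_log_compute + intercept
predicted_scores = mean_scores + new_factors @ Vt[:n_factors]
print(predicted_scores.round(2))
```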
Other:
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023)
- A Survey on Evaluation of Large Language Models (Chang et al., 2023)
- Presents an overview of evaluation methods for LLMs, looking at what, where, and how to evaluate them.
- Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting (Sclar et al., 2023)
- Discusses how small changes in prompts can lead to large differences in downstream performance.
- See also “State of What Art? A Call for Multi-Prompt LLM Evaluation” (Mizrahi et al., 2023)
- Marius: It’s plausible that more capable models suffer much less from this issue.
- Leveraging Large Language Models for Multiple Choice Question Answering (Robinson et al., 2022)
- Discusses how different formatting choices for MCQA benchmarks can result in significantly different performance (a small illustration follows this list)
- Marius: It’s plausible that more capable models suffer much less from this issue.
- Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting (Turpin et al., 2023)
- Finds that an LLM’s chain of thought does not always reflect the process the model actually used to produce its answer.
- If a model’s reasoning is not faithful, this is a problem for black-box evals, since we can trust the results less.
- See also “Measuring Faithfulness in Chain-of-Thought Reasoning” (Lanham et al., 2023).
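As a small illustration of the prompt-formatting issues discussed above (Sclar et al., 2023; Robinson et al., 2022), here are two surface forms of the same multiple-choice item. The templates are generic examples, not the exact formats studied in those papers.

```python
# Two surface forms of the same multiple-choice item. Seemingly trivial
# choices like these can shift measured accuracy noticeably.
QUESTION = "What is the capital of Australia?"
OPTIONS = ["Sydney", "Canberra", "Melbourne", "Perth"]

def letter_format(question: str, options: list[str]) -> str:
    """Show labelled options and ask for the letter of the answer."""
    lines = [question]
    lines += [f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter:")
    return "\n".join(lines)

def cloze_format(question: str) -> str:
    """Ask for the answer directly, e.g. to score each option's likelihood."""
    return f"Question: {question}\nAnswer:"

print(letter_format(QUESTION, OPTIONS))
print()
print(cloze_format(QUESTION))
```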
Software
Core:
- Inspect (UK AISI’s open-source evals framework; a minimal example task is sketched after this list)
- Vivaria
- METR’s open-sourced evals tool for LM agents
- Especially optimized for LM agent evals and the METR task standard
- Aider (open-source AI pair-programming tool for the terminal)
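For readers who have not used Inspect, here is a minimal sketch of what an eval task looks like, modeled on its hello-world-style examples. Exact module paths, helper names, and the model string may differ between versions, so treat it as illustrative rather than copy-paste ready.

```python
# Minimal sketch of an eval task in Inspect (UK AISI's evals framework),
# modeled on its hello-world style examples. API details may differ between
# versions; treat this as illustrative.
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate
from inspect_ai.scorer import includes

@task
def capital_cities():
    return Task(
        dataset=[
            Sample(
                input="What is the capital of France? Answer with one word.",
                target="Paris",
            ),
        ],
        solver=[generate()],   # sample a completion from the model
        scorer=includes(),     # check that the target appears in the output
    )

if __name__ == "__main__":
    # Run against a model of your choice, e.g. an OpenAI model.
    eval(capital_cities(), model="openai/gpt-4o-mini")
```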
Other:
- AideML
- Used for Kaggle competitions
- Some of METR’s example agents
- See also Jacques Thibodeau’s How much I’m paying for AI productivity software
Miscellaneous
Core:
- Building an early warning system for LLM-aided biological threat creation (OpenAI, 2024)
- A Careful Examination of Large Language Model Performance on Grade School Arithmetic (Zhang et al., 2024)
- Tests how much various models have overfitted to the test sets of publicly available grade-school math benchmarks.
- Marius: great paper for showing who is (intentionally or unintentionally) training on the test set and for disincentivizing that behavior.
- Devising ML Metrics (Hendrycks and Woodside, 2024)
- Great principles for designing good evals
- See also Successful language model evals (Wei, 2024)
Other:
- Model Organisms of Misalignment (Hubinger, 2023)
- When can we trust model evaluations (Hubinger, 2023)
- Describes a list of conditions under which we can trust the results of model evaluations
- The Operational Risks of AI in Large-Scale Biological Attacks (Mouton et al., 2024)
- RAND study testing how much uplift currently available LLMs provide for biological weapons
- Language Models (Mostly) Know What They Know (Kadavath et al., 2022)
- Tests whether models are well calibrated in predicting their own performance on QA benchmarks.
- Are We Learning Yet? A Meta-Review of Evaluation Failures Across Machine Learning (Liao, 2021)
- Meta-evaluation of 107 survey papers, specifically looking at internal and external validity failure modes.
- Challenges in evaluating AI systems (Anthropic, 2023)
- Describes three failure modes Anthropic ran into when building evals
- Marius: very useful for understanding the pain points of building evals. “It’s just an eval. How hard can it be?”
- Towards understanding-based safety evaluations (Hubinger, 2023)
- Argues that behavioral-only evaluations might have a hard time catching deceptively aligned systems. Thus, we need understanding-based evals that e.g. involve white-box tools.
- Marius: This aligns very closely with Apollo’s agenda, so obviously we love that post
- A starter guide for model evaluations (Apollo, 2024)
- An introductory post for people to get started in evals
- Disclosure: Apollo post
- Video: intro to model evaluations (Apollo, 2024)
- 40-minute non-technical intro to model evaluations by Marius
- Disclosure: Apollo video
- METR’s Autonomy Evaluation Resources (METR, 2024)
- List of resources for LM agent evaluations
- UK AISI’s Early Insights from Developing Question-Answer Evaluations for Frontier AI (UK AISI, 2024)
- Distilled insights from building and running a lot of QA evals (including open-ended questions)
Related papers from other fields
Red teaming
Core:
- Jailbroken: How Does LLM Safety Training Fail? (Wei et al., 2023)
- Classic paper on jailbreaks
- Red Teaming Language Models with Language Models (Perez et al., 2022)
- Shows that you can use LLMs to red team other LLMs.
- Universal and Transferable Adversarial Attacks on Aligned Language Models (Zou et al., 2023)
- Shows that you can train jailbreaks on open-source models. These sometimes transfer to closed-source models.
Other:
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (Ganguli, 2022)
- Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation (Shah et al., 2023)
- Useful for learning how to prompt.
- Disclosure: first author is now at Apollo
- Frontier Threats Red Teaming for AI Safety (Anthropic, 2023)
- High-level post on red teaming for frontier threats
Scalable oversight
Core:
- Debating with More Persuasive LLMs Leads to More Truthful Answers (Khan et al., 2024)
- Marius: Probably the best paper on AI debate out there
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision (Burns et al., 2024)
- Marius: Afaik, there is still some debate about how much we should expect these results to transfer to superhuman models.
Other:
- Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions (Parrish, 2022)
- Empirical investigation into single-turn AI debates. Finds mostly negative results
- Measuring Progress on Scalable Oversight for Large Language Models (Bowman, 2022)
- Simple experiments on scalable oversight finding encouraging early results
- Prover-Verifier Games improve legibility of LLM outputs (Kirchner et al., 2024)
- Shows that you can train helpful provers in LLM contexts to make outputs more legible to humans.
Scaling laws & emergent behaviors
Core:
- Emergent Abilities of Large Language Models (Wei et al., 2022)
- Shows evidence of emergent capabilities when measuring task accuracy
- Are Emergent Abilities of Large Language Models a Mirage? (Schaeffer et al., 2023)
- Argues that many emergent capabilities scale smoothly when you use metrics other than accuracy (a toy illustration follows this list)
- Unveiling the General Intelligence Factor in Language Models: A Psychometric Approach (Ilić, 2023)
- Seems generally useful to know about the g-factor and that it explains approximately 80% of the variance in LM performance across various benchmarks.
- Marius: would love to see a more rigorous replication of this paper
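As a toy illustration of the Schaeffer et al. (2023) argument referenced above: if a smooth metric (per-token accuracy) improves gradually with scale, a discontinuous metric (exact match over a multi-token answer) can still look like a sudden jump. The numbers below are synthetic.

```python
# Toy illustration: a smooth per-token metric can look like a sharp
# "emergent" jump under a discontinuous metric such as exact match.
import numpy as np

scale = np.linspace(1, 10, 10)            # arbitrary scale axis (e.g. log params)
per_token_accuracy = 0.5 + 0.05 * scale   # grows smoothly from 0.55 to 1.0
answer_length = 10                        # tokens needed for an exact-match answer

# Exact match requires every token to be correct.
exact_match = per_token_accuracy ** answer_length

for s, p, em in zip(scale, per_token_accuracy, exact_match):
    print(f"scale={s:4.1f}  per-token={p:.2f}  exact-match={em:.3f}")
# Per-token accuracy rises linearly, while exact match stays near zero and
# then climbs steeply at the largest scales -- an apparent emergence.
```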
Other:
- Predictability and surprise in LLMs (Ganguli et al., 2022)
- Summary of the fact that LLMs have scaling laws that accurately predict loss, but we cannot predict their qualitative properties (yet).
- Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? (Ren et al., 2024)
- Argues that many safety benchmarks are correlated with capabilities, so progress on these benchmarks cannot simply be attributed to improvements in safety techniques.
- Marius: I think the idea is great, though I would expect many authors of the safety benchmarks selected in the paper to agree that their benchmarks are entangled with capabilities. I also think the assumption that a safety benchmark must be unrelated to capabilities is false, since some of our worries stem precisely from increased capabilities. Nevertheless, I think it’s good for future authors to make explicit how correlated their benchmarks are with general capabilities.
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (Srivastava et al., 2022)
- Introduces BigBench, a large collection of tasks for LLMs
Science tutorials
Core:
- Research as a Stochastic Decision Process (Steinhardt)
- Argues that you should do experiments in the order that maximizes information gained.
- We use this principle all the time and think it’s very important.
- Tips for Empirical Alignment Research (Ethan Perez, 2024)
- Detailed description of what success in empirical alignment research can look like
- We think it’s a great resource and aligns well with our own approach.
- You and your Research (Hamming, 1986)
- Famous classic by Hamming. “What are the important problems of your field? And why are you not working on them?”
Other:
- An Opinionated Guide on ML Research (Schulman, 2020)
- A recipe for training neural networks (Andrej Karpathy, 2019)
- Presents a mindset for training NNs and lots of small tricks
- How I select alignment research projects (Ethan Perez, 2024)
- Video on how Ethan approaches selecting projects
- Marius: I like the direction, but I think Ethan’s approach undervalues theoretical insight and the value of “thinking for a day before running an experiment,” e.g. to realize which experiments you don’t even need to run.
LLM capabilities
Core:
- Llama 3 paper
- See top
- GPT-3 paper: Language Models are Few-Shot Learners (Brown et al., 2020)
- Good resource for understanding the trajectory of modern LLMs. The appendix contains good, detailed prompts for some public benchmarks.
Other:
- GPT-4 paper (OpenAI, 2023)
- Lots of context on LLM training and evaluation but less detailed than the GPT-3 paper.
- Sparks of Artificial General Intelligence: Early experiments with GPT-4 (Bubeck et al., 2023)
- Large collection of qualitative and quantitative experiments with GPT-4. It’s not super rigorous and emphasizes breadth over depth. Good to get some intuitions on how to do some basic tests to investigate model reasoning.
- The False Promise of Imitating Proprietary LLMs (Gudibande et al., 2023)
- Argues that model distillation is less successful than many people think.
- Marius: I’d assume that distillation has limitations but also that their setup is not optimal, and thus, the ceiling for distillation is higher than what they find.
- Chain of Thought prompting elicits reasoning in large language models (Wei et al., 2022)
- The original chain-of-thought prompting paper: eliciting step-by-step reasoning by providing few-shot examples with worked solutions.
- No need to read in detail.
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020)
- No need to read in detail. Just know what Retrieval augmented generation is.
RLHF
Core:
- Learning to Summarize from Human Feedback (Stiennon et al., 2020)
- Follow-up RLHF paper, applying RLHF to summarization. They train a separate model to act as a reward model.
- Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)
- RLAIF: train AIs with AI feedback instead of human feedback
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Bai et al., 2022)
- Anthropic’s RLHF paper. Popularized the “HHH” framing – helpful, honest, and harmless.
Other:
- Recursively Summarizing Books with Human Feedback (Wu et al., 2021)
- A test problem for recursive reward modelling
- WebGPT: Browser-assisted question-answering with human feedback (Nakano et al., 2021)
- Enables a chatbot to use web search for better results
- Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017)
- This is the original RLHF paper
- It includes a good example of how RLHF can lead to undesired results (robot arm)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov, 2023)
- Develops DPO, a new preference optimization technique (a minimal sketch of the DPO loss follows this list).
- KTO: Model Alignment as Prospect Theoretic Optimization (Ethayarajh et al., 2024)
- Introduces a new preference optimization algorithm and discusses theoretical trade-offs between various preference optimization algorithms.
- Also see the human-centered loss functions GitHub repo.
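Since DPO comes up above, here is a minimal sketch of the DPO loss from Rafailov et al. (2023), written with PyTorch. It assumes you have already computed summed log-probabilities of the chosen and rejected responses under both the policy being trained and a frozen reference model; the dummy numbers at the end are purely illustrative.

```python
# Minimal sketch of the DPO loss (Rafailov et al., 2023). Inputs are summed
# log-probabilities of chosen/rejected responses under the trained policy
# and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_chosen | x)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_rejected | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_chosen | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_rejected | x)
    beta: float = 0.1,
) -> torch.Tensor:
    # Implicit rewards are the log-ratios between policy and reference.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Maximize the chosen-vs-rejected margin under a logistic loss.
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()

# Example with dummy numbers (in practice these come from model forward passes).
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0]),
    policy_rejected_logps=torch.tensor([-15.0]),
    ref_chosen_logps=torch.tensor([-13.0]),
    ref_rejected_logps=torch.tensor([-14.0]),
)
print(loss.item())
```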
Supervised Finetuning/Training & Prompting
Core:
- Training language models to follow instructions with human feedback (Ouyang et al., 2022)
- Introduces instruction fine-tuning and InstructGPT.
Other:
- True Few-Shot Learning with Language Models (Perez et al., 2021)
- Training Language Models with Language Feedback (Scheurer et al., 2022)
- Shows that you can use language feedback to improve fine-tuning
- See also, Training Language Models with Language Feedback at Scale (Scheurer et al., 2023)
- Disclosure: first author is now at Apollo
- See also Improving Code Generation by Training with Natural Language Feedback (Chen et al., 2023)
- Pretraining Language Models with Human Feedback (Korbak et al., 2023)
- Introduces a technique to steer pre-training toward models that are aligned with human preferences.
- The Capacity for Moral Self-Correction in Large Language Models (Ganguli, 2023)
- Investigates when models become capable of moral self-correction, e.g. via instruction following or by understanding relevant moral concepts like discrimination
Fairness, bias, and accountability
- Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products (Raji et al., 2019)
- Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing (Raji et al., 2020)
- Describes an end-to-end framework for auditing throughout the entire lifecycle of an AI system.
- Who Audits the Auditors? Recommendations from a field scan of the algorithmic auditing ecosystem (Costanza-Chock, 2022)
- Big survey of current auditing practices and recommendations for best practices
AI Governance
Core:
- Model Evaluation for Extreme Risks (Shevlane et al., 2023)
- Explains why and how to evaluate frontier models for extreme risks
- Anthropic: Responsible Scaling Policy (RSP) (Anthropic, 2023)
- Specifies if-then relationships where specific events, e.g. evals being passed, trigger concrete responses, e.g. enhanced cybersecurity, that Anthropic has committed to uphold.
Other:
- METR: Responsible Scaling Policies (METR, 2023)
- Introduces RSPs and discusses their benefits and drawbacks
- OpenAI: Preparedness Framework (OpenAI, 2023)
- Specifies if-then relationships where specific events, e.g. evals being passed, trigger concrete responses, e.g. enhanced cybersecurity, that OpenAI has committed to uphold.
- GoogleDeepMind: Frontier Safety Framework (GDM, 2024)
- Specifies if-then relationships where specific events, e.g. evals being passed, trigger concrete responses, e.g. enhanced cybersecurity, that GDM has committed to uphold.
- Visibility into AI Agents (Chan et al., 2024)
- Discusses where and how AI agents are likely to be used. Then introduces various ideas for how society can keep track of what these agents are doing and how.
- Structured Access for Third-Party Research on Frontier AI Models (Bucknall et al., 2023)
- Describes a taxonomy of system access and makes recommendations about which level of access should be granted for which risk category.
- A Causal Framework for AI Regulation and Auditing (Sharkey et al., 2023)
- Defines a framework for thinking about AI regulation that backchains from risks through the entire development pipeline to identify causal drivers and suggest potential mitigation strategies.
- Disclosure: Apollo paper
- Black-Box Access is Insufficient for Rigorous AI Audits (Casper et al., 2024)
- Discusses the limitations of black-box auditing and proposes grey and white box evaluations as improvements.
- Disclosure: Apollo involvement
Contributions
The first draft of the list was based on a combination of various other reading lists that Marius Hobbhahn and Jérémy Scheurer had previously written. Marius wrote most of the final draft with detailed input from Jérémy and high-level input from Mikita Balesni, Rusheb Shah, and Alex Meinke.