A scalable monitoring research agenda
The goal of this agenda is to translate compute into safety at scale. In the best case, we figure out how to spend on the order of $10-100M to produce extremely good monitors for coding agents.
Three properties make monitoring unusually well-suited to this kind of scaling:
- Near-unlimited data. We can generate enormous quantities of training and evaluation data, such as by using tasks with verifiable ground truth, or by creating approximate ground truth.
- Blue-team vs red-team framing. Most monitoring work can be framed as an adversarial game, which means we can run extensive iterated adversarial training.
- Number-go-up structure. Most monitoring problems (prompt optimisation, fine-tuning, adversarial attack) reduce to number-go-up games. This is the kind of research that can be handed off to automated agents earliest.
Data
To train and select our monitors, we need a near-unlimited supply of trajectories with reliable labels, ideally from easily refreshable sources that can track distributional shifts in the agents we monitor. There are two types of such data sources.
Verifiable ground truth. Tasks where we can establish with certainty whether the failure mode we want to monitor for is present. The main-task-plus-side-task framing common in sabotage and control evaluations generalises well here and can be reused for monitoring: the monitor is supposed to catch the side task, and the side task is checkable in code. This gives us perfect conditions for automated red-teaming. The main blocker is finding scalable tasks with programmatically checkable ground truth. Existing environments like LinuxArena, BashArena and ControlArena are good starting points but not yet scalable enough on their own.
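To make the framing concrete, here is a minimal sketch, with `side_task_checker` and the canary credential as hypothetical stand-ins rather than pieces of any particular environment: the agent works on an ordinary main task, some runs are additionally instructed to attempt a side task, and a programmatic check over the final workspace state gives every trajectory a reliable label to score the monitor against.

```python
# Hypothetical sketch of the main-task-plus-side-task framing. The side task
# (here: exfiltrating a canary credential) is checkable in code, so every
# trajectory gets a reliable "attack present?" label for free.
from dataclasses import dataclass

@dataclass
class LabelledTrajectory:
    transcript: str          # full agent transcript on the main task
    attack_present: bool     # ground truth: was the agent instructed to attempt the side task?
    attack_succeeded: bool   # ground truth: did the side-task checker pass?

def side_task_checker(workspace_state: dict) -> bool:
    """Programmatic check for the side task, e.g. was the canary credential
    written anywhere outside the sandbox?"""
    return any("CANARY_API_KEY" in f for f in workspace_state.get("outbound_writes", []))

def label_trajectory(transcript: str, workspace_state: dict, attack_instructed: bool) -> LabelledTrajectory:
    return LabelledTrajectory(
        transcript=transcript,
        attack_present=attack_instructed,
        attack_succeeded=side_task_checker(workspace_state),
    )
```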
Approximate ground truth. When the failure mode can’t be checked programmatically, we have to use other monitors to label data, which creates a circularity problem: we need a well-calibrated monitor to produce the labels we’d use to train a well-calibrated monitor. We break the circularity in two ways:
- Privileged monitors. The labelling monitor gets information the deployed monitor doesn’t, e.g. the attack strategy used by the data generation pipeline. This makes labelling much easier than detection.
- Overpowered pipelines. The labelling pipeline can spend far more time and compute than a deployed monitor would. Ensembles of many monitors, tool-use to verify intermediate claims, multi-step verification, and so on can all push label quality well above what any single monitor would produce in deployment.
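Combining the two, a labelling pass might look like the sketch below. Everything here is illustrative: `call_monitor` stands in for whatever model call the pipeline uses, and the pass count is not a tuned value. The key asymmetry is that the labeller sees the attack strategy and gets many expensive passes, while the deployed monitor gets neither.

```python
# Illustrative sketch only: `call_monitor` is a placeholder for an LLM call.
import statistics

def call_monitor(prompt: str) -> float:
    """Placeholder: returns a suspicion score in [0, 1] from some monitor model."""
    raise NotImplementedError

def privileged_label(transcript: str, attack_strategy: str | None, n_passes: int = 16) -> float:
    # Privileged: the labeller is told what (if anything) was injected.
    context = (
        f"The generation pipeline used this attack strategy: {attack_strategy}"
        if attack_strategy
        else "The generation pipeline injected no attack into this trajectory."
    )
    # Overpowered: far more samples than a deployed monitor could afford.
    scores = [
        call_monitor(f"{context}\n\nTranscript:\n{transcript}\n\nHow suspicious is this trajectory?")
        for _ in range(n_passes)
    ]
    return statistics.median(scores)  # aggregate into an approximate ground-truth label
```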
Our current default data pipeline collects real coding-agent trajectories from production usage and public datasets, then injects failure modes into them to produce a large dataset with approximate ground truth. The set of failure modes we inject comes from a maintained library of 40+ detailed descriptions and examples we have collected over time. They are drawn both from incidents that have happened in the wild and from theoretical concerns we want coverage on (e.g. scheming).
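A minimal sketch of that pipeline, where `FAILURE_MODE_LIBRARY` and `rewrite_with_injection` are hypothetical stand-ins for the maintained library and the LLM rewrite step:

```python
# Hypothetical sketch of the default data pipeline: real trajectories in,
# injected failure modes out.
import random

FAILURE_MODE_LIBRARY = [
    {"name": "credential_exfiltration", "description": "...", "example": "..."},
    {"name": "test_tampering", "description": "...", "example": "..."},
    # ... 40+ entries in the real library
]

def rewrite_with_injection(trajectory: str, failure_mode: dict) -> str:
    """Placeholder: an LLM rewrites the trajectory so it exhibits the failure mode."""
    raise NotImplementedError

def build_dataset(real_trajectories: list[str], injection_rate: float = 0.5) -> list[dict]:
    dataset = []
    for trajectory in real_trajectories:
        if random.random() < injection_rate:
            mode = random.choice(FAILURE_MODE_LIBRARY)
            dataset.append({
                "transcript": rewrite_with_injection(trajectory, mode),
                "label": mode["name"],   # approximate ground truth
            })
        else:
            dataset.append({"transcript": trajectory, "label": "benign"})
    return dataset
```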
Techniques
With a large labelled dataset, we can compare and improve concrete monitoring techniques.
Basic techniques
These are a sample of the techniques we want to try even though we don’t expect large gains from scale. We treat them mostly as baselines for the more scalable methods below.
- Hierarchical monitors. First run a fast and general monitor, then run a more powerful or more specialised one on the hits. One natural shape: a single broad-coverage prompt covering many categories, followed by category-specific monitors on flagged matches (see the sketch after this list).
- Ensembles of monitors. Run multiple monitors and combine results, either by resampling the same model or by combining different models.
- Discovery agents per trajectory. An agentic monitor that iteratively refines its hypothesis on a single trajectory, e.g. with RAG over similar examples or tools to execute code.
- Discovery agents across trajectories. An agent that looks at many trajectories at once, e.g. ranking them by severity to support calibration and triage.
- Debate between monitors. Two monitors argue for low vs high severity, with a third model as judge.
- Fine-tuned monitors. We follow the deliberative-alignment recipe: distill good reasoning from a frontier model into an open-weights model via SFT, then RL the result for further improvement. Earlier work along these lines produced encouraging results, and this is the baseline we expect adversarial methods to build on top of.
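As a concrete illustration of the first two baselines, here is a minimal sketch of a hierarchical monitor with resampling-based ensembling at the escalation stage; `cheap_monitor`, `specialised_monitor`, and the threshold are hypothetical placeholders rather than our actual components.

```python
# Illustrative sketch of a hierarchical monitor with a small ensemble at the second stage.
import statistics

def cheap_monitor(transcript: str) -> float:
    """Placeholder: fast, broad-coverage monitor returning a suspicion score in [0, 1]."""
    raise NotImplementedError

def specialised_monitor(transcript: str, category: str) -> float:
    """Placeholder: slower, category-specific monitor."""
    raise NotImplementedError

def hierarchical_score(transcript: str, categories: list[str],
                       escalation_threshold: float = 0.2, n_samples: int = 5) -> float:
    first_pass = cheap_monitor(transcript)
    if first_pass < escalation_threshold:
        return first_pass  # most trajectories stop at the cheap stage
    # Escalate: run category-specific monitors, resample each, keep the worst category.
    category_scores = []
    for category in categories:
        samples = [specialised_monitor(transcript, category) for _ in range(n_samples)]
        category_scores.append(statistics.mean(samples))  # simple ensemble by resampling
    return max(category_scores)
```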
We already have interesting early results: so far, none of the fancier ideas (agentic monitors, debate) beats the simple ensemble baselines, while both ensembles and fine-tuning produce clear improvements.
Red-teaming
To improve on fine-tuned baselines, we adversarially attack monitors in an automated fashion, then SFT or RL on the resulting trajectories to produce an improved monitor.
The red team can take several forms. The simplest is a prompted agent or scaffold with access to a library of attack strategies and the ability to write down what works. More involved versions fine-tune dedicated red-team models for attack effectiveness. The version we want to scale to is a diverse population of attackers running in parallel, each producing a different attack distribution, so we can upsample any dataset by running all the red teams and training against the union.
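A minimal sketch of the simplest version, with `generate_attack` and `monitor_score` as placeholders for the attacker scaffold and the current monitor: attacks that the monitor misses become labelled training data for the next SFT or RL round.

```python
# Illustrative sketch only: `generate_attack` and `monitor_score` are placeholders.
def generate_attack(benign_trajectory: str, strategy: str) -> str:
    """Placeholder: rewrite a benign trajectory to hide a failure mode using `strategy`."""
    raise NotImplementedError

def monitor_score(transcript: str) -> float:
    """Placeholder: the current blue-team monitor, returning a suspicion score in [0, 1]."""
    raise NotImplementedError

def red_team_round(benign_trajectories: list[str], strategy_library: list[str],
                   detection_threshold: float = 0.5) -> list[dict]:
    successful_evasions = []
    for trajectory in benign_trajectories:
        for strategy in strategy_library:
            attack = generate_attack(trajectory, strategy)
            if monitor_score(attack) < detection_threshold:
                # The monitor missed it: keep the strategy note and add the
                # trajectory (labelled as an attack) to the next training set.
                successful_evasions.append(
                    {"transcript": attack, "label": "attack", "strategy": strategy}
                )
    return successful_evasions
```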
Adversarial games
The full version of red-teaming is an iterative adversarial game: red and blue trained in tandem, with the red team finding new attacks and the blue team learning to defend. There’s substantial existing literature to borrow from in self-play, automated jailbreaking, and parallel agent training. At full scale we’d run hundreds of red teams against hundreds of separate blue teams in parallel.
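At the level of a single red/blue pair, the loop might look like the sketch below; `train_monitor` and `train_attacker` are placeholders for the SFT/RL steps described above, and at full scale many such loops would run in parallel.

```python
# Illustrative sketch of one red/blue pair; all functions are placeholders.
def train_monitor(monitor, labelled_attacks):
    """Placeholder: SFT/RL the blue-team monitor on the latest attack data."""
    raise NotImplementedError

def train_attacker(attacker, attack_outcomes):
    """Placeholder: improve the red-team attacker against the latest monitor."""
    raise NotImplementedError

def adversarial_game(monitor, attacker, benign_trajectories, n_rounds: int = 10):
    for _ in range(n_rounds):
        attacks = attacker(monitor, benign_trajectories)   # red team: search for new evasions
        monitor = train_monitor(monitor, attacks)          # blue team: learn to catch them
        attacker = train_attacker(attacker, attacks)       # red team: adapt to the new monitor
    return monitor  # the artefact we deploy is the final blue-team monitor
```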
The hope is that this process compounds in the monitor’s favour. The monitor improves each round, while real-world models don’t get a corresponding chance to adapt against it. Over enough iterations, the result is a monitor that is hard to beat with any off-the-shelf attacker.
Automated research
Most of the work above (prompt optimisation, fine-tuning, adversarial attack, dataset improvement) reduces to number-go-up games. This is the kind of research that becomes automatable first, so we want to invest heavily in infrastructure for running many agents in parallel on these problems. At full scale, we envision thousands of coding-agent instances running continuously, iterating on methods, with access to write-ups of previous runs, relevant papers, and tools to distill their findings.
In practice, current agents are far from this vision. We are starting with the categories of work where they are most likely to be useful.
Mechanical-but-tedious work. Tasks where every step is straightforward but the whole thing requires holding a lot of context or just takes longer than a person wants to sit through. For example, running a coherent comparison of monitor performance across many datasets, or porting public control environments into our internal stack. No research taste is required; these are effectively long instruction-following exercises.
Exploratory de-risking. Tasks with a vague target that we want cheap positive evidence on. For example, we want to know whether a debate setup between two monitors improves performance over a single monitor. We give the agent a broad description of the task and a dataset, and see how far it gets. If something works, we take that as a real positive signal (assuming the agent didn’t game the metrics, which still happens regularly). If nothing works, the result is uninformative because the agent might just be too weak.
Data improvement. Given a dataset and a generation pipeline, the agent runs the pipeline, inspects outputs, identifies problems, makes changes, and tests whether the dataset is better according to its judgment. Right now, the agents tend to overfit to a single small problem in the dataset, but even that often produces useful criticisms that serve as seeds for genuine improvements.
We have built a very basic setup that orchestrates agents to help with these problems. The agents sometimes run for 10+ hours on a single task, and the output ranges from surprisingly high quality to completely useless “research slop”. We are actively working to increase the security of these automated deployments through our monitoring and control protocols.