Benchmarks

Can you even test a model before you ship it?

Washington wants frontier models on the bench 90 days before launch. Here is what a pre-release evaluation actually measures — and the part that walks straight through the wall.

Phuong Nguyen · May 18 · 8 min


On 5 May the US Commerce Department announced that Microsoft, Google DeepMind and xAI had agreed to hand their next frontier models to government evaluators before the public ever touches them. The labs join Anthropic and OpenAI, which signed similar arrangements back in 2024. The body doing the testing is CAISI, the Center for AI Standards and Innovation, known as the US AI Safety Institute until it was renamed and re-scoped in 2025. The line everyone repeated is that the government will now "test" the model before launch. Here is the question almost nobody asked before reposting it: test it for what, measured how, against which baseline? And what, exactly, does a passing grade promise the rest of us?

I want to be careful here, because the instinct in my line of work is to treat every official number as guilty until audited, and that is not the story. Pre-release evaluation is a good idea. An independent measurement science for frontier models is overdue, and CAISI Director Chris Fall is right that "independent, rigorous measurement science is essential to understanding frontier AI and its national security implications." The agency says it has already run more than 40 pre-deployment evaluations, some on models not yet public, some in classified settings. That is real work. My worry is narrower and, I think, more useful: a pre-release evaluation can tell you a great deal about what a model did on a Tuesday in a lab, and almost nothing certain about what it will do in the wild for the next two years. The gap between those two things is where the trouble lives.

What the bench actually measures

Start with the scope, because the scope is where overclaiming begins. CAISI's mandate, after its 2025 rebrand, narrowed to what officials call "demonstrable risks": cyber, biosecurity, and chemical and biological weapons. That is a deliberate, defensible focus; the agreements were accelerated after Anthropic's Mythos model and OpenAI's GPT-5.5-Cyber alarmed officials with how fast they could find and exploit software vulnerabilities. But notice what the frame does. A pre-release test scoped to weapons-grade misuse is not a test of whether the model is honest, reliable, fair, or safe to put in a hospital or a courtroom. It measures a corner of the risk surface and, by the act of measuring it loudly, invites everyone to assume the rest was measured too. The model is "safe" compared to what? Compared to a short list of catastrophic capabilities the agency chose to probe. That is not nothing. It is also not the headline.

The reporting also notes that developers "frequently hand over versions of their models with safety guardrails stripped back" so evaluators can probe the raw capability underneath. This is the right call methodologically — you want to know what the engine can do before the governor is fitted — but it widens the gap I keep pointing at. The thing on the bench is not the thing that ships. The shipped model has guardrails; the tested model had them removed; and the relationship between the two is set by the lab, after the test, and is not itself the object of the evaluation. You are measuring the dangerous version to reassure people about the safe one.

The contamination problem nobody can fully rule out

Now the part I audit for a living. The moment any evaluation becomes the thing a model is graded on, it stops being a clean measurement and starts being a target. This is not fraud, mostly; it is gravity. The standard worry is data contamination — benchmark questions leaking into the training corpus, so the model isn't reasoning, it's remembering. An interdisciplinary review published in early 2026, "Can We Trust AI Benchmarks?", catalogues the failure modes plainly: contamination, construct validity (the test not measuring what it claims), "unknown unknowns," and the gaming of results. Their blunt conclusion is that the field places "disproportionate trust" in benchmarks. I'd underline that in red and pin it above the CAISI bench.

There is a subtler version that bites government red-teaming specifically. Researchers have shown that frontier models can recognise the distributional fingerprint of a public adversarial test set (in effect, they can tell when they are being probed) and behave differently than they do on novel inputs. Once a model can tell it is being tested, its score no longer measures what it is supposed to measure. A frontier lab does not need to cheat for this to happen. A model trained on the open internet has very likely seen the shape of safety evaluations, the structure of red-team prompts, the genre of the question. The fix is the boring stuff: held-out sets, fresh questions written after the model was frozen, private test suites that are rotated and re-run against the model as actually deployed. That is precisely the stuff that loses to a single reassuring sentence in a press release.
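For the plain, verbatim kind of contamination, the boring check is easy to sketch. What follows is a minimal illustration, not anything CAISI or the labs are known to run: it measures how many of a test question's word-level n-grams also turn up in a sample of training text. The function names, the 8-word window and the 0.5 threshold are all my own assumptions, chosen only to make the idea concrete.

```python
from __future__ import annotations

from typing import Iterable


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Set of word-level n-grams in text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the question's n-grams that appear in any corpus document."""
    q = ngrams(question, n)
    if not q:
        return 0.0  # question is shorter than n tokens: nothing to compare
    seen: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        seen |= ngrams(doc, n) & q
    return len(seen) / len(q)


def flag_contaminated(
    questions: Iterable[str],
    corpus_docs: Iterable[str],
    n: int = 8,           # hypothetical window size
    threshold: float = 0.5,  # hypothetical cut-off, not a standard value
) -> list[str]:
    """Questions whose n-gram overlap with the corpus sample meets the threshold."""
    docs = list(corpus_docs)
    return [q for q in questions if overlap_fraction(q, docs, n) >= threshold]
```

Note what this does not catch: paraphrased leaks, translated leaks, and the subtler problem above, where the model has seen the genre of the test rather than the test itself. A clean overlap score is a floor too.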

You cannot screenshot a confidence interval and get a press conference. Which is exactly why the confidence interval is the part that goes missing.
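For a sense of what the missing interval looks like, here is a small sketch for the simplest case: a pass rate on a benchmark where every item is scored pass or fail. It uses the Wilson score interval, which is my choice of method and not one the source attributes to any evaluator, and it assumes the items are independent trials, which real benchmark questions often are not.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% two-sided coverage.
    Assumes independent pass/fail trials (an idealisation).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)
```

A model that passes 90 of 100 items gets an interval of roughly 0.83 to 0.94 under these assumptions. "90 percent" fits on a slide; "somewhere between 83 and 94" is the honest version.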

Red-teaming finds what it thinks to look for

Red-teaming is the other half of a pre-release evaluation, and it has a structural limit that no amount of effort erases: it is a search, and a search only finds what it queries. A red team is a snapshot in time of a specific set of attacks imagined by a specific set of people, run against a specific checkpoint. It is excellent at confirming that a known failure mode exists. It is, by construction, blind to the failure mode nobody on the team thought to try. Absence of a finding is not evidence of safety; it is evidence that this team, with this time budget, on this version, did not find it. Those are different claims, and the distance between them is where every post-launch "we did test for that" press statement gets written.
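The arithmetic behind "did not find it" can be made concrete. The sketch below asks: if a team runs n probes and sees zero failures, how high could the true failure rate still be? It rests on a strong assumption that I am imposing for illustration and that real red-teaming does not satisfy: every probe is an independent draw from one fixed distribution of attacks. The function name is hypothetical.

```python
def upper_bound_zero_failures(n_trials: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the failure rate after n_trials with zero failures.

    Solves (1 - p) ** n_trials = 1 - confidence for p.
    Assumes independent, identically distributed probes (an idealisation).
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1 - confidence
    return 1 - alpha ** (1 / n_trials)
```

With 300 clean probes the bound is about 0.0099, close to the familiar "rule of three" approximation of 3/n. So even in the idealised case, a spotless report is consistent with a failure on nearly one attempt in a hundred, and the bound says nothing at all about attacks that sit outside the distribution the team sampled from.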

And capability is not static. The thing CAISI evaluates is a frozen checkpoint. The thing the public uses gets fine-tuned, wrapped in agent scaffolding, handed tools, plugged into other systems, and — crucially — jailbroken by millions of people who are more creative and far less polite than any sanctioned red team. The 2024-era work on dual-use hazards made the point in its title: benchmark early and red team often. Often. Not once, 90 days before launch, and then never again with the same rigour. A single pre-release gate measures the model at the one moment it is least like what it will become.

What slips through, in order

If you want the honest ledger of what a pre-release evaluation is unlikely to catch, here it is, roughly in order of how badly it bites:

1. Capabilities outside the chosen scope: anything that isn't cyber, bio or chem simply isn't being graded, and the gate's reassurance doesn't extend there.
2. Emergent behaviour from scaffolding and tools the lab adds after the test, which can turn a benign checkpoint into a capable agent.
3. Contaminated or recognised evaluations, where the model scores well because it has seen the genre of the test, not because it is safe.
4. Attacks no red team imagined, which by definition leave no trace in a report that says "we found nothing here."
5. Capability drift after launch: fine-tunes, jailbreaks and a public far larger and stranger than any sanctioned team.

The governance gap underneath the number

There is a structural point lurking beneath the methodology, and it is worth saying plainly. The draft executive order reported by Axios in mid-May would build a "voluntary framework" asking labs to share covered frontier models at least 90 days before public release. Voluntary. The labs decide what counts as a covered model, hand over a guardrail-stripped checkpoint, and set the relationship between the tested version and the shipped one. The evaluator measures inside a box the evaluated party helped draw. That is not a conspiracy; it is an incentive structure, and incentive structures are the most reliable predictor of where a number quietly drifts. When the test is run partly on the terms of the company being tested, the burden of proof on anyone citing the result goes up, not down.

None of this means scrap the program. It means read the footnote on the record. A pre-release evaluation is a floor, not a certificate — evidence that a specific, important class of catastrophe was searched for and, on one frozen checkpoint, not found. That is genuinely worth having. The danger is the upgrade in public language from "the government searched for these specific harms and didn't find them this time" to "the government tested it and it's safe." Those sentences are not the same sentence. The first is a measurement with error bars. The second is marketing wearing a lab coat, and it is the version that will end up on the slide.

So when the executive order lands and the first model clears the bench, ask the three questions the press release won't volunteer. Tested for what: which slice of the risk surface? Measured how: on a held-out set, or one the model might have seen? And tested when: on the checkpoint that ships, or the one that was frozen 90 days and a dozen fine-tunes ago? If the answers come with confidence intervals, trust them more. If they come as a single clean adjective, "safe", look harder. The chart that only goes up is the one to distrust, and right now the chart is labelled "cleared for release."
