
Can you even test a model before you ship it?
Washington wants frontier models on the bench 90 days before launch. Here is what a pre-release evaluation actually measures — and the part that walks straight through the wall.
A new open-weights model tops the chart every few weeks. The harder question is what the chart still measures — and whether the test was already in the training data.

Photograph: Luke Chesser / Unsplash
Every few weeks an open-weights model "tops the leaderboard," the chart goes up and to the right, and a thousand posts declare the state of the art freshly broken. Here is the question almost no one asks before reposting: topped which leaderboard, measured how, against what — and was the test already in the training data? Lately the honest answer to that last one is too often "we can't rule it out," which is a polite way of saying the record may be a memory.
This is meant to be the season open source wins. Qwen 3 shipped under Apache 2.0 with no usage strings attached; DeepSeek's V3.2 line landed openly and reportedly trades blows with closed frontier systems on reasoning suites. The numbers are genuinely good. But "good number" and "good model" are not the same claim, and the gap between them is exactly where the leaderboard era has quietly stopped meaning much. So let's read the footnote on the record.
Start with the most-cited board of them all. Chatbot Arena — now rebranded simply Arena — turned a UC Berkeley research project into a roughly $1.7 billion company in about seven months, and along the way became the de facto scoreboard that moves funding rounds, launch dates and PR cycles. The pitch is that you can't game it: real humans, blind pairwise votes, Elo scores. The reality is messier, and a 13-author paper from Cohere Labs, AI2, Princeton, Stanford, Waterloo and the University of Washington — "The Leaderboard Illusion" — spent 68 pages showing how.
The mechanism is the part worth understanding, because it isn't fraud. It's incentives. Big labs were able to test many private variants of a model and publish only the best score. The paper documents one provider testing 27 private variants before unveiling a single public model near the top. That is not measuring a model; that is measuring the maximum of 27 noisy draws and reporting it as if it were one. If you take enough shots at a target that has variance, one of them lands in the bullseye by luck alone. The leaderboard records the lucky shot and forgets the other 26.
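To see how much luck alone can buy, here is a minimal simulation sketch. It assumes, purely for illustration, 27 variants of identical true quality whose measured scores differ only by Gaussian noise; the true score of 1300 and the noise level of 10 points are made-up numbers, not figures from the paper.

```python
import random
import statistics

def best_of_n_scores(true_score, noise_sd, n_variants, trials, seed=0):
    """Compare an honest single submission with 'publish only the best of n'.

    Every variant has the same true score; only measurement noise differs.
    Returns (mean single-draw score, mean best-of-n score) across trials.
    """
    rng = random.Random(seed)
    single, best = [], []
    for _ in range(trials):
        draws = [rng.gauss(true_score, noise_sd) for _ in range(n_variants)]
        single.append(draws[0])   # what one honest submission would report
        best.append(max(draws))   # what cherry-picking the top variant reports
    return statistics.mean(single), statistics.mean(best)

if __name__ == "__main__":
    # Hypothetical numbers: true score 1300, noise of 10 points, 27 variants.
    one, top = best_of_n_scores(1300.0, 10.0, n_variants=27, trials=5000)
    print(f"single submission: {one:.1f}")
    print(f"best of 27:        {top:.1f}")
```

Under these assumptions the best-of-27 average sits well above the single-submission average even though no variant is actually better than any other. The gap is pure selection effect.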
Then there's the data asymmetry, which hits open-weights models hardest. By the paper's estimate, the top two providers each absorbed roughly 19 to 20 percent of all the prompts flowing through the arena, while 83 open-weight models combined got under 30 percent. Arena data is itself a training signal — the authors show that even a limited dose of it can lift ArenaHard scores by up to 112 percent relative. So the players with the most access can quietly optimize for the test that ranks them, and the open-weights field, the one you'd most want to win on a level board, is structurally starved of the very feedback that inflates everyone else.
If you want the abstraction made concrete, look no further than Meta's Llama 4. In April 2025, a build named "Llama-4-Maverick-03-26-Experimental" appeared near the very top of Arena with an Elo around 1417, which put it second only to Gemini 2.5 Pro and made it the top-ranked open model on the board. Then people downloaded the actual released weights, "Llama-4-Maverick-17B-128E-Instruct," and found a different animal: a model that ranked below GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 Pro on the same kind of test.
Meta conceded the leaderboard entry was "a chat optimized version we experimented with." The experimental build was chatty, formatted, emoji-flecked; the shipped one was terse. Which tells you something deflating about what Arena's blind voters reward — and it isn't reasoning. The easiest way up that board is not to be smarter; it's to win the two-second skim, with bold headings, bullet points and length that reads as effort. Arena bolted on a "style control" filter in mid-2025 precisely to dampen this, and that adjustment alone reshuffled Elo distributions enough that some models moved 20 to 40 points without changing a single weight. When a methodology tweak moves a model more than a year of research might, the metric was never measuring what the headline said.
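For a sense of scale, a 20-to-40-point gap can be translated into head-to-head win rates. The sketch below assumes the textbook Elo logistic formula with a 400-point scale; Arena's actual rating pipeline may differ in detail, and the ratings used are illustrative.

```python
def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Expected win probability of A over B under the standard Elo logistic model.

    Assumption: the classic formula with a 400-point scale, not necessarily
    the exact model any particular leaderboard fits.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))

if __name__ == "__main__":
    base = 1400.0  # hypothetical rating of the opponent
    for gap in (20, 40):
        p = elo_expected_score(base + gap, base)
        print(f"{gap}-point lead -> expected win rate {p:.3f}")
```

By this formula a 20-point lead corresponds to winning about 53 percent of blind matchups and a 40-point lead to about 56 percent. The whole reshuffle lives inside a few percentage points of voter preference.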
Human-vote boards are only half the problem. The static benchmarks (MMLU, GSM8K, GPQA Diamond, SWE-bench) are where the word "contamination" earns its keep. The definition is simple and damning: if benchmark questions, or text closely derived from them, end up in the training corpus, the model can recall the answer instead of reasoning to it. The bar goes up. The capability may not. And because the leaks live in trillion-token crawls of the open web, where these test sets are also published, nobody can fully promise they kept the exam out of the textbook.
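One common heuristic for spotting this kind of leak is word n-gram overlap between a benchmark item and a training document. The sketch below is an assumption-laden illustration, not any lab's actual decontamination pipeline: the function names and the window of 8 words are hypothetical choices, and the check catches only near-verbatim copies, not paraphrases or translations.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(benchmark_item, corpus_doc, n=8):
    """Fraction of the benchmark item's n-grams that also appear in the document.

    1.0 means every n-gram of the item was found; 0.0 means none were.
    Items shorter than n words have no n-grams and score 0.0.
    """
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & word_ngrams(corpus_doc, n)) / len(item_grams)

if __name__ == "__main__":
    # Made-up example strings, not real benchmark data.
    question = ("what is the smallest positive integer that is "
                "divisible by both six and fifteen")
    leaked = "forum post: " + question + " answer thirty"
    clean = "a recipe for sourdough bread that needs flour water salt and a long slow rise overnight"
    print(overlap_fraction(question, leaked))
    print(overlap_fraction(question, clean))
```

A high fraction flags a document for review; a low one proves little, since a reworded copy of the question would sail through.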
Here's where I have to be honest in the direction that's unfashionable, because the overconfidence cuts both ways. The cleanest study on this, the ICML 2025 paper "How Much Can We Forget about Data Contamination?", argues that contamination is not the automatic disqualifier the takes-economy assumes. The authors' finding: at the data volumes modern models train on, even 144 repetitions of a contaminated example can be effectively forgotten once training scales past roughly five times the Chinchilla-optimal token budget. A 124M-parameter model that overfit by 18 points at 2x Chinchilla was back inside the holdout's confidence interval by 15x. So a single accidental copy of a test set, drowned in trillions of fresh tokens, may genuinely wash out.
That is reassuring and incomplete in the same breath. The same paper shows overfitting rises monotonically with repetition — 144x contamination still bought 44 to 51 points of inflation in smaller regimes — and that bigger models overfit harder per exposure. Forgetting depends on seeing enough novel data afterward. Which means contamination isn't binary; it's a dose-response curve, and the dose nobody is reporting is "how many times did this exact benchmark appear in your corpus, and how late." That footnote does not exist on any leaderboard I've seen. The absence is the story.
SWE-bench Verified is the load-bearing number in the current open-weights race — DeepSeek's V3.2 line and the strongest Qwen builds are sold heavily on it. It is also the benchmark where even OpenAI has flagged training-data contamination concerns across frontier models, which is why a successor, SWE-bench Pro — multi-language, with a standardized scaffold and held-back problems — is being pushed as the more trustworthy ruler. The tell is right there: when the people who score well on a benchmark start building its replacement, they are telling you the original number is spent.
None of this means the open-weights models are bad. By any sane reading, Qwen 3.5 and the DeepSeek V3.2 family are the most capable freely downloadable systems we have ever had, and the fact that you can run the audit at all — inspect the weights, re-run the eval, check the homework — is precisely why open source deserves the benefit of the doubt that closed APIs don't. That is the irony of the leaderboard panic: the models you can actually verify are the ones being judged by the metrics you can't.
The fix is boring, which is why it keeps losing. Held-out sets that rotate. Fresh questions the model can't have memorized, like the AntiLeakBench approach of building tests from real-world knowledge dated after the training cutoff. Error bars on every bar. Decontamination reports published alongside the scores. Style-controlled human evals that reward being right over being long. None of it screenshots well. You cannot fit a confidence interval onto a launch slide and get ten thousand reposts, and so the single big number wins the news cycle every time.
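"Error bars on every bar" is cheap to do. Here is a minimal sketch of a percentile bootstrap over per-question results, assuming a benchmark scored as simple right-or-wrong per item; the 200-question, 70 percent example is invented for illustration, and the function name is hypothetical.

```python
import random

def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for benchmark accuracy.

    correct: list of 1/0 flags, one per benchmark question.
    Returns (observed accuracy, lower bound, upper bound).
    """
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample the questions with replacement and rescore.
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper

if __name__ == "__main__":
    # Hypothetical run: 140 correct answers out of 200 questions.
    results = [1] * 140 + [0] * 60
    acc, lo, hi = bootstrap_accuracy_ci(results)
    print(f"accuracy {acc:.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```

On a test this small the interval spans several points in each direction, which is often wider than the margin separating the models at the top of a chart.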
So when the next open-weights release "tops the leaderboard" — and there will be one before you finish your coffee — do the model the courtesy of distrust. Ask compared to what, measured how, and whether the test was in the training data. The release that answers all three without flinching is the one that earned its chart. Everyone else is showing you a memory and calling it a mind.
