Benchmarks

A panel of AIs beat the best model on 100 questions. Which 100, and graded by whom?

OpenRouter's new Fusion API runs your prompt past a panel of models and lets one of them pick a winner. The mechanism is real, and old. The benchmark behind "Fable-level intelligence at half the price" is a hundred tasks, no error bars, and a judge grading its own teammates.

Phuong Nguyen·Jun 16·8 min

The line going around this week is that you can now rent "Fable-level intelligence at half the price." On June 13, OpenRouter shipped Fusion, an API that takes your prompt, fans it out to a panel of language models at once, and has one of them act as a judge that synthesizes the best answer out of the pile. The number doing the persuading: on a set of 100 research tasks, Fusion reportedly beat both GPT-5.5 and Claude Opus 4.8, and a budget configuration came in near Claude Fable 5. It is a genuinely clever product, and that is exactly why it is worth slowing down before the screenshot goes any further. Topped which 100 tasks, measured how, graded by whom — and where, on the entire claim, are the error bars?

Let me describe what Fusion actually does, because the mechanism is real and deserves a fair hearing before I start pulling on threads. You send a prompt. In the panel stage, up to eight models answer it in parallel, each able to run its own web searches. In the judge stage, a designated model reads all of those answers and produces a structured breakdown — OpenRouter's own documentation lists the fields: consensus (the points most models agree on, treated as higher-confidence), contradictions, partial coverage, unique insights from individual models, and blind spots none of them caught. Then a final model writes the answer using that breakdown. There are two presets, a strong one and a cheap one. This is a sensible design. I want to be clear about that up front, because the rest of this column is going to be unkind to a sentence, not to the engineering.
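For readers who think in code, the shape of that pipeline can be sketched in a few lines. This is a toy under loud assumptions, not OpenRouter's API: the "models" are stub functions returning canned claims, the judge merely counts how many panelists made each claim, and the key names are my own spelling of three of the documented fields. Contradictions and blind spots are left out because spotting them takes an actual model's judgment, not a counter.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical stand-in for a model call: prompt in, set of claims out.
Model = Callable[[str], set[str]]


def run_panel(prompt: str, panel: dict[str, Model]) -> dict[str, set[str]]:
    """Panel stage: every model answers the same prompt in parallel."""
    with ThreadPoolExecutor(max_workers=len(panel)) as pool:
        futures = {name: pool.submit(model, prompt) for name, model in panel.items()}
        return {name: future.result() for name, future in futures.items()}


def judge(answers: dict[str, set[str]]) -> dict[str, list[str]]:
    """Judge stage (toy): bucket each claim by how many panelists made it."""
    counts = Counter(claim for claims in answers.values() for claim in claims)
    majority = len(answers) / 2
    return {
        "consensus": sorted(c for c, k in counts.items() if k > majority),
        "partial_coverage": sorted(c for c, k in counts.items() if 1 < k <= majority),
        "unique_insights": sorted(c for c, k in counts.items() if k == 1),
    }


def synthesize(breakdown: dict[str, list[str]]) -> str:
    """Final stage (toy): lead with consensus, flag everything else for checking."""
    lines = [f"Likely: {c}" for c in breakdown["consensus"]]
    lines += [
        f"Check: {c}"
        for c in breakdown["partial_coverage"] + breakdown["unique_insights"]
    ]
    return "\n".join(lines)


# Three stub "models" with canned claims, purely for illustration.
panel = {
    "model_a": lambda prompt: {"X is true", "Y is true"},
    "model_b": lambda prompt: {"X is true", "Z is true"},
    "model_c": lambda prompt: {"X is true", "Y is true"},
}

breakdown = judge(run_panel("toy prompt", panel))
print(synthesize(breakdown))
```

Even in the toy, the judge is the only place where "agreement" gets converted into "confidence," which is why the identity of the judge matters so much later in this column.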

Compared to what — and compared to when

My first question about any benchmark is "compared to what," and my second, which people forget, is "compared to when." Because the idea underneath Fusion is not new. Getting several models to answer and then aggregating or having them deliberate is a technique with a literature: mixture-of-agents work from 2024 showed that pooling the outputs of several open models could match or beat a single stronger one; self-consistency, debate, and best-of-n sampling all live in the same neighborhood. The finding that a committee beats an individual on certain tasks is one of the more robust results in the field. So the honest framing of Fusion is not "new capability discovered." It is "known technique, productized behind one tidy API." That is a real contribution — plumbing matters — but it is a different and smaller claim than the headline, and the gap between them is where I start reading footnotes.

Here is the first footnote, and it is a big one. OpenRouter's documentation — the pages that tell you how Fusion works — contains no benchmark and no pricing claim at all. The "beat GPT-5.5 and Opus 4.8 on 100 research tasks" figure and the "half the price" line live in the launch announcement and the coverage that repeated it, not in the technical docs. That is not damning by itself; launch posts are where you put your best number. But when the impressive figure lives in the marketing and the method lives nowhere, the figure is a marketing artifact until proven otherwise, and it should be read with the skepticism you would bring to any other advertisement.

A hundred tasks, no error bars

Now the number itself. A hundred research tasks. Start with the sample size: 100 is small for a capability claim, and nobody has published the variance, so we cannot tell whether "beat Opus 4.8" means won 70 of 100 or won 51 of 100. Those are wildly different results and the headline renders them identically. A difference that would vanish under a confidence interval is not a difference; it is a coin landing heads slightly more than tails on a Tuesday. Without error bars, "surpassed" is a direction, not a magnitude, and direction without magnitude is how you turn noise into a press release.
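A quick sketch makes the 51-versus-70 point concrete. It assumes, purely for illustration, that each of the 100 tasks is an independent win or loss with no ties, which the launch material does not confirm, and it uses the standard Wilson score interval for a binomial proportion. The win counts are the hypothetical ones from the paragraph above, not published results.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate; z = 1.96 gives roughly 95% coverage."""
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical outcomes: both would be reported as "beat the other model".
for wins in (51, 70):
    lo, hi = wilson_interval(wins, 100)
    verdict = "includes a tie" if lo <= 0.5 <= hi else "excludes a tie"
    print(f"{wins}/100 wins: 95% interval [{lo:.3f}, {hi:.3f}] -> {verdict}")
```

At 51 wins the interval comfortably contains 0.5, so the result cannot be told apart from a coin flip; at 70 it does not. A headline that says only "beat" cannot tell you which of the two you are looking at.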

Then ask who wrote the tasks and what "research tasks" even means. Open-ended research questions are precisely the category that is hardest to grade objectively — there is rarely one right answer, so scoring collapses into a judgment about which response is better written, more thorough, more confident-sounding. That is fertile ground for a grader's preferences to masquerade as a measurement of quality. If the 100 tasks were chosen by the same people selling the product, the selection itself is a free parameter: you can almost always find a hundred prompts on which your system looks good. The fix is boring — a held-out set, written by someone with no stake, released so others can re-run it — and the boring fix is exactly what is missing.

But the detail that should stop you cold is the judge. The thing scoring these outputs is itself a language model, and language-model judges have well-documented biases: they reward longer answers, they are swayed by position and formatting, and — most relevant here — they show self-preference, rating outputs from their own model family more highly. Fusion's default judge is Claude Opus, and its default panel includes Claude models. If a Claude-class model is also what graded the launch benchmark, then what the 100-task result measures is not "which answer is correct" but "which answer the judge liked," and the judge has a relative. I am not alleging that happened; I am saying the launch material does not rule it out, and in evaluation, what you cannot rule out you must assume. A test where the referee plays for one of the teams is not a test. It is a scrimmage with a scoreboard.

Half of what, exactly

"Half the price" deserves the same treatment, because on its face it cannot be literally true. Running three to eight models in parallel, plus a judge pass, plus a final synthesis, is strictly more tokens and more compute than calling one model once. Fusion is not cheaper than a single model in absolute terms; it is more expensive than most of them. So the claim has to mean something narrower: for the specific tasks where a panel of cheaper models matches a frontier model's quality, the panel costs roughly half what a single frontier call would cost. That is a real and useful proposition (cheaper components ganging up to equal an expensive one), but it is conditional on the match holding, and the match is exactly the thing the soft benchmark has not established. Strip the condition and "half the price" becomes a promise the architecture cannot keep in the general case.

And the price you pay is not only in tokens. A panel of up to eight parallel calls, followed by a judge and a synthesis step, is slower than one call, and you are billed for every model in the panel whether or not its answer survives the judge. For a throwaway prompt that is pure waste, and OpenRouter says as much, recommending Fusion only "when the cost of being wrong outweighs the cost of a few extra completions." That is an honest caveat, and it quietly contradicts the headline: a tool you are told to reserve for high-stakes prompts is not a general-purpose way to get frontier quality for half price. It is an expensive instrument for the cases where expense is justified.
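The arithmetic is easy to sketch. Every price, token count, and latency below is invented for illustration; none of it comes from OpenRouter or any real price list, and the `Call` class and both helper functions are hypothetical names of my own. The only structural assumptions are the ones in the text: each panelist is billed, the panel runs in parallel, and the judge has to wait for the slowest answer.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One model call. All numbers passed in below are made up."""
    input_tokens: int
    output_tokens: int
    usd_per_m_input: float   # price per million input tokens
    usd_per_m_output: float  # price per million output tokens
    latency_s: float         # seconds to complete

    def cost(self) -> float:
        return (self.input_tokens * self.usd_per_m_input
                + self.output_tokens * self.usd_per_m_output) / 1_000_000


def pipeline_cost(panel: list[Call], judge: Call, synthesis: Call) -> float:
    # Every panelist is billed, whether or not its answer survives the judge.
    return sum(c.cost() for c in panel) + judge.cost() + synthesis.cost()


def pipeline_latency(panel: list[Call], judge: Call, synthesis: Call) -> float:
    # Panelists run in parallel, so the slowest one gates the judge.
    return max(c.latency_s for c in panel) + judge.latency_s + synthesis.latency_s


# Baseline: a single call to a pricey frontier model.
frontier = Call(1_000, 1_000, 10.0, 50.0, latency_s=8.0)

# Budget configuration: three cheap panelists, a cheap judge, a cheap synthesis.
cheap_panel = [Call(1_000, 1_000, 0.5, 2.0, latency_s=t) for t in (4.0, 6.0, 9.0)]
cheap_judge = Call(4_000, 800, 1.0, 4.0, latency_s=5.0)
cheap_synth = Call(1_800, 1_000, 1.0, 4.0, latency_s=6.0)

# Strong configuration: the same shape, but every stage at frontier prices.
strong_panel = [frontier] * 3
strong_judge = Call(4_000, 800, 10.0, 50.0, latency_s=8.0)
strong_synth = Call(1_800, 1_000, 10.0, 50.0, latency_s=8.0)

cheap_total = pipeline_cost(cheap_panel, cheap_judge, cheap_synth)
strong_total = pipeline_cost(strong_panel, strong_judge, strong_synth)

rows = [
    ("single frontier call", frontier.cost(), frontier.latency_s),
    ("cheap panel", cheap_total, pipeline_latency(cheap_panel, cheap_judge, cheap_synth)),
    ("frontier panel", strong_total, pipeline_latency(strong_panel, strong_judge, strong_synth)),
]
for label, usd, seconds in rows:
    print(f"{label}: ${usd:.4f}, {seconds:.1f}s")
```

Under these made-up numbers the cheap panel does come in below half the price of the single frontier call, the all-frontier panel costs several times more, and both are slower than the single call. Change the prices and the ratios move, but the structure does not: "half the price" is a statement about the budget configuration only, and it is worth something only if that configuration's answers actually match the frontier model's.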

Sample size: 100 tasks, with no reported variance — a 51–49 split and a 70–30 split would produce the identical "surpassed" headline.
Grading: an LLM judge (default Claude Opus) scoring a panel that includes Claude models invites self-preference bias; the launch material does not say who graded.
Task selection: "research tasks" are hard to score objectively and easy to cherry-pick; no held-out, independently written set has been published.
Cost: a panel plus judge plus synthesis is more compute than one model, so "half the price" can only hold on the subset of tasks where the ensemble matches a frontier model — latency and full panel billing included.

So is the result significant, or merely impressive? Impressive is a single number on a launch slide. Significant is a public task list, fresh held-out questions, a grader from outside the panel's own family, and a confidence interval that does not straddle a tie. We have the number and none of the rest. That does not make Fusion bad; I think the deliberation design is good and the gains from ensembling are well-supported elsewhere. It makes the specific claim unfalsifiable as published, which is a different verdict from false, and an important one to keep separate. The absence of homework is not proof of cheating. It is just the absence of homework, and you should price the claim accordingly: at roughly nothing until the method shows up.

I will end by crediting the honest part, because there is one and it deserves saying. Fusion's judge output — surfacing consensus, contradictions, and the blind spots none of the models covered, as structured data you can read — is a genuinely good idea, because it shows you the disagreement instead of laundering it into one smooth answer. That is the transparent instinct. The headline that flattened all of it into "Fable-level intelligence at half the price" is the opposite instinct, and the two came out of the same company on the same day. Use the tool if the disagreement map is worth the bill. Just don't repost the number. If the chart only goes up, look harder; if the price only goes down, look harder still at what you are paying in tokens, in latency, and in the quiet assumption that the model grading the test was not rooting for its own side.