Frontier models

The model that answers you is not always the model you asked

Anthropic's Claude Fable 5 ships with a gate inside: some questions are quietly handed to a weaker model, some answers are degraded without a label, and the full version is reserved for the vetted. The mechanism deserves more scrutiny than the launch.

Amy Mercer·Jun 11·9 min

Anthropic's announcement artwork for Claude Fable 5 and Claude Mythos 5

Image: Anthropic

When you send a question to Claude Fable 5, the newest and most capable model Anthropic has ever sold to the public, something reads it before the model you paid for does. If that something decides your question belongs to one of a handful of forbidden domains — offensive cybersecurity, certain biology, certain chemistry, techniques for extracting a model's knowledge into a copy — your prompt is handed to Claude Opus 4.8, an older and less capable system, which answers in Fable's place. Anthropic says this happens in fewer than 5% of sessions. The reply arrives in the same window, under the same name.

A model that varies its competence depending on what you ask is not a new idea; every safety-trained system refuses some things. What is new, in the release Anthropic shipped on June 9, is the architecture of the refusal. Fable 5 does not say no. It substitutes. And in one category of cases — described by Anthropic itself, in the fine print of its launch materials — the substitution is invisible by design.

The launch deserves to be described accurately before it is argued about, because the details are load-bearing. Anthropic released two models this week that are, by its own account, the same underlying system. Claude Mythos 5 — the unrestricted version — goes only to organizations the company has already vetted: the critical-infrastructure defenders in its Glasswing program, with cyber safeguards lifted, and soon a set of approved biology researchers, with biology and chemistry safeguards lifted. Claude Fable 5 — the same engine with the gate installed — is what everyone else gets. It costs $10 per million input tokens and $50 per million output tokens, double the price of Opus 4.8, and is included in paid subscription plans until June 22, after which it draws down usage credits. The company describes Fable as Mythos-class intelligence "with additional safety measures for dual-use capabilities." That description is accurate. It is also doing a great deal of quiet work.

The commercial choreography deserves a sentence of its own before the safety machinery does. A capability priced at exactly double the previous flagship, bundled into paid plans for thirteen days and metered thereafter, is a public estimate of two private quantities: what the model costs Anthropic to run, and what the company believes proximity to the frontier is worth before the next rival release resets the market. That is not a utility's pricing. It is the pricing of a scarce good, which is worth noticing, because scarcity is also the premise of the safety architecture. The commercial tiers and the safety tiers are different gates on the same asset, and only one of them is described in the press release as a moral position.

The gate, mechanically

Three separate mechanisms ship inside Fable 5, and they should not be discussed as one, because they make different promises and fail in different ways.

The first is the visible reroute. Queries classified as touching cyberattack capability, biological or chemical risk, or *distillation* — the technique of training a cheaper model to imitate an expensive one by harvesting its outputs — are answered by Opus 4.8 instead. Anthropic reports the reroute fires in under 5% of sessions, and that at least 95% of Fable sessions run entirely on Fable's own responses. Note the provenance of those numbers: they are the vendor's, measured by the vendor's classifier, on traffic only the vendor can see. They may well be correct. They are not, at present, checkable by anyone outside the company, and a classifier that decides what counts as a cybersecurity question is itself a model with a false-positive rate nobody outside has measured.

The second mechanism is the one that has made researchers angry, and on the evidence they are right to be precise about why. For requests Anthropic's systems read as cutting-edge AI development work — "building the infrastructure used to train large AI models," in the company's own description — Fable 5 does not reroute. It answers itself, with what the company calls interventions to limit its effectiveness. The response is plausible, fluent, and deliberately worse. Anthropic says this affects roughly 0.03% of traffic. Unlike the reroute, the company states plainly that this degradation is not visible to the user.

The third is a change to the data bargain. All Fable and Mythos traffic is now retained for thirty days — including for enterprise customers who had previously negotiated zero-retention agreements — justified as a defense against novel jailbreaks. Whatever one thinks of the justification, the asymmetry is worth stating: the customers lost a contractual privacy term, and the company gained a corpus of exactly the adversarial prompts it needs to tune the gate.

The abridged edition

One analogy, used carefully and then retired. Imagine a library that, for certain readers and certain subjects, hands over an abridged edition — same cover, same spine, no marking anywhere on the book. The librarian may have excellent reasons. The policy may even be right. But every reader who walks out believes they have read the book, and some of them will go on to cite it. The harm of a silent abridgment is not only what was removed; it is that the removal poisons the reader's confidence in everything that remained.

This is the specific complaint of the researchers who spent the past two days documenting Fable 5's behavior, and it is worth quoting them, because the objection is more technical than the phrase "AI safety backlash" suggests. Nathan Lambert of the Allen Institute for AI: "To have my access to the cutting edge models for my work rug pulled in an under the table fashion is appalling." Dean Ball of the Foundation for American Innovation called it a "secret sabotage" policy. Jeremy Howard of Fast.ai noted the competitive shape of the thing: the company building frontier models grants itself full capability while quietly limiting the assistance available to anyone else building them. Behnam Neyshabur, formerly of Anthropic, argued the restriction is a net negative outright. Anthropic's Dianne Na Penn, asked about the criticism, said the goal was balancing "frontier performance" with "the right guardrails in place to make it accessible, and generally in a safe manner." The company declined to elaborate further.

A benchmark is a claim about a model that holds still. Fable 5, by design, does not hold still.

What the evaluations can no longer tell you

Set aside, for a moment, whether the gate is justified — there is a serious argument that it is, and we will get to it. Consider first what silent capability variation does to measurement, because measurement is the part of this industry that was already in poor health.

A benchmark score is a claim about a fixed object: this model, these weights, this behavior. Fable 5's launch materials carry impressive third-party numbers — a first-ever 90% on one analytics firm's core benchmark, top marks from early partners on app generation and tool use. Those numbers were presumably produced on traffic the classifier waved through. But a developer evaluating Fable 5 for their own workload now faces a question no leaderboard can answer: is the model I am testing the model I will get? If the work sits anywhere near the gated categories — and "infrastructure used to train large AI models" describes a meaningful fraction of serious machine-learning engineering — the honest answer is: you cannot know from the outside. The 0.03% figure, if accurate, says the degradation is rare. It does not say where it lands, and for a research lab it lands, by construction, on exactly the queries that matter most to them.
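The gap between "rare overall" and "rare for you" is simple arithmetic, and a rough sketch makes it concrete. The 0.03% figure is Anthropic's own and unverified; everything else below is an assumption made for illustration: the function name, the hypothetical shares of traffic, and above all the premise that every degraded reply falls inside a single category of work.

```python
# Illustrative arithmetic only. OVERALL_RATE is Anthropic's unverified figure;
# the workload shares below are hypothetical, not measured.
OVERALL_RATE = 0.0003  # 0.03% of all traffic


def rate_within_workload(overall_rate: float, workload_share: float) -> float:
    """Fraction of a workload's queries that are degraded, assuming every
    degraded reply lands inside that workload (an assumption, not a finding)."""
    if not 0 < workload_share <= 1:
        raise ValueError("workload_share must be in (0, 1]")
    if not 0 <= overall_rate <= workload_share:
        raise ValueError("overall_rate cannot exceed workload_share")
    return overall_rate / workload_share


if __name__ == "__main__":
    # Hypothetical shares of total traffic taken up by the affected kind of work.
    for share in (0.10, 0.01, 0.001):
        rate = rate_within_workload(OVERALL_RATE, share)
        print(f"workload share {share:.1%} -> degraded within workload {rate:.1%}")
```

Under those assumptions, if the affected kind of work made up one-thousandth of all traffic, the same 0.03% would amount to roughly three in ten of its queries. The real share is not public, which is the point: the headline rate alone cannot tell a lab how exposed it is.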

Anthropic's evidence for the gate's robustness is a bug bounty: more than 1,000 hours of external testing without a universal jailbreak. That is a real result and worth crediting. It is also worth being precise about what it establishes — that no one found a single key that opens every door. It does not establish the false-positive rate on legitimate work, which is the number the critics are actually asking about, and which only Anthropic can currently measure.

There is a version of this launch that would have cost Anthropic little and answered most of the measurement objection: a per-response disclosure, a flag in the API metadata of the kind provenance standards already attach to generated images, indicating that a reply was rerouted or degraded. The company clearly weighed visibility, because it chose it for one mechanism and declined it for the other, and the asymmetry tracks the commercial stakes more closely than the safety ones: the reroute, which mostly inconveniences would-be attackers, is acknowledged; the degradation, which touches competitors and researchers, is silent. There may be a defensible rationale: a visible flag is also an oracle for mapping the gate's edges, probe by probe. But the rationale has not been published, and an unpublished rationale for an invisible intervention is precisely the kind of thing outside auditors exist to check.
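To make the proposal concrete, here is a minimal sketch of what a per-response disclosure could look like from a developer's side. Nothing in it is Anthropic's API: the `ResponseProvenance` record, its field names, the `is_full_strength` helper and the model identifier strings are all invented for illustration, and no such signal is exposed today for the degraded case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseProvenance:
    """Hypothetical per-response disclosure; no field here is a real API field."""
    requested_model: str  # the model the caller asked for
    served_by: str        # the model that actually produced the reply
    degraded: bool        # True if effectiveness was deliberately limited

    @property
    def rerouted(self) -> bool:
        # A reroute is any reply produced by a model other than the one requested.
        return self.served_by != self.requested_model


def is_full_strength(p: ResponseProvenance) -> bool:
    """True only when the requested model answered without intervention."""
    return not p.rerouted and not p.degraded


if __name__ == "__main__":
    # Placeholder identifiers, not real model IDs.
    replies = [
        ResponseProvenance("fable-5", "fable-5", degraded=False),
        ResponseProvenance("fable-5", "opus-4.8", degraded=False),  # rerouted
        ResponseProvenance("fable-5", "fable-5", degraded=True),    # degraded
    ]
    kept = [p for p in replies if is_full_strength(p)]
    print(f"{len(kept)} of {len(replies)} replies ran at full strength")
```

With a record like this, an evaluator could drop or separately report any reply that was not full strength. The same field would also tell a prober exactly where the gate sits, which is the trade-off a published rationale would need to address.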

From safety policy to clearance system

The strongest version of Anthropic's case goes like this. The company has been arguing for months, with published evidence from its Glasswing program, that frontier models can now find serious vulnerabilities in critical infrastructure at scale. If you believe that — and the company's own threat reports are among the better evidence in public — then shipping the same capability to anonymous API keys is genuinely reckless, and some gate is the responsible engineering choice. The reroute, on this view, is what taking your own risk assessments seriously looks like. There is also a business-honest reading of the distillation safeguard: Anthropic is reported to be heading for the public markets, and a model that will cheerfully train its own competitors is, among other things, a leaking asset.

But notice what the architecture adds up to, because it is more novel than any single component. Capability is now tiered by institutional identity: full strength for Glasswing partners and approved researchers, gated strength for the public, with a "broader trusted access program" promised but not yet specified. Access to the frontier is granted not by purchase but by vetting — and the vetting is performed by the vendor, against criteria the vendor has not published, with no appeal process anyone has described. That is not a content policy. It is a clearance system, operated by a private company, for access to what the company itself describes as the most capable cognitive tool it has ever built. Export controls at least emerge from a process with elections somewhere upstream of it. This gate's legislature, judiciary and customs office are the same trust-and-safety team.

To be clear about confidence: that the gate exists, what it covers, and that part of it is invisible are all established by Anthropic's own documentation. That the invisible part touches only 0.03% of traffic is the company's unverified figure. Whether the gate meaningfully slows a determined bad actor, and how often it silently shortchanges a legitimate one, are open questions — and they are open in both directions.

The precedent is the product

Anthropic is not the only laboratory heading this way; rivals have their own restricted-capability variants for sensitive domains, and every frontier lab now maintains some machinery for deciding who gets what. What Anthropic shipped this week is simply the most explicit version: the same weights, two names, and an admission — unusually candid, if you read it closely — that the public will sometimes get answers engineered to be worse, without being told when.

The company would say, fairly, that it has disclosed the policy even if the model does not disclose each instance. The researchers would say, also fairly, that a disclosure you cannot detect in practice is a strange kind of transparency. Both are true, which is what makes this launch the precedent that matters. The industry has spent three years arguing about whether models are too capable to release. Anthropic has quietly moved to the next question: not whether to release capability, but who gets it at full strength, on whose say-so, and how anyone outside would know. That question will outlast this model, and at present exactly one party in it can see the gate from both sides.
