Models & Safety

Two AI models are coming back online. The condition for their return cannot be met.

Anthropic says Fable 5 and Mythos 5 return in days, once a jailbreak is "fixed." Jailbreak resistance is not the kind of thing that has a zero — which makes "fixed" a decision, not a measurement.

Amy Mercer·Jun 20·9 min

Anthropic's published statement on the US government directive regarding its Fable 5 and Mythos 5 models

Image: Anthropic

A jailbreak is not a hole in a wall. It is closer to a coin that lands the wrong way more often than the manufacturer would like. You cannot point to the place where a language model's safety training fails the way you can point to an unlocked door; you can only run the model against an adversary many times and count how often the guardrail holds. The number is never one. It was never going to be one. And for the past week, the worldwide availability of two of the most capable AI models ever shipped has rested on whether that number can be made to equal one — a question that, stated precisely, has no answer.

On June 9, Anthropic released Fable 5 and Mythos 5, the new top of its model line. Three days later, on the evening of June 12, a U.S. export-control directive arrived, and within hours both models went dark — not in Korea, not in China, but everywhere, for everyone. On June 18, at the opening of the company's new Seoul office, Anthropic's managing director for international, Chris Ciauri, told reporters he was "very confident that in the coming days, the models will become available again." The interesting word in that sentence is not "confident." It is "when."

Why a Korean carrier turned off a model in California

The order did not begin as a statement about Anthropic. It began, by the available accounts, as a statement about one of its customers. SK Telecom, South Korea's largest carrier and — by its own disclosure — a roughly $100 million investor in Anthropic, had its access to Mythos revoked days before the broad directive, after the White House identified it as a national-security concern over alleged ties to China. SK Telecom has denied those ties. I want to be careful here: the specifics of what the government found are not public, and "alleged" is doing real work in that sentence. What is not in dispute is the shape of the action — access pulled from a single named company first, and then, days later, from everyone.

The reason it became everyone is a piece of legal plumbing called a deemed export. Under U.S. export-control law, giving a controlled technology to a foreign national counts as exporting it to that person's country — even if the person is standing in California, even if they work for you. Applied to a model served from shared cloud infrastructure, a rule that bars all foreign nationals from access is not a rule you can enforce selectively. There is no setting for "everyone except the people the order names." Anthropic's only mechanically available form of compliance was to switch the models off for the whole world, including its own non-citizen employees. So it did. The blunt instrument was not Anthropic's choice of response; it was the only response the instrument permitted.

The thing they want fixed

Underneath the export language is a security finding, and the finding is where the technical story starts to matter. The government's stated concern is a jailbreak: a technique — reportedly some combination of Unicode wrapping, roleplay framing, and deliberately confusing context — that coaxed Fable 5 past the classifiers meant to wall off Mythos's most sensitive cybersecurity capabilities. Anthropic has described the vulnerability as narrow, affecting a specific scenario rather than collapsing the model's defenses across the board. Amazon, Anthropic's largest investor and its cloud partner, is reported to have found the technique and taken it directly to the Commerce Department rather than through the usual coordinated-disclosure channel. Hold that detail; it will matter later.

On June 13, David Sacks, who co-chairs the President's Council of Advisors on Science and Technology, gave the administration's version on the record: "The Admin asked Dario to fix the jailbreak or de-deploy the model. Dario refused." Framed that way, it sounds like a company declining a reasonable safety request. The framing is clean. The engineering underneath it is not.

Jailbreak resistance does not have a zero

Here is the part most coverage skips because it is hard, not because it is unimportant. A model's resistance to jailbreaks is not a feature that is either present or absent, like a lock that is either engaged or not. It is a statistical property of an enormous, partly inscrutable function. You measure it by attacking the model — thousands of times, with thousands of phrasings — and reporting how often the safety behavior survives. You can push that survival rate higher with more training, better classifiers, harder red-teaming. You cannot push it to a hundred percent, because the space of possible inputs is effectively infinite and the next phrasing nobody tried is always out there. Every serious lab knows this. Anthropic has said it plainly: perfect jailbreak resistance is not possible.

"Fix the jailbreak" sounds like "patch the bug." It is closer to "eliminate the possibility of a clever sentence."

This is why Anthropic's sharpest objection is not a complaint about fairness but a statement about the standard itself. The company argues the same class of jailbreak applies to other deployed frontier models — it named OpenAI's GPT-5.5 — none of which face an export restriction, and that, in its words, "if this standard was applied across the industry, we believe it would essentially halt all new model deployments for all frontier model providers." Read that not as special pleading but as a claim about logic. If the bar for keeping a model online is "no jailbreak exists," then no model clears it, because for every model some jailbreak exists. A rule that, applied evenly, forbids everything is not a safety rule. It is a discretion engine: it lets an authority switch off whichever model it chooses and point to a condition that is technically always true.

What a patch will, and will not, do

Picture the fix on its merits. Anthropic's engineers will take the reported technique — the Unicode wrapping, the roleplay scaffolding, the context confusion — and train against it, feeding the model and its classifiers thousands of variations until that specific attack reliably fails. This works, in the narrow sense. The exact prompts that embarrassed the model last week will stop working. It is also, in the broad sense, the opening move in a game with no last move. Adversarial robustness in language models behaves like a balloon: press the failure rate down in one region of the input space and it bulges somewhere you were not testing. The history of the field is a decade of this — a defense published, an attack that defeats it published a few months later, a better defense after that. There is no reason to expect this round to end the pattern, and good reason, rooted in the math, to expect it not to.

So a patch buys two things, and it is worth being honest about both. It raises the cost and lowers the frequency of the attack, which is a real safety improvement and not nothing. And it produces a moment — a release note, a date — at which someone can say the word "fixed" and mean "we addressed the specific thing you showed us." The government gets to point to that moment as compliance; the company gets its models back. Everyone has an interest in treating "fixed" as a binary even though the property it describes is a dial. The danger is not the patch. The danger is mistaking the press release for the physics.

What "when it's fixed" actually decides

Which brings us back to Ciauri's "when." The administration has indicated the directive lifts once Anthropic patches the jailbreak, and that it wants all of this resolved as soon as possible. That sounds like a clear technical finish line. It is not. Because jailbreak resistance has no zero, "patched" cannot mean "no jailbreak remains." It can only mean "patched enough" — and "enough" is a judgment, not a measurement. Someone has to look at a model that will still fail some adversarial prompt some fraction of the time and declare it acceptable. The deep change this episode marks is who that someone is. It is no longer only the lab's safety team weighing a risk against a release. It is a government office deciding, on a standard it has not published, when a privately built model may speak to the world again.

I can state the asymmetry with some confidence, because it does not depend on the classified details. Before this month, a frontier model went offline when its maker chose to take it down. After this month, one went offline — globally, in hours — on a security finding the public cannot read, triggered by a concern about a single customer, enforced through a mechanism built for physical goods, and conditioned for return on a standard that cannot be satisfied in the literal terms it was stated. Each link in that chain is individually defensible. The chain is the story.

The detail to keep watching

Remember the disclosure path. The vulnerability did not surface through a researcher's coordinated report or the model-maker's own red team; it reportedly traveled from Anthropic's largest investor straight to the Commerce Department. I am not alleging a motive, and I would caution anyone against assuming one — there are entirely ordinary reasons a cloud partner might route a serious finding to the government quickly. But it is worth noticing, calmly, that the machinery now exists for a security finding to become an export action against a specific model with extraordinary speed, and that the parties who can pull that lever include the model-maker's own commercial partners and rivals. When the cost of taking a competitor's model offline is a credible jailbreak and a phone call, the jailbreak stops being only a safety question and becomes a competitive instrument. That is not a prediction. It is a description of the incentives now in place.

The immediate consequences are smaller and more mundane, which is its own kind of evidence. Korea — Anthropic's twelfth-largest market, and the venue for this week's confident reassurances — spent the week with its enterprises cut off from the models even as the company opened an office there and signed up new ones. Affected customers were offered refunds, with a cutoff this weekend. A frontier-model launch became a billing-and-trials cleanup. Whatever else this was, it was not the orderly safety process the official framing describes.

So Fable 5 and Mythos 5 will almost certainly come back, probably within days, probably with a patch that raises the jailbreak's difficulty and changes nothing about its existence. The models will be, in every measurable sense, the same kind of object they were on June 11: extraordinarily capable, imperfectly steerable, and impossible to make perfectly safe. What will have changed is not the technology. It is the answer to a question we did not used to have to ask. Not "is the model safe" — no model is safe in the absolute sense the order demands. The question now is who gets to say it is, on what evidence, and whether "fixed" will ever again mean anything more precise than "allowed."

Two AI models are coming back online. The condition for their return cannot be met.

Why a Korean carrier turned off a model in California

The thing they want fixed

Jailbreak resistance does not have a zero

What a patch will, and will not, do

What "when it's fixed" actually decides

The detail to keep watching

References

Read next

Google switched off the Gemini CLI yesterday. I spent the week on its replacement.

A panel of AIs beat the best model on 100 questions. Which 100, and graded by whom?

The agents just got a wallet and a place to sleep. I have questions about both.

One email. Every Friday.