Grok 4.5 might beat Claude Opus. There is no way for you to check.

Elon Musk says his newest model rivals the best in the world. The test that would settle it is being run by the company that owns the model — and that is starting to look less like an exception than a direction.

Amy Mercer·Jul 5·8 min

Image: xAI logo via Wikimedia Commons (trademark; public domain as a simple wordmark)

A benchmark score is not a property of a model. It is a claim about a comparison, and a comparison needs someone to run it. When Elon Musk said late last month that internal evaluations show his newest model, Grok 4.5, is “close to, and potentially surpasses, Anthropic’s Claude Opus,” he was not presenting a measurement so much as announcing the outcome of one, and the useful question is not whether the number is high but who held the ruler. On the evidence available, the ruler was held by the company that makes the model, in a room no one else was allowed into, against a rival the company competes with. That does not make the claim false. It makes it, as stated, impossible to check, and those are different things worth keeping apart.

The facts, first, and their confidence levels, because the rest of this depends on them. Musk announced on the 28th of June that Grok 4.5 — built on xAI’s proprietary 1.5-trillion-parameter architecture, with additional training data drawn from the coding tool Cursor, which the Musk empire is in the process of acquiring — is undergoing internal testing at SpaceX and Tesla. He said the internal evaluations put it at or above Opus. He said xAI intends to release a new model, trained from scratch, every month through the end of the year. All of that is well attested; it comes from Musk directly and has been reported consistently. What is not attested — by anyone, anywhere — is a single independent number. As of this writing, no version of the current Grok generation has been submitted to Humanity’s Last Exam, to the public arena rankings, to Artificial Analysis, or to any other evaluator with no stake in the outcome. The claim and the absence of a way to test it arrived in the same breath.

What an evaluation is for

It helps to be plain about what a benchmark is supposed to do, because the word gets used as though it were self-evidently trustworthy and it is not. An evaluation is an attempt to make a fuzzy quality — “how good is this model” — into something legible and comparable. You fix a set of questions, ideally ones the model has never seen, you define in advance what counts as a right answer, and you run every model through the identical gauntlet. The design matters less than one structural feature: the person scoring should not be the person being scored. That is not a nicety. It is the entire source of the number’s authority. A model developer grading its own model on its own test with its own rubric has produced a fact about its own opinion, dressed in the costume of a measurement.
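As a rough illustration only, the procedure just described can be sketched in a few lines of Python. Everything here is hypothetical: the names `Item`, `exact_match`, `evaluate`, `model_a` and `model_b` are invented for the example, the three questions are toys, and exact-match scoring is just one simple rule among many. It is not how any real evaluator or lab runs its tests.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Item:
    """One question with its right answer, fixed before any model is run."""
    question: str
    answer: str


def exact_match(prediction: str, item: Item) -> bool:
    """Hypothetical scoring rule, defined in advance: case-insensitive exact match."""
    return prediction.strip().lower() == item.answer.strip().lower()


def evaluate(
    models: Dict[str, Callable[[str], str]],
    items: List[Item],
    scorer: Callable[[str, Item], bool] = exact_match,
) -> Dict[str, float]:
    """Run every model through the identical questions and return accuracy per model."""
    if not items:
        raise ValueError("an evaluation needs at least one question")
    scores: Dict[str, float] = {}
    for name, model in models.items():
        correct = sum(scorer(model(item.question), item) for item in items)
        scores[name] = correct / len(items)
    return scores


# Toy held-out set and two stand-in "models" (plain functions, purely illustrative).
ITEMS = [
    Item("2 + 2", "4"),
    Item("Capital of France", "Paris"),
    Item("Opposite of hot", "cold"),
]


def model_a(question: str) -> str:
    return {"2 + 2": "4", "Capital of France": "paris"}.get(question, "unknown")


def model_b(question: str) -> str:
    return {"2 + 2": "4"}.get(question, "unknown")


scores = evaluate({"model_a": model_a, "model_b": model_b}, ITEMS)
print(scores)
```

Note what the sketch cannot express. Nothing in the code records who wrote `ITEMS`, who chose `scorer`, or who pressed run. The same function returns the same kind of number whether an outside evaluator or the model's own developer calls it, which is why the independence has to come from outside the arithmetic.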

This is where the Grok claim comes apart, not because it is a lie but because of the kind of claim it is. “Internal evaluations at SpaceX and Tesla” sounds like third-party validation: two separate, serious engineering organisations, surely a tough crowd. But SpaceX, Tesla and xAI are no longer separate parties in any sense that matters here. xAI was folded into SpaceX earlier in the year; the Cursor data now feeding Grok arrived through a SpaceX acquisition; the engineers running these tests draw their pay from the same corporate parent that needs the model to look good. There is no independent party in the room because, by construction, everyone in the room is the same party. The test is real. The independence is missing. And independence was the only thing the test was ever borrowing its credibility from.

The honest version of the claim

Let me be exact about my own confidence, because it is the courtesy the story requires. I am not saying Grok 4.5 is worse than Opus. I have no way of knowing that either, and neither does anyone outside xAI, and that symmetry is precisely the point. It is entirely possible the model is as good as Musk says; xAI has shipped capable systems before, and a company throwing a reported 7.7 billion dollars of capital expenditure at the problem in a single quarter is not a company to bet against on effort. What I am saying is narrower and firmer: the specific claim “it beats Opus,” in the form it was made, cannot be confirmed or refuted from the outside, and a claim that cannot be tested is not yet knowledge. It is marketing that happens to be arithmetic-shaped.

The tell is the pattern, not the single instance. There is a standard, unglamorous path a developer takes when it has a genuinely frontier model and wants that recognised: you submit it to the evaluators you do not control, you accept the number they return, and you let the comparison happen on ground you did not choose. It is the harder route and the more convincing one, because the discomfort of a test you can’t rig is the whole reason anyone believes the result. The softer route — announce the outcome, keep the model behind glass, decline the external test, and promise a new model next month so the conversation never settles on this one — produces a headline without the exposure. Which route a lab takes tells you something about what it expects an honest test to say.

None of the individual reasons a lab gives for staying private is absurd on its own. Read together, they describe a room with no windows:

“The model is still being refined” — true of every model ever benchmarked; evaluators test snapshots precisely because nothing is ever finished.
“We test in real business scenarios, not toy benchmarks” — a fair critique of benchmarks, but real-world pilots inside your own subsidiaries are even less independent, not more.
“Competitors could learn from a public submission” — possibly, but every other frontier lab submits and survives; the information leaks both ways.
“A new model is coming next month anyway” — which means the current claim never has to withstand scrutiny before it is superseded by the next unscrutinised claim.

Why this is bigger than one model

If this were only about Grok, it would be a small story — one voluble founder making a claim ahead of the evidence, which is not new and not, by itself, alarming. It matters because it is an early, unusually blunt instance of a direction the whole field is drifting: the quiet migration of evaluation out of public view. For a few years the independent benchmark, for all its flaws — contamination, teaching to the test, gaming the arena — was a genuine commons. It let outsiders, journalists, regulators, and rival engineers argue about capability on shared evidence. That commons is thinning from two directions at once, and Grok is only the more obvious one.

The first direction is commercial: as labs vertically integrate — a model trained on a sister company’s data, tested by a sister company’s engineers, deployed first inside the same corporate group — the natural audience for an evaluation becomes internal, and the incentive to expose the model to a hostile external test weakens with every layer of ownership you add. The second direction is governmental, and it is arriving faster than most people have noticed. The White House is reportedly close to announcing voluntary standards for frontier-model releases; the most advanced OpenAI system is being rolled out under a process where the government vets access customer by customer. Put those together and the trajectory is clear enough: the place where models get judged is moving from public leaderboards into private rooms — the vendor’s room, and the state’s room. Both may have good reasons. Neither is a commons.

The analogy I’d reach for — and then retire, before it does more work than it should — is a drug trial run and scored entirely by the manufacturer, with the raw data sealed and a press release standing in for the paper. We do not accept that arrangement in medicine, not because pharmaceutical companies are uniquely dishonest but because we understood, at some cost, that the party with the most to gain cannot also be the party that certifies the result. AI has not had its version of that lesson yet. It is currently building, in plain sight, the arrangement medicine spent a century learning to distrust, and calling the output a benchmark.

The sharper question

So the interesting question is not the one the announcement invites — is Grok 4.5 really as good as Claude Opus? That question is unanswerable today and will be obsolete in a month, replaced by the same question about a model with a higher number in its name. The sharper question, the one worth carrying past this news cycle, is about who is permitted to check, and what happens to a field when the answer becomes “fewer and fewer people.” A capability claim that only its author can verify is not a fact about the model. It is a fact about the author’s confidence, and about how much of that confidence you are prepared to extend on trust.

There is a version of the coming year in which the models genuinely are extraordinary and the public simply has to take the builders’ word for it, because the tests have all moved indoors. That is a worse world than the noisy, contaminated, argued-over benchmark commons we are leaving, not a better one. It will also feel, from the outside, exactly like progress, because the numbers will keep going up and there will be no one left with standing to ask “according to whom?” Grok 4.5 might beat Claude Opus. The thing to notice is not the claim. It is that the machinery for finding out is being quietly dismantled while we admire the score.