Analysis · Frontier models

Claude Opus 4.8 can be told how hard to think. The interesting part is why that became a feature.

Anthropic's new flagship adds effort control, cheaper fast inference, and what it calls improved honesty. Two of those three are quieter, and more revealing, than any benchmark score.

Amy Mercer·Jun 5·9 min

Anthropic's Claude Opus 4.8 announcement artwork.

Image: Anthropic

Something easy to miss happened when Anthropic released Claude Opus 4.8 last week. Alongside the expected language about stronger benchmark performance, the company added a control that lets you decide how much effort the model spends on your request. You can tell it to try harder. You can tell it to try less. Stop on that for a moment, because it is a stranger sentence than it looks. We do not usually get a dial for how hard a piece of software is working. A spreadsheet does not offer to think about your sum more carefully if you ask it nicely. The fact that this control now exists, and that Anthropic chose to put it near the front of the announcement, says more about where these systems actually are than the benchmark line above it.

Three things shipped with Opus 4.8 that are worth separating out. There is effort control in the consumer product. There is a faster, cheaper mode of inference. And there is what Anthropic describes as improved honesty. The benchmark gains are real and I have no particular reason to doubt them, but they are also the least interesting part, because better scores are what every release claims. The other three tell you what problem the frontier labs think they are now solving, and the problem is no longer raw capability. It is control, cost, and trust. Two of those three are quietly revealing, and one of them is doing a great deal of work with a single word.

What 'effort' actually is

Start with the dial, because it is the most honest of the three about what is going on underneath. When you tell Opus 4.8 to spend more effort, you are not appealing to its diligence. You are spending more compute. Modern reasoning models work, in part, by generating intermediate steps before they answer, a kind of visible deliberation, and the more of that deliberation you allow, the more chances the model has to catch its own mistakes and refine its answer. Effort control is a knob on how much of that the model is allowed to do before it commits.

Put plainly, it is a quality-for-cost trade made visible. More effort means more tokens generated, which means more time and more money per answer, in exchange for a better chance of getting it right. Less effort means a faster, cheaper response that is more likely to be shallow or wrong on a hard problem. None of that is a criticism. Exposing the trade-off to the user is a sensible and, in its way, candid thing to do. But notice what it concedes. The model's quality is now openly a function of how much you are willing to spend per question. That is a different kind of product than one with a fixed capability you either have or do not. The capability is a curve, and you are being handed the slider.

This matters most in the places where people are least equipped to judge it. A developer running a benchmark can afford maximum effort and measure the result. A person asking a hard medical or legal question in the consumer app has no way to know whether the answer they got was the careful one or the cheap one, and no reliable intuition for when to turn the dial up. The control is real. The literacy to use it well is not evenly distributed, and the cost of guessing wrong is not evenly distributed either.

The cheaper answer is the strategy

The second feature is economic, and it points the same direction. Anthropic says its fast mode for Opus 4.8, where the model runs at roughly two-and-a-half times the speed, is about three times cheaper than the equivalent mode was on the previous generation. Strip out the adjectives and that is a statement about the price of an answer falling, fast, at the top of the line.

This is the same shift that Google's Jeff Dean described in an interview we covered earlier this week, viewed from the other end. The expensive, difficult work of training a frontier model happens once. The work of serving it, answering billions of individual requests, happens forever, and it is where the cost actually lives. So the competition has quietly moved from who has the most capable model to who can serve a capable-enough answer most cheaply and quickly. Effort control and a cheaper fast mode are two expressions of one idea: the product is no longer a single number on a leaderboard. It is a menu of prices for an answer, and the labs are competing on the menu.

The capability is no longer a number you either have or do not. It is a curve, and you are being handed the slider. — Amy Mercer

That is a healthier kind of competition than the benchmark arms race in some ways, because cost and speed are things a user actually feels, and they are harder to game than a test. But it also means the headline capability is becoming a less useful thing to ask about. The interesting question for a buyer is not how smart the most expensive setting is. It is how good the cheap setting is, because the cheap setting is what most requests will quietly be served by, the same way the fast, distilled models rather than the flagship are what most people end up using without noticing.

The word 'honesty' is carrying a lot

Then there is the third claim, that Opus 4.8 is more honest than its predecessors, and this is the one I want to slow all the way down on, because it is the one where the language outruns what can be checked. My instinct, inherited from a childhood of being asked 'according to whom,' is to ask what is actually being measured when a company says its model got more honest.

Here is the careful version. A language model does not have beliefs that it can then choose to state or conceal, which is what honesty means for a person. What it has is behaviour. So 'honesty' in this context is a cluster of measurable behaviours: whether the model admits uncertainty instead of inventing an answer, whether it declines to fabricate a citation, whether it avoids telling you what it senses you want to hear, whether the confidence it expresses matches how often it is actually right. That last one has a name, calibration, and it is genuinely important. A model whose stated confidence tracks its real reliability is far safer to build on than one that is fluent and wrong with equal poise. Improving these things is real work and it is worth doing.

But 'more honest' and 'honest' are different claims, and the gap between them is exactly the part the word hides. 'More honest' means the model scored better on Anthropic's honesty evaluations, on the distribution of prompts those evaluations cover. Whether that improvement holds on the prompts the evaluation did not think to include, or under adversarial pressure from a user trying to push the model off its calibration, is precisely the thing a single benchmark cannot tell you. I have watched a related claim, machine unlearning, degrade under pressure in just this way: a model that has been made to 'forget' something will often still cough it up given the right prompt, because the suppression is behavioural, not structural. I would expect honesty to have the same texture. It is a tendency that has been strengthened, not a property that has been installed.

And there is the accountability point, which is the one I keep returning to with these systems. The honesty claim is, at present, something only the vendor can audit. Anthropic measured its own model against its own definition of honesty and reported that the number went up. I am not suggesting the company is being dishonest about its honesty. I am noting that there is no independent instrument here, no held-out test run by someone with no stake in the result, and that 'we evaluated ourselves and improved' is the structure of a claim that deserves verification rather than applause. When the model confidently asserts something false to a user who trusted the honesty framing, the question of who is accountable does not have a satisfying answer yet, and the marketing of honesty quietly raises the stakes of that unanswered question.

The agents in the background

One more feature deserves a mention, because it connects to all of this. Opus 4.8 also brought what Anthropic calls dynamic workflows to Claude Code, its coding tool, aimed at letting the model break down and work through very large-scale problems with less hand-holding. This is the agentic direction the whole field is moving in, and we have written before about how coding became the proving ground for it. The capability is impressive and I do not want to wave it away.

But it raises the same question the other features do, in a sharper form. An agent that works autonomously through a large problem is an agent making a long chain of small decisions you did not individually approve, at whatever effort level it was set to, with whatever honesty its calibration affords. The more capable and more autonomous these systems get, the more the interesting questions stop being about the model in isolation and start being about the loop it runs in: what it can touch, what it can spend, what it will admit when it is unsure, and who is watching. None of those are capability questions. All of them are accountability questions, and accountability is the thing benchmarks do not measure.

What the release is really telling you

Read together, the Opus 4.8 release is a useful snapshot of where the frontier's attention has gone. The capability gains are assumed, almost boilerplate. The features the company chose to foreground are a control for how much you pay to think, a cheaper price for a fast answer, and a claim about trustworthiness that only the seller can currently verify. Those are the three axes along which a very powerful model becomes a usable, sellable product, and it is genuinely telling that they have displaced raw capability as the headline.

So the question to carry out of this is not whether Opus 4.8 is better than what came before. By the measures Anthropic chose, it almost certainly is, and those measures are not nothing. The sharper question is what it means that the two features placed in front of us are a dial for the cost of an answer and a claim about honesty we are being asked to take partly on faith. We are being handed more control and asked for more trust in the same release, and only one of those two things arrives with a number you can check yourself. The control is the part you can feel. The trust is the part worth reading the appendix for, once there is an appendix written by someone other than the vendor.

References

Machine learning

Claude Opus 4.8 can be told how hard to think. The interesting part is why that became a feature.

What 'effort' actually is

The cheaper answer is the strategy

The word 'honesty' is carrying a lot

The agents in the background

What the release is really telling you

References

Read next

Google delayed its flagship model for missing a coding target. Nobody outside Google can see the target.

A Chinese lab released the largest open model ever. The U.S. stock market scored it first.

They fit a 27-billion-parameter model on an iPhone. The compression is real. The capability is the part nobody measured.

One email. Every Friday.