
For a week I let Gemini Intelligence act before I asked it to. It saved me time, booked me a parking spot I didn't need, and taught me exactly where the trust line sits.

Photograph: Daniel Romero / Unsplash
For one week I let my phone act before I asked it to. Google calls this Gemini Intelligence, unveiled on May 12 at The Android Show: I/O Edition, and the pitch is that your devices now work “proactively to get things done throughout your day.” Not a chatbot you summon. An assistant that has already started. By Wednesday it had reserved me a parking spot near an appointment I'd already cancelled. By Friday it had genuinely saved me an afternoon. Both things are true, and the distance between them is the entire story of proactive agents right now.
I tested it the only way I trust: a real week, on a Pixel 10, with a Wear OS 7 watch on my wrist and Android Auto in the car, doing the actual unglamorous things a Tuesday is made of. Not the demo. The Tuesday.
The headline feature is multi-step automation across apps. You press the power button, describe a task, and Gemini uses what's on your screen as context and tries to finish it. Google's own examples are mundane on purpose: snag a front-row bike for a spin class, find a syllabus in Gmail and drop the listed books in a shopping cart, photograph a grocery list and have it build a delivery order. In practice, the boring tasks are exactly where it shines.
The grocery-list trick is the one I'd actually keep. I long-pressed a scrawled list in my notes app, asked Gemini to build a cart, and it did — matching brands sensibly, flagging two items it couldn't find, and stopping dead at checkout to make me confirm. That pause is the whole design. Gemini, Google says, “only acts on your command and stops the moment the task is complete,” and for anything that spends money or posts publicly it asks first. The handoff back to me is where the trust lives.
Then there's Auto Browse, arriving in Chrome on Android in late June, first for AI Pro and Ultra subscribers in the US. It handles “more mundane tasks on your behalf” — appointment booking, reserving a parking spot. This is where proactive stopped feeling like help and started feeling like a colleague who's a little too eager.
On Wednesday I had a dermatologist appointment in my calendar that I'd rescheduled by text but never deleted. Gemini, doing its job, saw the appointment, noticed the location, and — because I'd let Auto Browse off its leash for the week — lined up a parking reservation nearby. It asked before paying, which is the only reason this is a funny anecdote and not an angry one. But the confidence was the tell. It had stitched two facts together — calendar says appointment, appointment needs parking — without the one fact that mattered, which is that the appointment no longer existed.
This is the seam every demo hides. An agent looks brilliant doing one clean task in a controlled room. It gets human fast the moment two tasks touch and one of them is stale. A guess wearing a confirmation prompt is still a guess; it's just a politer one.
The fix, when it came, was me. I learned by Thursday to treat every proactive suggestion as a draft, not a decision. The system is built for exactly that posture — it confirms before anything irreversible — but the burden quietly shifts from doing the task to auditing the task. Some weeks that's a great trade. Some weeks auditing is the work.
Gemini Intelligence comes to Wear OS, Android Auto and Android XR later this year, and the watch surprised me. On the wrist, proactive is sized correctly. Google says Wear OS 7 Gemini is meant to “anticipate needs, surface relevant information at the right time, and reduce friction in everyday tasks,” and a small screen forces a kind of honesty: it can't dump ten options on you, so it offers one. “Start tracking my run” opened Samsung Health and started. Delivery status, ride updates, a workout summary — short, glanceable, and almost never wrong because the stakes were almost never high.
Android Auto was the opposite lesson. Proactive routing suggestions in the car are useful right up until they're presumptuous, and a presumptuous suggestion at 60 km/h is a different animal than a wrong shopping item. I turned the chattier features off after two days. The catch with proactive isn't that it's wrong often; it's that the cost of wrong scales with the context, and the system treats a grocery item and a lane change with the same cheerful certainty.
Wear OS 7 claims up to a 10% battery improvement over Wear OS 6 from software optimization alone, which I'll take, because an assistant that's always watching for the right moment is, by definition, always watching. On the phone, a week of proactive Gemini was a noticeable but survivable battery tax — the price of an agent that wakes up on its own.
On privacy, Google has clearly done its homework on the messaging. Autofill via Personal Intelligence is “strictly opt-in” with a toggle, Rambler's voice transcription happens in real time and “is not stored or saved,” and Gemini runs “only in apps that you permit it to,” with granular control over what data is shared. The on-paper story is good. The lived-in story is that proactive features are only as useful as the data you feed them, and the system gently, constantly, asks for more connections — more apps, more context — because that's what makes it smarter. The friction isn't a single scary permission screen. It's the slow negotiation over how much of your life you wire in to make the magic show up.
It's worth saying Google isn't alone here. Microsoft is pushing the same idea into work: Microsoft 365 Copilot's agents and Copilot Cowork can take a described outcome, break it into steps, and carry tasks forward across hours with “visible progress and opportunities to steer.” Same philosophy, same trust question — just pointed at your inbox instead of your errands. The whole industry has decided, at roughly the same moment, that the assistant should move first.
After a week, here's the line I drew, and I think it's the line most people will draw too. Proactive is genuinely useful when the task is reversible, low-stakes, and tedious: building a cart, surfacing a tracking number, drafting a reply, starting a run. It's presumptuous the instant the task is irreversible, high-stakes, or depends on a fact the system can't verify — spending money, booking on stale data, anything in a moving car.
Gemini Intelligence rolls out this summer on the latest Samsung Galaxy and Pixel phones, widening to watches, cars and glasses later this year, so most people will meet it gradually rather than all at once. That's the right speed. Proactive AI is the rare feature where the technology is ready before the trust is, and trust is the only thing that makes it usable.
Who it's for: anyone who does the same small digital chores every week and is willing to spend the first few days teaching the system where its edges are. The payoff is real — I got my afternoon back, and the watch alone justified leaving it on.
Who should wait: anyone who'd let it act on money or appointments unsupervised, because the confidence is total and the verification is yours. Goes on the kept list, with the proactive dial turned down to about half. The assistant that acts before you ask is a genuinely good idea. It just hasn't learned yet that acting before you ask is also exactly how you book a parking spot for an appointment that no longer exists.

