Agents

I gave an agent the keys to my laptop for a week

Codex on my phone, driving my Mac. Claude reaching for my apps. An Operator that never stops asking. Seven days of handing over real control — and finding the exact seam where it frays.

Daniel Vance·May 27·8 min

A pair of hands typing on a laptop keyboard at a desk.

Photograph: Vitaly Gariev / Unsplash

For one week I stopped watching my laptop work and started watching my phone watch my laptop work. On a Monday morning in Singapore I scanned a QR code, the Codex app on my Mac blinked, and from then on I could approve commands, read diffs, switch models and kick off new tasks from a train platform while my MacBook sat at home doing the actual labour. By Wednesday it had refactored a small project of mine cleanly, run the tests, and waited politely for permission to push. By Thursday it had also opened the wrong browser tab, OCR'd a half of my screen I'd rather it hadn't, and confidently misread an instruction that any half-awake human would have caught. Both of those things are true, and the gap between them is the whole story of computer-use agents right now.

This is not a benchmark piece. I am not going to tell you that Claude Opus 4.8 scores 84% on Online-Mind2Web — though it does, and that number is genuinely a leap — because a score on a test set is not the same as a tool surviving your Tuesday. What I wanted to know was simpler and harder: if you hand one of these things real control of your apps and accounts for a week, what actually breaks? And the answer, every time, was the same. It was never the task. It was the handoff between tasks.

The setup: three agents, one nervous week

I ran three. OpenAI's Codex, which on 14 May got remote access through the ChatGPT mobile app — you steer it from your phone while it runs on a Mac, with screenshots, terminal output, diffs and approval requests flowing back in real time, and your files, credentials and permissions staying on the machine. Anthropic's Claude computer use, the feature that lets Claude click, scroll, open apps and fill spreadsheets on your behalf, in research preview since March and sharpened by Anthropic's February acquisition of the Seattle startup Vercept. And OpenAI's old Operator, now folded into ChatGPT's agent mode, which drives a cloud browser rather than your own machine.

Three different bets on the same question — how much should you be allowed to look away — and three different answers. Codex bets you'll glance at your phone. Claude bets you'll stay in the room. Operator bets, frankly, that you have nothing better to do than confirm dropdowns.

Codex: the agent you can almost walk away from

Codex was the one I lived with most, because it was the one built for exactly this experiment. The macOS app, since April, drives third-party apps with its own cursor — it can navigate an interface to test a UI, switch into a browser to fetch a page, run multiple background tasks without trampling whatever you're doing in the foreground. Powered by GPT-5.5, it is fast and, on well-bounded coding work, genuinely good. The mobile remote control is the part that changes the feel of it. You are not babysitting a window. You are getting pinged when judgment is needed.

In practice that worked best when the work was a clean chain inside one domain. "Fix the failing test, run the suite, show me the diff" — Codex handled that end to end and paused for the push, which is exactly where you want it to pause. The catch arrives the moment two unlike tasks touch. I asked it to pull a number out of a webpage and update a config file with it. It fetched the page, read the wrong figure off a similar-looking row, and wrote it in with the same calm it had shown when it was right. A wrong value wearing a green checkmark is not progress. It's a bug you now have to go find.

An agent looks magical doing one clean task in a controlled room. It gets human fast the moment two tasks touch — and a confident guess wearing a diff is just a mistake you have to go and find.

Then there's Chronicle, the screen-memory feature that takes periodic screenshots, OCRs them and stores summaries as local Markdown. Useful, in theory — the agent remembers what you were doing. Unsettling, in practice. OpenAI's own documentation tells you it burns rate limits quickly, raises the risk of prompt injection, and stores those memories unencrypted on your device, and it suggests you pause Chronicle before meetings or anything sensitive. Read that again: the safety control is you remembering to flip a switch. I forgot once, mid-week, and found a tidy little Markdown record of a private message thread sitting in plain text on my disk. That is not a bug. That is the design asking more of me than I asked of it.

Claude: the one that keeps you honest

Claude computer use is the most capable driver I tested and, not coincidentally, the one that most wants you watching. It opens apps, navigates browsers, edits spreadsheets, makes changes in an IDE and submits pull requests. The Vercept team it absorbed in February had spent years on exactly the hard part — getting a model to see and act inside the same software humans use — and you can feel that pedigree in how rarely it gets visually lost. Where Codex sometimes clicks the plausible thing, Claude more often clicks the right thing.

And still it asks permission before touching a new app, every time, which after a day starts to feel less like a guardrail and more like the honest admission it is: nobody, including the people who built these, fully trusts them to run unsupervised yet. The good news is the trust can now live somewhere you control — Anthropic's managed agents can operate in a sandbox you configure and reach private MCP servers, with self-hosted sandboxes in beta. That's the right direction. It means the question stops being "do I trust the agent" and becomes "do I trust the box I've put it in," which is a question an engineer can actually answer.

The limitation worth saying plainly: it's Mac-only, like Codex's device control. If you're on Windows, this entire week of mine is a coming-attractions trailer, not a product you can buy.

Operator: the agent that quit by interrupting

I include Operator — now ChatGPT's agent mode, driving a cloud browser — mostly to mark the contrast, because it is the clearest picture of the failure the others are trying to escape. Agent mode is real and sometimes impressive: it has a visual browser, a text browser, a terminal and API access, and it can genuinely plan a multi-step web task. But it stops. It hits a login and pauses. It meets a dropdown it's unsure about and asks. It reaches a checkout and wants to double-check. Independent reviewers this spring clocked it at 32.6% on a real-world agent benchmark, and that number maps neatly onto the lived experience: roughly a third of the time it finished the job, and the rest of the time it finished asking me about the job.

Here is the thing nobody markets. An agent that interrupts you constantly is not a cautious agent. It's an agent that has offloaded its uncertainty onto you and called it safety. I spent more attention supervising Operator than I'd have spent doing the tasks myself. That's the trap at the bottom of this whole category, and it's the one Codex and Claude are sprinting away from in opposite directions — Codex by letting the interruptions reach your phone instead of your desk, Claude by getting good enough to need fewer of them.

What actually breaks: the handoff, always the handoff

By the end of the week the pattern was boring in its consistency. Single, bounded, in-domain tasks: mostly fine, sometimes great. The seam was always the join — pass a result from one task to the next and the agent doesn't carry your intent across, it carries a literal artifact, and if that artifact is slightly wrong the error propagates with full confidence and no flag. The agents are excellent at doing. They are still poor at noticing when they should stop doing and ask. The trouble is that the model that asks too much (Operator) is exhausting, and the model that asks too little (Codex, on a bad guess) is dangerous, and nobody has yet found the dial in the middle.

So: when can you stop watching? My honest answer after seven days is that you can stop watching the screen, but not the seams. Set the agent on a self-contained task with a clean definition of done and go make coffee — that's now real, and the phone-tether on Codex makes it realer. But the instant a job spans two domains, two accounts, two apps that have to agree, get back in the room. That's not where the magic is. That's where the magic frays.

The verdict, and the scorecard

Who it's for: developers and power users with bounded, repetitive, well-specified work — test-and-fix loops, scripted refactors, form-filling drudgery — who are on a Mac and willing to define "done" tightly. For them, Codex with mobile remote control is the one I'd start with; it respects your attention more than anything else I tried, and the handoff-to-your-phone is the genuine innovation of the month. Claude is the one to reach for when correctness matters more than convenience and you'll stay nearby; it's the better driver. Operator I'd wait on until the asking-to-doing ratio inverts.

My running subscription graveyard gained no new headstones this week, which is rare and worth saying — I'm keeping Codex and Claude on. But I'm keeping them the way my mother kept a customer's half-fixed radio on the bench: plugged in, doing useful work, and never once left alone in the shop overnight. The agents earned the keys to the laptop. They have not yet earned the keys to the building. Ask me again in a month.

I gave an agent the keys to my laptop for a week

The setup: three agents, one nervous week

Codex: the agent you can almost walk away from

Claude: the one that keeps you honest

Operator: the agent that quit by interrupting

What actually breaks: the handoff, always the handoff

The verdict, and the scorecard

References

Read next

They fit a 27-billion-parameter model on an iPhone. The compression is real. The capability is the part nobody measured.

The voice that doesn't wait its turn

OpenAI merged its coding app into ChatGPT and called the result Work. I gave it two days and a real deadline.

One email. Every Friday.