
Can you even test a model before you ship it?
Washington wants frontier models on the bench 90 days before launch. Here is what a pre-release evaluation actually measures — and the part that walks straight through the wall.
Models, agents, and the labs racing to build them — and the limits nobody can engineer away.

For a week I let Gemini Intelligence act before I asked it to. It saved me time, booked me a parking spot I didn't need, and taught me exactly where the trust line sits.

Codex on my phone, driving my Mac. Claude reaching for my apps. An Operator that never stops asking. Seven days of handing over real control — and finding the exact seam where it frays.

A new open-weights model tops the chart every few weeks. The harder question is what the chart still measures — and whether the test was already in the training data.

Language models leak their training data by construction — and the only provable fix is the one no one shipping a flagship will accept.

A new suit against Meta argues the company didn't just read the books to build Llama. It kept them — and the model can be made to prove it.