Analysis · The data wall

Jeff Dean says we're not running out of data. The catch is in the filter.

Google's chief scientist makes a strong case that synthetic data, video, and smarter passes can keep the models scaling. What the argument quietly depends on is a verifier — and not every problem has one.

Phuong Nguyen · Jun 3 · 10 min

Jeff Dean, chief scientist at Google, photographed in 2025.

Image: Christopher P. Michel / Wikimedia Commons (CC BY-SA 4.0)

For two years now the most repeated worry in AI has been that the models are about to run out of things to read. The public text of the internet is finite, the argument goes; the big labs have already swallowed most of it; the scaling curve that fed on raw tokens is about to hit a wall. It is a tidy story, and it has the great advantage of being partly true. So it is worth paying attention when the chief scientist of Google says, in plain terms, that he is not worried about it.

Jeff Dean — the engineer who co-created MapReduce, co-built TensorFlow, and ran Google Brain, and who is teased inside his own company as the "Chuck Norris of computer science" — made the case in a recent interview with Károly Zsolnai-Fehér of Two Minute Papers. "Everyone has this view that we're running out of training data," Dean said. "And it's true, we've used quite a lot of the public text data in the world." Then the turn: there is video the models barely touch, there are ways to generate synthetic data, there are more passes to take over the data already collected, and there are algorithms that extract more from every token. "I'm not too worried about that as an impediment to making progress."

It is a strong argument, made by someone with more standing to make it than almost anyone alive. It is also an argument that rests, quietly, on a single load-bearing assumption. Worth saying out loud before we repost the optimism: it works where you can check the answer. Where you can't, it works much less well. The whole debate lives in that gap.

The argument, stated fairly

Dean's synthetic-data case is not the lazy version everyone dunks on — the one where you ask a model to write a billion sentences and feed them back to itself. His example is reinforcement-learning rollouts on a hard, well-specified problem, usually code. You take a coding question and let the model attempt it a hundred or a thousand different ways. Then you filter. "Does the code even compile? Well, you can throw out 800 of them right off the bat," Dean said. "Does it pass the unit tests? Does it perform well?" What survives that gauntlet is, by construction, good — and you fold it back into the training set. "More compute will generate you more interesting solutions," he said, "and then those can then be put into the training data."
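The shape of that gauntlet is easy to sketch. What follows is a toy illustration, not Google's pipeline: the candidate strings, the `solve` entry point, and the three tests are all hypothetical, and a real system would run untrusted rollouts in a sandbox rather than call `exec` directly. It shows only the two filters Dean names, compile first and unit tests second.

```python
from typing import Callable, List

def compiles(source: str) -> bool:
    """Stage 1: the cheapest check. Does the candidate even parse?"""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False

def passes_tests(source: str, tests: List[Callable], entry: str = "solve") -> bool:
    """Stage 2: run the candidate and apply every unit test to its entry point."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # assumption: trusted toy input; sandbox this in practice
        fn = namespace[entry]
        return all(test(fn) for test in tests)
    except Exception:
        return False  # crashes and missing entry points count as failures

def filter_rollouts(candidates: List[str], tests: List[Callable]) -> List[str]:
    """Keep only candidates that survive both stages, in their original order."""
    parsed = [c for c in candidates if compiles(c)]
    return [c for c in parsed if passes_tests(c, tests)]

# Hypothetical rollouts for the task "return the sum of a list".
CANDIDATES = [
    "def solve(xs): return sum(xs)",
    "def solve(xs) return sum(xs)",   # syntax error: dies at stage 1
    "def solve(xs): return len(xs)",  # runs, but wrong: dies at stage 2
    "def solve(xs):\n    total = 0\n    for x in xs:\n        total += x\n    return total",
]

# Hypothetical unit tests; each takes the candidate function and returns a bool.
TESTS = [
    lambda f: f([1, 2, 3]) == 6,
    lambda f: f([]) == 0,
    lambda f: f([-1, 1]) == 0,
]

kept = filter_rollouts(CANDIDATES, TESTS)
print(f"{len(kept)} of {len(CANDIDATES)} rollouts survive")
```

Nothing in `filter_rollouts` asks the model what it thinks of its own answers. Every keep-or-discard decision is made by the parser and the tests, which is the property the rest of this piece turns on.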

The augmentation step is genuinely clever. Once you have a program that works, you have a fully specified behavior, and you can ask for the same thing in another language. "I generated the solution in Python. Now I could generate a solution in Go and have more Go programming language training data," Dean said. Compare that to the old computer-vision trick of nudging an image a few pixels to manufacture a "new" example; here the augmentation is an entire second language with the same verified behavior underneath it. Google has done this internally, he said, taking Python tools and their tests and producing faster versions in other languages. More capability out of the same seed of data. That part is real, and it is not hype.

Compared to what?

Here is the seam. Notice what is doing the work in that pipeline: not the generation, the filter. The model proposes a thousand answers; a cheap, trustworthy oracle — a compiler, a unit test, a performance number — throws away the 800 to 999 that are wrong. The "needle in the haystack" that Dean says a system with enough compute can find is only findable because something other than the model gets to say, definitively, which straws are not the needle. The synthetic data is valuable in exact proportion to how good and how honest that verifier is.

Code and mathematics are the blessed cases. They come with verifiers built in: it compiles or it doesn't, the test is green or it's red, the proof checks or it fails. That is why the most convincing reasoning gains of the past year have clustered in exactly those domains. Now ask the same question of the domains that don't ship with an oracle — is this essay good, is this legal summary right, is this medical advice safe, is this "a cool space invader game." There the filter is another model's opinion, which means you are grading synthetic data with synthetic judgment, and the error bars get wide fast.

The model proposes a thousand answers; a compiler throws away the wrong ones. The synthetic data is worth exactly as much as the verifier that filters it — and most problems don't come with a compiler. — Phuong Nguyen

This is not a hypothetical objection. The failure mode has a name and a peer-reviewed paper: a 2024 study in Nature showed that models trained recursively on their own unfiltered output degrade and eventually collapse, the tails of the distribution thinning until the thing forgets the rare cases that made it useful. Dean's pipeline is the designed-to-survive answer to exactly that result — filtering and verification are what stand between "synthetic data helps" and "synthetic data eats the model." Both things are true, and which one you get depends entirely on whether you have a real verifier or a flattering one. So when Dean says it works "in general" but allows there are "a lot of details to get right," believe both halves. The details are the verifier, and the verifier is the whole ballgame.

The compute is moving from learning to using

The other claim in the interview that deserves to travel is about where the machines' time actually goes. Dean cited a line from Nvidia's Bill Dally: something like 90 percent of modern data-center machine-learning compute is now inference, not training. The models are being used far more than they are being built. Set aside the caveats (search, Gmail, and the rest of Google's non-ML load are their own large slice) and the ratio inside the ML workload still tilts hard toward serving requests and running agents.

That shift has a concrete consequence in silicon, and it is measurable rather than rhetorical. Inference has different physics from training — lower precision, fixed weights, enormous request volume — so it pays to build chips that do only that. Google's seventh-generation TPU, Ironwood, was pitched explicitly as a chip "for the age of inference" when it reached general availability late last year. Its eighth generation, previewed since, splits the line in two: a training part and a dedicated inference part (codenamed Zebrafish) aimed at high-volume, long-context serving. Dean's prediction is simply more of this. "You'll see even more specialization," he said.

Specialization is also why the precision numbers have gotten absurd. The field now runs a lot of inference in FP4 — four bits per weight, barely enough to count to sixteen. "If you told that to a computer scientist from 15 years ago, they'd be like, 'FP what? That's not enough numbers,'" Dean said. And yet high-quality output comes out the far end. The honest framing isn't that four bits is magically sufficient; it's that you can claw precision back where it matters with block formats — a cluster of ultra-low-bit weights sharing one higher-precision scaling factor. Dean's open question is the right one to watch: how often do you need that scaling factor — every 64 weights, every 128, every 256? The answer is a real engineering number, and it will move.
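To make the block-size question concrete, here is a minimal sketch of block-scaled quantization. It makes several simplifying assumptions that are not from the interview: a signed 4-bit integer grid (-7 to 7) stands in for a real FP4 floating-point format, each block's scale is assumed to cost 16 bits, and the weights are made-up values with one large outlier.

```python
from typing import List, Tuple

QMAX = 7  # assumption: symmetric signed 4-bit integer grid, not a real FP4 encoding

def quantize_block(block: List[float]) -> Tuple[float, List[int]]:
    """Pick one scale for the whole block, then round each weight to a 4-bit code."""
    peak = max(abs(w) for w in block)
    scale = peak / QMAX if peak > 0 else 1.0
    codes = [max(-QMAX, min(QMAX, round(w / scale))) for w in block]
    return scale, codes

def roundtrip(weights: List[float], block_size: int) -> List[float]:
    """Quantize and dequantize, block by block, to see what survives."""
    out: List[float] = []
    for start in range(0, len(weights), block_size):
        scale, codes = quantize_block(weights[start:start + block_size])
        out.extend(scale * c for c in codes)
    return out

def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def bits_per_weight(block_size: int, scale_bits: int = 16) -> float:
    """Storage cost: 4 bits per code plus the shared scale amortized over the block."""
    return 4 + scale_bits / block_size

# Hypothetical weights: small values everywhere, plus a single large outlier.
weights = [0.01 * ((i % 13) - 6) for i in range(256)]
weights[200] = 4.0

err_small_blocks = mean_abs_error(weights, roundtrip(weights, 64))
err_one_block = mean_abs_error(weights, roundtrip(weights, 256))

print(f"block=64 : error {err_small_blocks:.4f}, {bits_per_weight(64)} bits/weight")
print(f"block=256: error {err_one_block:.4f}, {bits_per_weight(256)} bits/weight")
```

With one scale for all 256 weights, the outlier sets the scale and every small weight rounds to zero. With blocks of 64, the damage is confined to the outlier's own block, at the price of storing four scales instead of one. That is the trade behind Dean's 64-versus-128-versus-256 question.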

The 'magic sauce' is an expensive teacher

Press almost any lab on why their small, cheap model is suddenly close to their flagship and you get a knowing smile and a reference to proprietary technique. Dean offered the same smile — "there is always some magic sauce that we don't reveal" — but also, unusually, named the mundane mechanism underneath it. Distillation. Google's open Gemma models, he confirmed, are distilled from larger, higher-quality models; its fast Flash models are taught by the heavier Pro ones. The reason a distilled model can land within a few points of the frontier on a hard benchmark is not sorcery; it is that a very capable, inference-inefficient teacher did the expensive learning first.
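Dean did not say which training objective Google uses, so treat the following as the textbook version rather than anyone's recipe. In the classic soft-target formulation, the student is trained to match the teacher's full output distribution, softened by a temperature, instead of a single correct label. The logits and the temperature below are hypothetical.

```python
import math
from typing import List

def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; a higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits: List[float],
                      teacher_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) between the temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over three classes.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 4.0, 1.0]     # confident about the wrong class

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

The loss is zero only when the student reproduces the teacher's distribution, and it grows as the two diverge. The teacher's relative preferences among the wrong answers carry information a hard label throws away, which is one reason a small student can get so much out of an expensive teacher.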

Which quietly reframes the open-versus-closed debate that consumes so much oxygen. "It's not so much one of closed versus open," Dean said. If you want small, incredibly capable models — open or closed — "we have to keep building larger scale models that are maybe less inference efficient but are more capable, and then use distillation to transfer the knowledge." Read plainly, that is a concession worth holding onto: open models that improve by distillation are, to a real degree, standing on the shoulders of the frontier models they learn from. The workhorse that most people actually use — the Flash-tier model that is "almost as capable" — exists because something far more expensive was trained first. That is the measurable story, and it is more interesting than the slogan.

The problem he can't crack

The most credible thing in the interview was a thing Dean said he has failed at. Asked for a problem he has tried to solve many times and never cracked, he named continual learning — the holy grail of a model that keeps learning from experience instead of being frozen, shipped, and eventually retired. He finds today's hard split between pre-training and post-training "a little intellectually dissatisfying," and wants something interleaved: see some data, act on it, learn from the consequences, repeat, the way an agent writing code learns more from watching the code fail than from passively reading tokens.

He is also clear-eyed about why it is hard, and the reason is not only technical. A model that never stops learning is a model you can never finish testing. Today you train, you red-team, you run the safety protocols, you ship a fixed artifact you have actually evaluated. A continuously learning system has no such frozen state to certify: "how do you know that this intermediate state is actually safe?" The honest answer is that no one fully does yet. "If we're able to crack that, it's going to be amazing," Dean said. "But it's not there yet." He said it plainly, and without a roadmap. That kind of admission is worth more than a dozen confident slides.

Even the bits flip

One last detail, because it rhymes with everything above. Asked whether the cosmic-ray story is real — a distant supernova throws off a particle, it strikes a memory cell, a zero flips to a one — Dean said yes, and said Google has the monitoring data to prove it: clusters facing a particular direction show a brief spike in single-bit memory errors while clusters on the other side of the Earth don't. In the early days Google ran on consumer machines with no error-correcting memory at all, and survived by building reliability at a higher level — software checksums, corrupt records simply discarded. "How do you build reliable systems out of unreliable parts?" is, he said, a founding question of the company.
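The higher-level reliability Dean describes can be sketched in a few lines. This is an illustration under stated assumptions, not Google's record format: the layout (a 4-byte CRC32 followed by the payload) and the helper names `seal`, `unseal`, and `flip_bit` are invented for the example.

```python
import zlib
from typing import Optional

def seal(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 so corruption can be detected later."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def unseal(record: bytes) -> Optional[bytes]:
    """Return the payload if the checksum matches, otherwise None (discard it)."""
    if len(record) < 4:
        return None
    stored = int.from_bytes(record[:4], "big")
    payload = record[4:]
    return payload if zlib.crc32(payload) == stored else None

def flip_bit(record: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit memory error at the given bit position."""
    data = bytearray(record)
    data[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(data)

records = [seal(b"alpha"), seal(b"beta"), seal(b"gamma")]
records[1] = flip_bit(records[1], 40)  # one flipped bit inside the payload of "beta"

# Corrupt records are simply dropped; everything else passes through untouched.
clean = [p for p in (unseal(r) for r in records) if p is not None]
print(clean)
```

The memory is allowed to be unreliable because a cheap check downstream refuses to pass along anything it cannot vouch for. It is the same bargain as the compiler and the unit test, made one layer further down.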

It is also the right note to end a data argument on. The optimistic case — we are not running out of data — is correct in the domains where the field has a verifier honest enough to throw out the wrong answers, and shakier everywhere else. The pessimistic case isn't that the tokens ran dry; it's that the cheap, trustworthy oracle does not exist for most of what we actually want models to do. Dean has handed the industry a genuinely good answer. The footnote, as ever, is where the work is: build the filter before you trust the data, because even a single bit will lie to you if you let it.
