Analysis · Open weights

Google's Gemma 4 is small enough to run on your laptop. "Any laptop" is the overclaim.

Gemma 4 12B is a real shrink: a capable multimodal model in about 16GB. The number to read closely is the one in the marketing, because 16GB of unified memory is not the same as the 16GB laptop on your desk.

Phuong Nguyen · Jun 4 · 10 min

For about a week the pitch has been everywhere, repeated with the kind of confidence that should always make you go and read the footnote: Google has built a capable, multimodal AI model that runs on your laptop. No data centre, no API bill, no account, no waiting on someone else's server. Download the weights and go. The model is real, the release is genuinely a big deal, and most of the excitement is earned. The part that needs a second look is the smallest word in the sentence. Not "capable." Not "multimodal." The word doing the heavy lifting is "any."

I want to be clear up front, because skepticism gets misread as dismissal: this is one of the more impressive things any lab has shipped this year. I just think it is being described with a roundness that the spec sheet does not support. The gap between "runs in 16GB" and "runs well on the 16GB laptop you already own" is exactly the sort of gap that gets flattened in a headline and then quietly disappoints a few hundred thousand people who try it on the machine they have.

What Google actually shipped

On June 3, Google DeepMind released Gemma 4 12B, an open-weights model of just under 12 billion parameters. The exact figure floating around is 11.95 billion, which is the kind of detail I trust more than a rounded one. It carries an Apache 2.0 license, about as permissive as these things get, and it handles text, images, and audio. The weights are up on Hugging Face and Kaggle, and it runs in the tools people already use for local models: llama.cpp, MLX on Apple machines, vLLM and SGLang for anyone serving it at any kind of scale. None of that is marketing copy. You can have it on your disk this afternoon.

The headline spec, the one in every writeup, is that it fits in roughly 16GB of memory. That is the claim worth slowing down on, and I will. But it is worth establishing first why the rest of the release stands on its own, because the easy move here is to treat "runs on a laptop" as the whole story when the more interesting work is underneath it.

The clever part is the architecture, not the size

Most models that call themselves multimodal are, under the hood, a language model with separate vision and audio encoders bolted onto the side. The image gets turned into something the language model can read by one module, the audio by another, and those modules cost memory, add latency, and generally make the whole thing heavier than its parameter count suggests. Gemma 4's headline trick is an encoder-free design Google calls "Unified." Raw audio waveforms and visual patches flow directly into the core of the model, without the secondary processing stages. Fewer moving parts, less overhead, and a meaningful chunk of why the memory footprint stays as small as it does.

That is a real engineering result and it deserves more attention than the laptop angle is giving it. If the number that matters to you is capability-per-gigabyte, the architecture is the reason the number moved, not some marketing decision to call a 16GB requirement a feature.

The other spec being passed around is a 256,000-token context window. In plain terms that is a few hundred pages of text, or an hour-plus of transcript, or a fair-sized code repository, all held in the model's working memory at once. Useful, genuinely. But here is the footnote nobody reads aloud: context is not free. Every token the model holds in context takes up space in its attention cache, that space grows with the length of the context, and it comes out of the same memory budget as the model itself. A model that comfortably fits in 16GB at a short prompt is not necessarily the same model that fits in 16GB while juggling a quarter-million tokens of context. The window is a ceiling you can reach, not a room you get for nothing.
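
To put a rough number on that footnote, here is a minimal sketch of the standard key-value cache estimate for a transformer. The layer count, KV-head count, and head dimension are placeholders I made up for illustration, not Gemma 4's published configuration, and the formula ignores techniques such as sliding-window layers or cache quantization that can shrink the figure considerably.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Estimate cache size: one key and one value vector per token, per layer."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return tokens * per_token

# HYPOTHETICAL architecture numbers, chosen only to show how the cost scales.
HYPOTHETICAL = dict(layers=48, kv_heads=8, head_dim=128)

for tokens in (4_000, 32_000, 256_000):
    gib = kv_cache_bytes(tokens, **HYPOTHETICAL) / 2**30
    print(f"{tokens:>7} tokens -> {gib:6.2f} GiB of cache")
```

Whatever the real configuration turns out to be, the shape is the same: the cost is linear in tokens, so a window 64 times longer needs 64 times the cache.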

"16GB" and "a 16GB laptop" are not the same number

Here is the seam in the marketing. The requirement is described as around 16GB of VRAM or unified memory. Those last two words are carrying the entire claim, and most people repeating it do not stop on them.

On an Apple Silicon Mac, memory is unified. The CPU and GPU share one pool, so a 16GB MacBook can in principle hand most of that pool to the model and let it work. That is the friendly case, and it is almost certainly the machine the "runs on a laptop" demo was filmed on. On a typical Windows laptop, 16GB usually means 16GB of system RAM, and the graphics are either integrated, sharing that very same RAM with Windows, your browser, and everything else you have open, or a discrete GPU with its own separate and usually smaller pool of VRAM. Same number on the box, completely different machine for this purpose.

There is a second assumption hiding in the number. To get a 12-billion-parameter model into 16GB at all, you are almost certainly running it quantized, compressed from its native precision down to something like four bits per weight. Quantization is standard, it works, and it is genuinely impressive how little quality you lose these days. But it is a compression, and it is the thing that makes the headline true. The honest version of the sentence is something like: a four-bit quantized Gemma 4 12B fits in about 16GB of memory that the model is actually allowed to use, which on a lot of real laptops means closing your other apps, tolerating slower output, or discovering it does not quite fit once Windows takes its cut.
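
The arithmetic behind that is short enough to check yourself. This sketch counts the weights only, using the 11.95 billion figure quoted above. It assumes a 16-bit native precision (typical, but my assumption here) and a uniform bit width, and it leaves out quantization metadata, the context cache, and the runtime's own overhead, all of which push the real number up.

```python
PARAMS = 11.95e9  # parameter count quoted in the article

def weight_gb(params, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9

# 16-bit is an ASSUMED native precision; 8 and 4 are common quantized widths.
for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{label:>6}: {weight_gb(PARAMS, bits):5.1f} GB of weights")
```

At 16 bits the weights alone come to roughly 24GB, which is why a 16GB claim implies quantization. At four bits they come to roughly 6GB, and the rest of the budget goes to context, the runtime, and whatever the operating system is holding.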

I keep stressing this not to take the shine off, but because "runs on any laptop with 16GB" is precisely the kind of frictionless, round claim that travels faster than it deserves. Compared to a year ago, fitting this much capability into this little memory is a real leap, and people should be impressed. The thing to resist is the slide from "runs in 16GB" to "runs nicely on the cheap laptop in your bag." Those two are separated by quantization quality, memory bandwidth, how much of your RAM the rest of your system is holding, and whether your 16GB is unified or shared. None of that fits in a tweet, which is why none of it is in the tweets.
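
Memory bandwidth is the least intuitive item on that list, so here is a back-of-envelope sketch. It rests on a common rule of thumb, not on anything Google has published: when generating one token at a time, a dense model has to read roughly all of its weights from memory per token, so bandwidth divided by model size caps the speed. The model size assumes a four-bit quantization, the bandwidth figures are placeholders, and real throughput will land below the bound.

```python
def max_tokens_per_second(bandwidth_gb_s, model_gb):
    """Upper bound on decode speed if every weight is read once per token."""
    return bandwidth_gb_s / model_gb

MODEL_GB = 6.0  # ASSUMED: roughly a 4-bit 12B model, weights only

# HYPOTHETICAL bandwidth figures for illustration; look up your own machine's.
for name, bandwidth in (("slower shared RAM", 50.0), ("faster unified memory", 200.0)):
    bound = max_tokens_per_second(bandwidth, MODEL_GB)
    print(f"{name}: at most ~{bound:.0f} tokens/s")
```

Two machines with the same 16GB can differ several times over on this bound, which is the difference between "runs" and "runs at a pace you would choose to use."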

"Nearly as good as the 26B" is a benchmark sentence

The other line doing the rounds is that Gemma 4 12B lands close to Google's larger 26-billion-parameter mixture-of-experts model on benchmarks. Treat that the way you should treat every "nearly as good as" in this field, which is to ask three questions before you repeat it: near on which benchmarks, measured how, and was the small model built with the big one explicitly in mind.

That last question answers itself. Distillation, where a large model teaches a smaller one, is exactly how you get a 12B model performing like a 26B on a test. Google's own chief scientist said as much in an interview we covered yesterday: the flash-tier models are close to the frontier because the frontier model taught them. It is a legitimate technique and it is most of the reason small models have gotten so good so fast. It also means "near the 26B on benchmarks" and "as capable as the 26B on your actual work" are two different claims that happen to share a sentence. Benchmarks are where distillation shows its best face. The gap tends to reopen on the long-tail, out-of-distribution tasks that benchmarks systematically under-sample.

The figure I would want before repeating "nearly as good" is the spread, not the single headline score. Near on aggregate can hide far on the one thing you actually need it for. Until there is an independent evaluation on genuinely held-out tasks, "near 26B" means "near 26B on the tests Google chose to publish," and that distance could be small or it could be the whole point. I am not saying the claim is wrong. I am saying it is unverified in the way that matters, and the people repeating it are quoting a vendor about its own homework.
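
Here is what asking for the spread looks like in practice, as a small sketch. The scores are invented placeholders, not results for either model, and the task names are hypothetical. The point is only that the mean gap and the worst-case gap are different numbers, and you want both.

```python
def gaps(small, large):
    """Per-benchmark gap (large minus small) for benchmarks both models share."""
    return {name: large[name] - small[name] for name in small if name in large}

def summarize(small, large):
    """Return (mean gap, name of the worst benchmark, gap on that benchmark)."""
    g = gaps(small, large)
    mean_gap = sum(g.values()) / len(g)
    worst = max(g, key=g.get)
    return mean_gap, worst, g[worst]

# MADE-UP scores purely to show the shape of the problem; not real results.
small_model = {"task_a": 80.0, "task_b": 70.0, "task_c": 75.0, "task_d": 60.0, "task_e": 40.0}
large_model = {"task_a": 81.0, "task_b": 70.0, "task_c": 76.0, "task_d": 60.0, "task_e": 58.0}

mean_gap, worst, worst_gap = summarize(small_model, large_model)
print(f"mean gap {mean_gap:.1f} points, worst on {worst}: {worst_gap:.1f} points")
```

If the one task you care about is the outlier, the average tells you nothing useful about your own work.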

The license is the real headline

Here is the part I think is under-discussed, and it is the part that actually changes things: the license. Apache 2.0 weights mean you can run Gemma 4 on your own hardware, modify it, fine-tune it, and ship it inside a product without asking Google for permission and without sending your data anywhere. For anyone who cares about privacy, about cost, or about not building a business on top of someone else's metered API, that is the story, not the laptop demo.

A model that runs locally is a model that does not log your prompts to a vendor, does not fall over when that vendor has an outage, and does not get more expensive the morning the vendor decides to reprice. Those are not small properties. The entire anxiety of the current moment, where everyone is renting intelligence by the request and watching the meter, is exactly the thing a good local model relieves.

It is also, to be clear-eyed about it, a strategy rather than a gift. Google gives away a strong open model because a capable local model that developers build on is a model that pulls those same developers toward Google's larger, hosted, paid models the moment they need more than a laptop can give. The free local tier and the expensive cloud tier are not in tension. They are the same funnel. That does not make the free model less useful to you. It just means you should read the generosity as a business decision, because it is one.

What to actually expect if you download it

On an Apple Silicon Mac with 16GB or more of unified memory, through MLX or llama.cpp, expect it to run quantized at usable speeds for text, and noticeably slower as you push the context window toward its limit.
On a Windows laptop with 16GB of system RAM and integrated graphics, expect it to be tight. It may run, but shared memory and lower bandwidth will cost you speed, and a long context will claim memory the machine has already promised to something else.
Image and audio input work, which is the genuinely fun part, but multimodal inference uses more memory than plain text, so the comfortable machine is one with real headroom above 16GB rather than exactly 16GB.
"Runs" and "runs at a speed you will actually tolerate" are different tests. Run the second one yourself before you promise the model to anyone else.

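The second test is easy to run. This sketch times anything that yields tokens one at a time. `fake_generate` is a stand-in I wrote so the example runs on its own; you would swap in the streaming call from whichever runtime you use. Note that it lumps prompt processing in with generation, so a long prompt will drag the figure down.

```python
import time

def tokens_per_second(generate, prompt, max_tokens=64):
    """Time a token generator; return (tokens produced, tokens per second)."""
    start = time.perf_counter()
    count = 0
    for _ in generate(prompt, max_tokens):
        count += 1
    elapsed = time.perf_counter() - start
    return count, (count / elapsed if elapsed > 0 else float("inf"))

def fake_generate(prompt, max_tokens):
    """HYPOTHETICAL stand-in for a real model: yields tokens with a small delay."""
    for i in range(max_tokens):
        time.sleep(0.001)
        yield f"tok{i}"

count, rate = tokens_per_second(fake_generate, "hello", max_tokens=20)
print(f"{count} tokens at {rate:.1f} tokens/s")
```

Run it once with a short prompt and once with the longest input you realistically expect, because those can be very different numbers on the same machine.
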
The number is real. The asterisk is the story.

Gemma 4 12B is a real milestone. A multimodal, open-weights model, decently capable, that a lot of people can run without a data centre, built on an architecture that is cleverer than the marketing bothers to explain. It deserves the attention it is getting. Reading the fine print is not an attempt to spoil it.

It is just that "any laptop with 16GB" is the kind of smooth, confident claim that outruns its own caveats, and the people passing it along rarely mention quantization, memory bandwidth, unified versus shared RAM, or what a 256K context costs in practice. So download it. Run it on the machine you actually own, not the one in the demo. Then you will know which 16GB you have. As always, the chart that says it runs anywhere is the one to read twice.