The science of large models

GPT-5 proposed the answer to a three-year-old biology problem. That is not the same as knowing it was right.

OpenAI says its model cracked an immunology mystery in minutes. The question that matters isn't whether it did — it's how often the plausible answer is wrong, and who is keeping count.

Amy Mercer · Jun 25 · 8 min

A scanning electron micrograph of a human T lymphocyte — the kind of immune cell at the centre of the dataset GPT-5 was asked to explain.

Image: NIAID (public domain)

A large language model cannot run an experiment. It cannot culture a cell, stain a sample, or read a flow cytometer; it has never been near a wet lab and never will be. What it can do is read a table of numbers a scientist already collected and say, in effect, here is a story that would explain these. Last week OpenAI published an account of one such story turning out to be right — GPT-5, it says, proposed the mechanism behind a problem an immunologist's lab had been stuck on for three years, in roughly the time it takes to read a long email. The account is almost certainly true. It is also, read carefully, a more modest event than the headline suggests, and the gap between the two is the part worth your attention.

The immunologist is Derya Unutmaz, at the Jackson Laboratory for Genomic Medicine, and the problem was real. Since 2022 his lab had been puzzling over a set of unpublished measurements: human CD4+ T cells — the immune cells that coordinate much of the body's response — behaving oddly when their metabolism was perturbed with glucose and with 2-deoxyglucose, a compound that jams the first step of how a cell burns sugar. Something in the data did not fit. Unutmaz gave GPT-5 Pro the flow-cytometry results and asked it to explain them. The model proposed that the effect came from disrupted N-linked glycosylation — the cell's process of decorating its proteins with sugar chains — during the window when the T cells were first primed, and that the population doing the heavy lifting was the memory T cells, not the naïve ones. According to OpenAI, the lab ran experiments and the hypothesis held.

It is worth understanding why that was a good answer, because the quality of the hypothesis is part of what impressed people. Glycosylation is not the obvious place to look; the easy explanation for a metabolic perturbation is that you have starved the cell of energy, and most analyses would chase that. The model's suggestion pointed somewhere subtler — that blocking the sugar-burning pathway also starved a side-process which uses the same sugars to build the molecular decorations a T cell needs as it commits to its fate, and that the cells already carrying a memory of past exposure were the ones whose decorations mattered. That is a specific, mechanistic, non-obvious claim, the kind a good immunologist might reach after weeks of thinking. Producing it from a data table in minutes is not nothing, and I want to be precise about that before I take anything away from it: the hypothesis was good.

But take the rest at face value too, because there is no reason not to. What GPT-5 did was generate a hypothesis (a plausible, specific, testable mechanism) from expert-curated data. That is a genuinely useful act, and anyone who has watched a project stall for years on a question no one could frame correctly will understand why Unutmaz was impressed. Notice, though, the shape of it. The model did not discover anything. It proposed something, and then humans in a lab did the discovering, in the old way, with reagents and controls and weeks of work. The word doing the heavy lifting in every retelling is "confirmed", and confirmation did not come from the model. It came from the bench.

The frame was built by a human

Strip the story to its mechanics and a precondition appears at every step. A domain expert had assembled exactly the right dataset. He knew which question had gone unanswered, and why three years of prior analysis had failed to crack it. He could look at the model's answer and recognise it as biologically plausible rather than confident nonsense — and, crucially, he could tell which experiments would distinguish a right answer from a wrong one. The model operated inside a frame a human built and could not have built for itself. Change any one of those conditions — wrong data, ill-posed question, a non-expert unable to smell a wrong answer — and the same model produces the same fluent, plausible paragraph, with no internal signal that anything has gone awry. This is the property of these systems that matters most and gets discussed least: a model that proposes plausible mechanisms is, by construction, also a model that proposes plausible wrong mechanisms, and the text gives you no way to tell them apart.

This is not a quibble; it is the entire epistemics of the thing. The value of a hypothesis generator is not whether it can produce a right answer — a fluent source of plausible mechanisms will produce right answers sometimes, by chance and by having been trained on a literature full of real ones. Its value is the base rate: how often the plausible answer it hands you survives the experiment, versus how often it sends a lab down a months-long blind alley that felt, going in, exactly as convincing as the success did. That number is the one that would tell you whether GPT-5 is a research accelerant or an expensive way to generate confident-sounding leads. OpenAI does not report it. It reports a success.
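The arithmetic behind that point can be sketched in a few lines. This is an illustration only, not anything OpenAI or the Unutmaz lab has published: the tallies below are invented, and the Wilson score interval is simply one standard way, chosen here for convenience, of putting error bars on a success rate. The function name `wilson_interval` is likewise made up for this sketch.

```python
import math


def wilson_interval(hits: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a success rate.

    hits   -- hypotheses that survived the experiment
    trials -- hypotheses that were actually tested (the denominator)
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = hits / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical tallies, invented for illustration. None of these are reported figures.
for hits, trials in [(1, 1), (1, 10), (12, 40)]:
    low, high = wilson_interval(hits, trials)
    print(f"{hits} of {trials} confirmed: plausible hit rate {low:.2f} to {high:.2f}")
```

A single reported success with no denominator leaves the plausible hit rate spread across most of the range from rare to near-certain. Only a tally of misses alongside hits narrows it.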

Who is keeping count

And that is the part to hold at arm's length — not because the immunology case is fabricated, I have no reason to think it is, but because of how we are learning about it. This account arrived as a company blog post, one in a small series OpenAI has published in recent weeks, each narrating GPT-5 cracking some long-standing problem: a decades-old question in mathematics here, a stalled drug-repurposing effort there. These are, in the literal sense, advertisements — selected, polished, and released by the firm that sells the model, at a moment when it is raising money and arguing for its own scientific importance. A vendor's curated successes are evidence of what the model can do at its best. They are silent, by design, on how often it does it. The published record of these systems in science is a highlight reel; the file drawer of plausible hypotheses that wasted a postdoc's spring does not get a blog post.

It also helps to remember that the underlying idea is not new, even if this model is better at it. "AI co-scientists" and automated hypothesis generators have been announced for years — systems built on earlier models, pitched with the same narrative of compressing a researcher's months into a machine's minutes. Some produced real leads; many produced plausible noise that quietly stopped being mentioned. What has changed with GPT-5 is fluency and breadth: it will engage with a flow-cytometry dataset and a topology proof with the same composure, and its wrong answers are correspondingly more convincing. That is an improvement in capability and, at the same time, an increase in the cost of being wrong about when to trust it. The better the prose, the more an unearned hypothesis looks like an earned one.

None of this means the tool is worthless. My own read, and I will mark its confidence as moderate, is that hypothesis generation from rich datasets is one of the more real near-term uses of these models, precisely because it slots into a process built to catch their errors. The wet lab is an error-correction machine; a hypothesis that has to survive an experiment is held to a standard the model is never held to on its own. Used that way — by an expert who can frame the question and filter the answers — a model that turns three years of stuck into a testable lead in an afternoon is genuinely worth having. The danger is not the immunologist who knows exactly what he is looking at. It is the version of this story that travels: that you can pour data into a model and pour discoveries out, and that the plausible paragraph is the finding rather than the first step toward one.

So the question worth asking is not the one the blog post answers — can a language model propose the mechanism behind a hard biological result? It plainly can; it just did. The question is the one no one selling these systems has an incentive to answer: across all the times an expert asks, how often is the plausible mechanism the right one, and who is measuring that rather than collecting the wins? Until someone publishes the denominator — the misses alongside the hits, the base rate rather than the anecdote — every one of these triumphant accounts is true and almost uninformative at the same time. A model proposed the right answer to a three-year-old problem. Knowing how impressed to be requires knowing how many times it proposed the wrong one with exactly the same fluency. That number exists. It is simply not the one anyone is publishing.
