Lab Gains of 40 Percent, Field Gains of 2.5. The CFO's Real Number Is Somewhere Between.

Noy and Zhang published the number in Science three years ago. ChatGPT cut writing time by 40 percent. Dell'Acqua and the BCG-Harvard team reported 40 percent quality gains on marketing tasks. Peng and colleagues measured 55.8 percent time saved on coding with GitHub Copilot. Those numbers powered three years of conference decks, board memos, and capex justifications. They also measure something different from what shows up on a finance team's P&L.

When researchers stopped timing isolated tasks and started timing whole workweeks, the numbers collapsed by more than an order of magnitude. Bick, Blandin, and Deming ran a nationally representative US survey in 2024. About 40 percent of adults had used generative AI. Users at work reported saving roughly 2.2 percent of weekly hours. Humlum and Vestergaard's Denmark panel, which matched worker surveys to administrative payroll records, came in at 2.8 percent with null earnings effects over two years. Those are among the cleanest identifications in the field literature.

The standard read is diffusion lag. Tools are new, workers haven't mastered them, firms haven't redesigned around them. Wait five years. Numbers converge. MIT Project NANDA leans on this frame when it reports that 95 percent of enterprise generative AI pilots produce no measurable business impact. McKinsey uses a version of it to explain how 72 percent of firms can use AI in at least one function while only 10 percent see material earnings impact. The assumption running through both: the gap closes itself as adoption matures.

There is some truth in diffusion lag. Workflow redesign matters. Line-manager involvement matters. The 5 percent of NANDA pilots that did work succeeded by building around the tool. Tool selection was the secondary variable. What diffusion lag cannot explain is the full shape of the spread. The experimental studies measured the slice of knowledge work the tool is good at. The field studies measured the whole job. A mid-level analyst or associate spends most of the week figuring out what the stakeholder actually wants, reading prior documents, reconciling three different versions of the same analysis, and rewriting the same memo for three different audiences. Only a piece of the week is the output task the lab timed.

Run the arithmetic. If the tool saves 40 percent of the output task, and the output task is a fifth of the week, the week-level saving is about eight percent. Then discount the slice itself: some of that output needs heavy review, some prompts misfire, some drafts get thrown away. Shave off a quarter to a third on those grounds and you're down to five or six. Keep shaving. You land in the two to three percent range the field surveys are finding. That coherence is the structure showing through. The field surveys and the macro math are describing the same workweek from two different altitudes.
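The back-of-envelope above can be written out directly. The numbers here are the article's illustrative figures plus an assumed 70 percent effective-coverage factor standing in for review overhead and misfires; none of them come from a specific study.

```python
# Task-level gain -> week-level gain, per the article's arithmetic.
task_gain = 0.40    # lab-measured saving on the output task
task_share = 0.20   # output task is a fifth of the week

week_gain = task_gain * task_share
print(f"naive week-level saving: {week_gain:.1%}")   # 8.0%

# Hypothetical discount: only ~70% of the assisted work survives
# review, rework, and misfires without losing the time saved.
coverage = 0.70
print(f"discounted saving: {week_gain * coverage:.1%}")  # 5.6%
```

Stacking one or two more such discounts is what walks 8 percent down into the 2 to 3 percent band the field surveys report.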

Daron Acemoglu's macro estimate lands at roughly 0.7 percent TFP growth over a decade. That implies field-scale productivity, not lab-scale. Acemoglu is making a narrower claim than the headlines suggest. He is weighting every task in the economy by its exposure to AI, applying a plausible task-level productivity gain, and aggregating. The number he gets coheres with the field surveys. A fresh NBER forecasting tournament led by Karger, Tetlock, and coauthors (w35046, April 2026) reports a median US GDP growth forecast of 2.5 percent through 2030, with a rapid-AI scenario of 4 percent. Goldman's 7 percent remains the ceiling. The defensible range is one to seven, not any single number.
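The exposure-weighted aggregation is mechanically simple, which is part of why the estimate is hard to dismiss. The sketch below uses invented task buckets and made-up shares, exposures, and savings, chosen only so the total lands near the 0.7 percent scale; they are not Acemoglu's actual inputs.

```python
# Exposure-weighted aggregation in the spirit of Acemoglu's estimate.
# Each tuple: (share of economy-wide labor, AI exposure, task-level saving).
# All values are illustrative assumptions, not his published inputs.
tasks = [
    (0.20, 0.25, 0.10),  # highly exposed knowledge work
    (0.30, 0.10, 0.05),  # partially exposed work
    (0.50, 0.05, 0.02),  # largely unexposed work
]

tfp_gain = sum(share * exposure * saving
               for share, exposure, saving in tasks)
print(f"aggregate TFP gain: {tfp_gain:.2%}")  # 0.70%
```

The structure makes the lever visible: the aggregate number moves only when exposure or the task-level saving rises, which is the same mechanism the closing paragraph points at for agent-style systems.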

For a finance team, the implication is operational. A capex program justified off 40 percent productivity gains is solving for a task, not a P&L line. A headcount plan modeled on the same assumption is solving for the same mismatch. The right question is narrower. What is the slice of the work where the tool lands clean, how do you route that slice through your team, and what do you do with the 80 percent of the week that sits outside it? The 5 percent of NANDA winners all had a version of that answer. The 95 percent did not.

Capability may close the gap over time. Agent-style systems that can sit through a stakeholder meeting, draft three differently-framed variants, and run edits from feedback would pull field numbers closer to lab numbers. Anthropic's March 2026 Learning Curves release shows automation rising toward parity with augmentation across occupations. That is the mechanism worth watching over the next two years. The gap closes when the tool absorbs more of the week. Broader adoption alone does not move the field number. Plan the capex accordingly.
