
The METR Study One Year Later: When AI Actually Slows Developers
- 5 minutes - Feb 23, 2026
- #ai #productivity #research #metr #developer-experience
In early 2025, METR (Model Evaluation & Threat Research) ran a randomized controlled trial that caught the industry off guard. Experienced open-source developers—people with years on mature, high-star repositories—were randomly assigned to complete real tasks either with AI tools (Cursor Pro with Claude) or without. The result: with AI, they took 19% longer to finish. Yet before the trial they expected AI to make them about 24% faster, and afterward they believed they had been about 20% faster. A 39-point gap between perception and reality.
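The arithmetic behind that 39-point figure, using the study's own numbers (a minimal sketch; the sign convention is mine: positive means faster with AI):

```python
# Numbers reported by the METR study, as percent speedup with AI.
# Positive = faster with AI, negative = slower.
expected_speedup = 24    # developers' forecast before the trial
perceived_speedup = 20   # developers' self-report after the trial
measured_speedup = -19   # stopwatch result: tasks took 19% longer

# Gap between what developers felt and what actually happened.
perception_gap = perceived_speedup - measured_speedup
print(perception_gap)  # 39 percentage points
```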
A year later, that finding still shapes how we should think about AI adoption, especially for teams that aren’t seeing the performance benefits they expected. Here’s what the study actually showed and what to do with it.
What the METR Study Did
METR recruited 16 experienced developers (around five years of experience each, working on real open-source projects with 22,000+ GitHub stars and 1M+ lines of code). Between them they completed 246 real-world tasks—bugs, refactors, enhancements—with each task randomly assigned as "AI allowed" or "AI forbidden." Participants were paid well ($150/hour) so compensation wouldn't skew effort. So we're not talking about toy tasks or casual users; we're talking about serious work on serious codebases.
The tools in use were Cursor Pro with Claude 3.5/3.7 Sonnet—a top-tier 2025 AI coding setup. So the result wasn't "bad tools slow you down." It was "even with good tools, on this kind of work, experienced developers were slower on average when they used AI."
Why AI Slowed Them Down
The study and follow-up work point to a few mechanisms:
- Prompting and iteration: Time spent crafting prompts, waiting for answers, and going back and forth.
- Scrutinizing output: AI suggestions look plausible; checking whether they’re correct and fit the codebase takes time. In the study, developers accepted fewer than 44% of AI suggestions.
- Fixing AI-generated code: When the model was wrong, debugging and correcting it added time.
- Context mismatch: Large codebases carry a lot of implicit context—conventions, history, constraints. Models don’t have that. So suggestions often don’t fit, and fixing the fit costs time.
So the slowdown wasn’t mysterious. It was the cost of using a powerful but context-blind assistant on complex, context-heavy work.
The Perception Gap and Why It Matters
The striking part was the perception gap. Developers felt faster; the stopwatch said they were slower. That has direct implications for teams “not seeing benefits”:
- Self-report is unreliable. If you only ask “Does AI help?” you can get “yes” even when net time goes up. You need outcome metrics (cycle time, time to resolution, quality) to know what’s really happening.
- Feeling productive isn’t the same as being productive. The industry has leaned on “developers say they’re faster” to sell tools. The METR result is a reminder that we need to measure.
- Experienced developers may be the worst fit for generic AI on hard tasks. The study used experienced devs on substantial tasks. That’s exactly where AI might add overhead (prompting, verification, wrong abstractions) instead of reducing it. Juniors or simpler tasks might tell a different story—and later work has started to unpack task-type and experience effects.
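One way to close that measurement gap is to compare concrete completion times instead of sentiment. A minimal sketch, assuming you can tag each finished task with whether AI was used and how long it took (the task log and hours here are made up for illustration):

```python
from statistics import median

# Hypothetical task log: (ai_used, hours_to_complete).
# In practice these would come from your issue tracker or CI data.
tasks = [
    (True, 6.5), (True, 4.0), (True, 8.2), (True, 5.1),
    (False, 5.0), (False, 3.8), (False, 6.0), (False, 4.4),
]

with_ai = [h for ai, h in tasks if ai]
without_ai = [h for ai, h in tasks if not ai]

def pct_change(new, old):
    """Percent change in median completion time; positive = slower."""
    return 100 * (median(new) - median(old)) / median(old)

print(f"AI tasks vs non-AI tasks: {pct_change(with_ai, without_ai):+.0f}% median time")
```

Self-report would never surface a number like this; a task log does.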
What We’ve Learned Since
In the year since, a few patterns have become clearer:
- Task fit matters more than tool. Where AI is used (boilerplate, docs, tests, well-scoped bugs vs. architecture, security, deep debugging) often matters more than which model or IDE you use. The METR tasks were a real-world mix; breaking the results down by task type would likely show AI helping on some and hurting on others.
- Verification cost is central. If you don’t account for the time to review and fix AI output, you’ll overstate gains. Teams that “don’t see benefits” are often teams where verification is eating the gains.
- Benchmarks and real work can diverge. Copilot and Cursor cite internal or vendor studies showing gains. METR used a different design (RCT, experienced devs, real repos). The takeaway isn’t that one side is “right”—it’s that context (who, what task, how you measure) determines whether AI speeds you up or slows you down.
What to Do on Your Team
If your team is struggling to see performance benefits from AI, the METR study suggests:
- Measure outcomes, not sentiment. Use cycle time, time-to-fix, escaped defects, or other concrete metrics. Compare periods or groups (e.g. AI-heavy vs. light) so you can see if AI is helping or hurting on balance.
- Segment by task and experience. Don’t assume “AI helps” or “AI hurts” globally. Try to see where it helps (e.g. docs, tests, routine changes) and where it doesn’t (e.g. complex design, security-sensitive code). Let experienced devs opt out or restrict AI on tasks where they’re faster without it.
- Treat verification as first-class. If METR-style slowdowns are from prompting + scrutiny + fixing, then improving verification (clear review rules, targeted testing, better prompts) is how you keep AI from canceling out its own gains.
- Don’t default to “use AI more.” For some people and some tasks, “use AI less” might be the right answer. The goal is better outcomes, not higher AI usage.
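The segmentation advice above can be sketched as a small script, assuming (hypothetically) that closed tasks are labeled with a type and an AI-used flag. The categories and hours are illustrative, not real data:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (task_type, ai_used, hours_to_complete).
records = [
    ("tests", True, 1.0), ("tests", False, 2.0),
    ("tests", True, 1.2), ("tests", False, 1.8),
    ("deep-debug", True, 9.0), ("deep-debug", False, 6.5),
    ("deep-debug", True, 8.0), ("deep-debug", False, 7.0),
]

# Group completion times by task type and AI usage.
by_type = defaultdict(lambda: {"ai": [], "no_ai": []})
for task_type, ai, hours in records:
    by_type[task_type]["ai" if ai else "no_ai"].append(hours)

# Report the mean-hours difference per task type: negative = AI helping.
for task_type, groups in by_type.items():
    delta = mean(groups["ai"]) - mean(groups["no_ai"])
    verdict = "AI helping" if delta < 0 else "AI hurting"
    print(f"{task_type}: {delta:+.2f}h mean difference ({verdict})")
```

Even a crude breakdown like this tells you where to allow, restrict, or rethink AI use, instead of treating it as a single global policy.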
The METR study didn’t say “AI is bad.” It said: on this slice of the world (experienced devs, real OSS tasks, 2025 tools), AI made people slower on average, and they didn’t realize it. A year later, that’s still the right caution: measure, segment, and optimize for outcomes so your team can actually see when AI is helping—and when it isn’t.


