The METR Study One Year Later: When AI Actually Slows Developers

In early 2025, METR (Model Evaluation and Threat Research) ran a randomized controlled trial that caught the industry off guard. Experienced open-source developers—people with years on mature, high-star repositories—were randomly assigned to complete real tasks either with AI tools (Cursor Pro with Claude) or without. The result: with AI, they took 19% longer to finish. Yet before the trial they expected AI to make them about 24% faster, and after it they believed they’d been about 20% faster. A 39-point gap between perception and reality.

A year later, that finding still shapes how we should think about AI adoption, especially for teams that aren’t seeing the performance benefits they expected. Here’s what the study actually showed and what to do with it.

What the METR Study Did

METR recruited 16 experienced developers (around 5 years of experience on their projects, working on real open-source repositories with 22,000+ GitHub stars and 1M+ lines of code). Between them they completed 246 real-world tasks—bugs, refactors, enhancements—with each task randomly assigned as “AI allowed” or “AI forbidden.” Participants were paid well ($150/hour) to keep incentives clean. So we’re not talking about toy tasks or casual users; we’re talking about serious work on serious codebases.

The tools in use were Cursor Pro with Claude 3.5/3.7 Sonnet—top-tier 2025 AI coding setups. So the result wasn’t “bad tools slow you down.” It was “even with good tools, on this kind of work, experienced developers were slower on average when they used AI.”

Why AI Slowed Them Down

The study and follow-up work point to a few mechanisms:

  • Prompting and iteration: Time spent crafting prompts, waiting for answers, and going back and forth.
  • Scrutinizing output: AI suggestions look plausible; checking whether they’re correct and fit the codebase takes time. In the study, developers accepted fewer than 44% of AI suggestions.
  • Fixing AI-generated code: When the model was wrong, debugging and correcting it added time.
  • Context mismatch: Large codebases carry a lot of implicit context—conventions, history, constraints. Models don’t have that. So suggestions often don’t fit, and fixing the fit costs time.

So the slowdown wasn’t mysterious. It was the cost of using a powerful but context-blind assistant on complex, context-heavy work.

The Perception Gap and Why It Matters

The striking part was the perception gap. Developers felt faster; the stopwatch said they were slower. That has direct implications for teams “not seeing benefits”:

  1. Self-report is unreliable. If you only ask “Does AI help?” you can get “yes” even when net time goes up. You need outcome metrics (cycle time, time to resolution, quality) to know what’s really happening.
  2. Feeling productive isn’t the same as being productive. The industry has leaned on “developers say they’re faster” to sell tools. The METR result is a reminder that we need to measure.
  3. Experienced developers may be the worst fit for generic AI on hard tasks. The study used experienced devs on substantial tasks. That’s exactly where AI might add overhead (prompting, verification, wrong abstractions) instead of reducing it. Juniors or simpler tasks might tell a different story—and later work has started to unpack task-type and experience effects.
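The perception gap is easy to make concrete with the study’s own numbers. A minimal sketch (the percentages come from the study; expressing them as time multipliers relative to the no-AI baseline is my framing):

```python
# METR's headline numbers as time multipliers vs. the no-AI baseline.
actual_multiplier = 1.19        # measured: tasks took 19% longer with AI
expected_multiplier = 1 - 0.24  # forecast: developers expected to be 24% faster
perceived_multiplier = 1 - 0.20 # post-hoc belief: they felt 20% faster

# Gap between what developers believed afterward and what the stopwatch showed:
perception_gap_points = (actual_multiplier - perceived_multiplier) * 100
print(f"Perceived vs. actual gap: {perception_gap_points:.0f} percentage points")
```

The same developers who were 19% slower walked away believing they had been 20% faster, which is exactly why self-report alone can’t settle the question.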

What We’ve Learned Since

In the year since, a few patterns have become clearer:

  • Task fit matters more than tool. Where AI is used (boilerplate, docs, tests, well-scoped bugs vs. architecture, security, deep debugging) often matters more than which model or IDE you use. The METR tasks were a real-world mix; breaking results down by task type would likely show AI helping on some and hurting on others.
  • Verification cost is central. If you don’t account for the time to review and fix AI output, you’ll overstate gains. Teams that “don’t see benefits” are often teams where verification is eating the gains.
  • Benchmarks and real work can diverge. Copilot and Cursor cite internal or vendor studies showing gains. METR used a different design (RCT, experienced devs, real repos). The takeaway isn’t that one side is “right”—it’s that context (who, what task, how you measure) determines whether AI speeds you up or slows you down.
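One way to see why verification cost is central: model the time for an AI-assisted attempt as generation plus review plus fallback work when the suggestion is rejected. This is an illustrative sketch with hypothetical minute values I made up, not data from the study; only the sub-44% acceptance rate comes from METR:

```python
def net_ai_time(gen_minutes, review_minutes, fix_minutes, accept_rate):
    """Expected minutes for one AI-assisted attempt (toy model).

    Every suggestion costs generation + review time; rejected suggestions
    (probability 1 - accept_rate) additionally cost roughly the work of
    redoing the change by hand, approximated here by fix_minutes.
    """
    return gen_minutes + review_minutes + (1 - accept_rate) * fix_minutes

baseline = 30.0  # hypothetical minutes to do the task unaided
with_ai = net_ai_time(gen_minutes=5, review_minutes=10,
                      fix_minutes=30, accept_rate=0.44)
print(f"baseline={baseline:.0f} min, with AI={with_ai:.1f} min")
# With acceptance below 44%, review plus fallback work can exceed
# the unaided baseline even when generation itself is fast.
```

The point isn’t the specific numbers; it’s that any honest accounting of AI gains has to include review and rework time, and at low acceptance rates those terms dominate.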

What to Do on Your Team

If your team is struggling to see performance benefits from AI, the METR study suggests:

  1. Measure outcomes, not sentiment. Use cycle time, time-to-fix, escaped defects, or other concrete metrics. Compare periods or groups (e.g. AI-heavy vs. light) so you can see if AI is helping or hurting on balance.
  2. Segment by task and experience. Don’t assume “AI helps” or “AI hurts” globally. Try to see where it helps (e.g. docs, tests, routine changes) and where it doesn’t (e.g. complex design, security-sensitive code). Let experienced devs opt out or restrict AI on tasks where they’re faster without it.
  3. Treat verification as first-class. If METR-style slowdowns are from prompting + scrutiny + fixing, then improving verification (clear review rules, targeted testing, better prompts) is how you keep AI from canceling out its own gains.
  4. Don’t default to “use AI more.” For some people and some tasks, “use AI less” might be the right answer. The goal is better outcomes, not higher AI usage.
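A minimal sketch of the segmentation steps 1 and 2 imply, using hypothetical task records (the field names and hour values are assumptions for illustration, not a real tracker’s schema or real data):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cycle-time records exported from an issue tracker.
tasks = [
    {"type": "docs",     "ai": True,  "hours": 1.0},
    {"type": "docs",     "ai": False, "hours": 1.6},
    {"type": "tests",    "ai": True,  "hours": 2.0},
    {"type": "tests",    "ai": False, "hours": 2.4},
    {"type": "refactor", "ai": True,  "hours": 6.5},
    {"type": "refactor", "ai": False, "hours": 5.0},
]

# Group cycle times by (task type, AI usage) and compare means per type.
groups = defaultdict(list)
for t in tasks:
    groups[(t["type"], t["ai"])].append(t["hours"])

for task_type in sorted({t["type"] for t in tasks}):
    ai_mean = mean(groups[(task_type, True)])
    no_ai_mean = mean(groups[(task_type, False)])
    verdict = "helps" if ai_mean < no_ai_mean else "hurts"
    print(f"{task_type:9s} AI {verdict}: {ai_mean:.1f}h vs {no_ai_mean:.1f}h")
```

Even a rough breakdown like this beats a global “does AI help?” verdict: in the made-up data above, AI looks useful for docs and tests but costly for refactors, which is exactly the kind of split the METR result suggests you should look for.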

The METR study didn’t say “AI is bad.” It said: on this slice of the world (experienced devs, real OSS tasks, 2025 tools), AI made people slower on average, and they didn’t realize it. A year later, that’s still the right caution: measure, segment, and optimize for outcomes so your team can actually see when AI is helping—and when it isn’t.
