Specs Win in 2026. The Report Buries a Harder Truth.

I read Anthropic's 2026 Agentic Coding Trends Report twice this week, once on a plane and once at my kitchen table with a coffee going cold beside the laptop, and both times I nodded along with almost every page. Specs beat vibes. Structured intent beats a one-line prompt and a prayer, and the report backs that up with real numbers: teams shipping changes across a 12.5-million-line codebase in a single seven-hour run. I believe it, actually, all of it. But somewhere over Ohio, or maybe it was at the kitchen table, my brain drifted somewhere else entirely: an airport in Denver, in the 1990s, and a baggage system a very good engineering team built exactly to spec.

From the field

Denver International Airport wanted an automated baggage system that could route bags faster than any human crew ever could, built originally for United Airlines at one new concourse, ready for the airport's October 1993 opening, and it hired BAE Automated Systems to build it. They delivered. Real engineering, too: seventeen miles of track, four thousand independent telecars, more than a hundred networked computers keeping every one of them from colliding, compressing what one case study calls "the most complex baggage system ever built" into a two-year build (SEBoK, Denver Airport Baggage Handling System case study).

Then, mid-project, the city expanded the scope: not just United, the whole airport, every airline. Nobody went back to re-run discovery with the carriers who now depended on the same system. The engineering held. The bags didn't. Crushed, jammed, misrouted, because the specification never said what "every airline, every bag, every peak hour" actually meant once the design was already signed, sealed, and frozen. Denver opened sixteen months late, and the delay alone is commonly put at roughly $560 million on top of the system's own cost (University of Maryland, Denver Airport case study).

The spec wasn't bad. It was excellent, for a requirement that stopped being true the week the scope changed, and nobody went back to ask what that change actually required.

Anthropic's report doesn't have a Denver in it. Not one case study, not Rakuten, not Augment Code, not Fountain, three real companies doing genuinely impressive things with Claude Code, where an agent had to rescue a team from its own wrong requirement. Every single example starts the same way: a team that already knew exactly what needed building, so the agent never had to find the requirement, only execute one a human had already nailed down, precisely, before it opened a single file. Go check yourself if you don't believe me. I did. Twice, actually, because I assumed I'd missed one the first time through.

None of this is a knock on Anthropic, or on GitHub, or on any of the spec-driven tools racing to ship this quarter. Truly. It's just math, unglamorous math nobody wants to hear. A spec is a precise encoding of a decision somebody already made, nothing more mystical than that. Encode the wrong decision precisely enough, and an agent will build the wrong thing faster, cleaner, and with more confidence than any human ever could on a bad Tuesday, which is the part I keep having to remind myself of whenever a demo looks this good.

What did Anthropic's 2026 report actually prove about specs?

It proved specs work better than prompts, decisively, and it proved teams are moving fast because of it. Engineers now use AI in roughly 60% of their work, according to the report, and about 27% of that work is stuff nobody would have attempted without an agent doing the heavy lifting. That's real leverage. Spec Kit alone has become the default starting point for teams who got burned by vibe coding once and decided never again, and AWS Kiro, BMAD, Tessl, and Google's Antigravity all shipped their own flavor of the same idea inside the last year. The report's real finding, though, sits one layer under the headline, and it took me a second read to catch it: process discipline compounds. Judgment doesn't. Nobody's spec tool invented judgment a team didn't already have walking in the door.

60% / 0-20%

of engineering work now involves AI, yet engineers report being able to fully delegate only 0 to 20% of tasks without active supervision. The mechanics scaled. The judgment didn't move

Source: Anthropic, 2026 Agentic Coding Trends Report

I want to be precise about what that spread means, because it's easy to misread as a complaint about the tools. It isn't. Not even close. I'd say so if it were, I'm not shy about calling out a bad tool when I see one. A 12.5-million-line codebase run in seven hours is a real achievement, not a rounding error, and nobody at Specira wants agents to move slower. The point is narrower than that, and harder to dismiss: the report measures execution capacity exploding while delegation of judgment barely twitches, and those are two very different curves that happen to share one headline.

Why does a precisely written spec still ship the wrong thing?

Because a spec is a translation, not a discovery process. It takes a decision that already exists somewhere, in a stakeholder's head, in a policy nobody wrote down, in an assumption baked in during a hallway conversation three weeks before anyone opened Spec Kit, and turns it into something an agent can execute against. Translation can be flawless. The thing being translated can still be wrong.

A spec only encodes a decision somebody already made. Encode the wrong one precisely enough, and an agent will build it faster, cleaner, and more confidently than any human ever could.

What is the delegation gap really measuring?

It's measuring exactly where judgment still lives, and the number is not subtle. Anthropic's own researchers found engineers use AI in roughly 60% of their work, yet report being able to fully delegate only 0 to 20% of tasks without active supervision, prompting, and validation. That gap, forty points wide at its narrowest, is not a rounding error or a temporary tooling limitation. It's the space where somebody still has to decide what "correct" means before the agent can chase it.

Anthropic also found that about 27% of AI-assisted work is stuff nobody would have attempted at all without an agent doing the heavy lifting. Read that next to the delegation number and a pattern appears. Teams are attempting more, faster, with the same discovery bandwidth they had before. Nobody added headcount to the requirements-gathering side just because the coding side got a tenfold multiplier. More decisions get made per week. The scrutiny per decision didn't scale with it.

I used to think the fix was simply "slow down before you spec." That's not quite right, and saying it that way blames the wrong layer. Nobody wants slower agents, and nobody should. The fix is making the decision layer, the part that happens before Spec Kit or Kiro ever opens a file, as rigorous as the execution layer already became. Structured discovery that runs in days. Not months. Just real enough to catch the wrong assumption before anyone formalizes it.

What happens when execution outruns discovery?

Speed doesn't wait politely for judgment to catch up. When code generation is nearly instant and discovery still takes actual conversations with actual stakeholders, actual conflicting priorities, and actual exceptions nobody wrote down, the natural failure mode isn't "the agent writes bad code." It's "the team rushes the one step that was never automatable in the first place," and ships the resulting spec with total confidence. We already watched a version of this happen once, when the industry declared vibe coding dead and spec-driven development its replacement, and I wrote about why that framing was only half right in Vibe Coding Is Over. The Specs Are Still Broken. Moving from vibes to specs changed where the ambiguity lives. It didn't remove it.

Where does the quality ceiling actually sit?

It sits at discovery, full stop, and every layer built on top of it inherits whatever height it was given. Spec-driven tools raised the floor: fewer typos, fewer forgotten edge cases inside a given requirement, faster iteration once the decision is locked. None of that raises the ceiling, because the ceiling was never a formatting problem. I made this same argument back when the industry was busy calling AWS Kiro's own numbers a spec-quality win, in Your Specs Are the Problem. Not Your AI Agent, and Anthropic's 2026 data is the same lesson wearing a bigger, better-funded report.

Spec-driven tools formalized the middle stage. The first stage sets the ceiling, and no amount of formatting downstream raises it.

This is the layer Specira operates in: above the spec, before the plan, at the point where a requirement either gets validated by structured, multi-expert discovery or gets waved through because the deadline was Tuesday. Get that layer right and Spec Kit, Kiro, Claude Code, whatever tool wins next quarter, all inherit a ceiling worth building up to. Get it wrong, and you've just automated the fastest possible way to ship the wrong thing.

Specs won. The ceiling didn't move.

Anthropic's 2026 Agentic Coding Trends Report and the Spec Kit surge both prove structured, spec-first workflows beat ad-hoc prompting. Neither one claims to fix requirement quality, and the report's own numbers show why: engineers use AI in about 60% of their work but can fully delegate only 0 to 20% of tasks, the exact space where discovery and judgment still live.

A spec only encodes a decision somebody already made. Encode the wrong one precisely, and an agent ships it faster and with more confidence than a vague prompt ever could. The quality ceiling was never the spec format. It's set upstream, at discovery, which is exactly where Specira operates.

I'll be reading next year's version of this report too, and I expect it to say something similar: execution keeps getting faster, more autonomous, more impressive. Good. I hope it does. Just don't mistake a faster, more confident spec for a validated one, because the report proving specs win was never the same report proving your requirement was right, and pretending otherwise is how a good tool ships a bad decision at record speed.

What are the most common questions about spec-driven development's limits?

The report found that structured, spec-first workflows outperform ad-hoc prompting by wide margins, with teams completing changes across codebases as large as 12.5 million lines of code in hours instead of months. It also found engineers use AI in roughly 60% of their work but can fully delegate only 0 to 20% of tasks, and that about 27% of AI-assisted work would not have been attempted at all without it. The tooling improved. The report never claims it fixed requirement quality.

A spec formalizes a decision. It does not validate the decision. If the requirement behind the spec was wrong, incomplete, or built on an assumption nobody checked, a precisely written spec just carries that error into code faster and with more confidence than a vague prompt would have. Precision is not the same thing as correctness, and spec-driven tools were built to fix the first problem, not the second.

It is the spread between how much of their work engineers now do with AI, roughly 60%, and how much they can fully hand off without supervision, only 0 to 20%. That gap is where judgment, discovery, and requirement validation still live. Spec-driven development formalizes execution. It has not closed the delegation gap, and that gap is the clearest evidence that the decision layer above the spec still needs a human, or a system built specifically to do that discovery work.

They solve a different, real problem: turning an agreed decision into a structured, executable artifact instead of a vague chat prompt. That is valuable, and adoption shows teams know it. None of them are built to validate whether the underlying requirement is correct, complete, or still true after scope changes. AWS Kiro's own Requirements Analysis found roughly 60% of first-draft requirements needed refinement even inside a spec-first workflow, the tooling itself admitting the gap exists.

Spec format is structure: sections, acceptance criteria, EARS syntax, a plan an agent can parse. Spec quality is whether the decisions inside that structure are the right ones, validated against real stakeholders, real constraints, and real exceptions. Spec-driven tools standardized the format. Nothing about better formatting tells you the requirement inside it was ever correct in the first place.

Requirements intelligence sits upstream of the spec, running the structured discovery, multiple expert perspectives, stakeholder validation, and conflict surfacing, that determines whether the decision being formalized is the right one. Spec Kit, Kiro, and similar tools take over from there, turning a validated decision into a spec an agent can execute. One layer raises the ceiling. The other layer builds fast underneath it.

Nicolas Payette

CEO and Founder, Specira AI

Nicolas Payette has spent 25 years in enterprise software delivery, leading digital transformations at companies like Technology Evaluation Centers and Optimal Solutions. He founded Specira AI to solve the root cause of project failure: unclear requirements, not slow code.

What did Anthropic's 2026 report actually prove about specs?

Why does a precisely written spec still ship the wrong thing?

What is the delegation gap really measuring?

What happens when execution outruns discovery?

Where does the quality ceiling actually sit?

Specs won. The ceiling didn't move.

What are the most common questions about spec-driven development's limits?

Related reading

Your Specs Are the Problem. Not Your AI Agent.

Vibe Coding Is Over. The Specs Are Still Broken.

Raise the ceiling before you format the spec.