Six months ago, "vibe coding" was the meme that captured the moment — type a sentence, watch Claude or Cursor produce a working feature, ship it, repeat. It's fun. It's genuinely productive for prototypes. And it falls apart the second you put it inside a real codebase with real reviewers, real tests, and real users. We've spent the first half of 2026 watching teams collide with that wall — including a few we were brought in to clean up after. The pattern that actually works at scale has a name now: spec-driven development.
The shift in one line
Stop prompting for code. Start prompting for a spec, then let the agent generate the code, the tests AND the verification — and review the spec, not every line.
Why vibe coding stalls in real codebases
On a greenfield project of 5 files, you can hold the whole thing in your head and eyeball the diff. On a codebase of 500 files with 4 developers, three problems compound fast:
- Reviewers can't tell what the AI was trying to do, only what it did — so review either becomes rubber-stamping or rewriting from scratch
- Tests get written as an afterthought (or by the same prompt that wrote the code, which is its own circular trap)
- Architectural drift sets in within weeks — every feature is locally plausible, globally inconsistent
The fix isn't "use a smarter model." The fix is changing what the human writes and what the agent produces.
The spec-driven loop, in four steps
1. Write a short spec: what changes, why, and what observable behaviour proves it works. 10–40 lines of markdown, not a 20-page PRD
2. Agent drafts the plan: files to touch, tests to add, migration steps. Human approves or edits the plan, not the code
3. Agent implements + runs verification: generates the code, runs the test suite, captures the output. If verification fails, agent iterates inside its own loop until it passes
4. Human reviews the spec-to-diff delta: does the diff actually do what the spec promised? Anything outside the spec is a red flag
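The third step, implement then verify then retry, can be sketched as a small control loop. This is a minimal illustration and not any particular agent's API: `implement` and `verify` are hypothetical callables standing in for "ask the agent to change the code" and "run the test suite", and the `max_attempts` cap is our own assumption so that a stuck agent hands back to a human instead of looping indefinitely.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VerificationResult:
    passed: bool
    output: str  # captured test output, fed back to the agent on failure


def implement_until_verified(
    spec: str,
    implement: Callable[[str, str], None],   # hypothetical: (spec, feedback) -> edits the code
    verify: Callable[[], VerificationResult],  # hypothetical: runs the test suite
    max_attempts: int = 5,                   # assumed cap, not part of the original loop
) -> VerificationResult:
    """Run implement/verify rounds until verification passes or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback = ""  # empty on the first round; later rounds carry the failure output
    for _ in range(max_attempts):
        implement(spec, feedback)
        result = verify()
        if result.passed:
            return result
        feedback = result.output
    return result  # still failing: escalate to a human rather than keep iterating
```

The returned `output` is what a reviewer would later read as the captured verification evidence.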
Notice what the human is doing: writing intent, approving plans, reviewing against intent. Notice what the human is not doing: typing implementation, debugging compile errors, hand-writing tests for trivial logic. That's where the productivity actually comes from in 2026 — not from typing faster, but from moving up a level of abstraction.
What a real spec looks like
Example spec (from a recent CruxBit engagement)
"Add per-organisation rate limiting to the public API. 1000 requests / minute / org, sliding window, Redis-backed. On limit hit: return 429 with Retry-After header. Must not affect requests on the authed dashboard endpoints. Verification: integration test that fires 1001 requests and asserts the 1001st returns 429; existing dashboard tests still pass."
That's the whole thing. Five sentences. The agent now has everything it needs to draft a plan, ask clarifying questions (which it usually does), implement, and verify. The reviewer has a fixed bar to measure the diff against: "does this implement the spec, and only the spec?"
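To make the spec's behaviour concrete, here is a minimal sketch of a sliding-window limiter that satisfies the verification clause. It is an illustration under stated assumptions, not the code from the engagement: it keeps timestamps in memory instead of Redis, takes the clock as a parameter so it can be tested deterministically, and uses a hypothetical `PUBLIC_API_PREFIX` to separate public API routes from dashboard routes.

```python
import math
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple

PUBLIC_API_PREFIX = "/api/"  # hypothetical; the spec does not name the routes


class SlidingWindowLimiter:
    """In-memory stand-in for the Redis-backed limiter described in the spec."""

    def __init__(self, limit: int = 1000, window_seconds: float = 60.0) -> None:
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self.window = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def check(self, org_id: str, now: float) -> Tuple[int, Dict[str, str]]:
        hits = self._hits[org_id]
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            # Seconds until the oldest counted request leaves the window.
            retry_after = max(1, math.ceil(hits[0] + self.window - now))
            return 429, {"Retry-After": str(retry_after)}
        hits.append(now)
        return 200, {}


def handle(limiter: SlidingWindowLimiter, path: str, org_id: str, now: float):
    """Rate-limit public API paths only; dashboard paths pass through untouched."""
    if not path.startswith(PUBLIC_API_PREFIX):
        return 200, {}
    return limiter.check(org_id, now)
```

A test in the spirit of the verification clause would call `handle` 1001 times for one organisation inside a single window and assert that only the last call returns 429, then confirm a dashboard path is unaffected.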
Tooling that makes this practical
You don't need a new product to do spec-driven development — but a few things help enormously:
- An agent that can actually run the code: Claude Code, Cursor Agent, Devin, or your own MCP-wired setup. Verification-in-the-loop is the entire point
- A specs/ folder in the repo — every non-trivial change lands with the spec it was built from, in markdown, alongside the code. Future archaeology gets much easier
- A pre-merge checklist tied to the spec — "does the diff match the spec?" as an explicit reviewer step, ideally with the spec inline in the PR description
- Evals where it matters — for AI-built AI features especially, but also for any agent-heavy refactor: a small eval suite catches the "works locally, broken under real input" class of regression
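The "spec lands with the change" convention can be backed by a small pre-merge check. The sketch below is one possible shape, assuming the list of changed paths is supplied by your CI system; the `specs` directory name follows the text, while `CODE_SUFFIXES` is a hypothetical setting you would adapt to your stack. It cannot judge whether a change is trivial, so treat a failure as a prompt for the reviewer, not a hard rule.

```python
from pathlib import PurePosixPath
from typing import Iterable

SPEC_DIR = "specs"                      # folder name taken from the convention above
CODE_SUFFIXES = {".py", ".ts", ".go"}   # hypothetical; adjust to your languages


def needs_spec(changed_files: Iterable[str]) -> bool:
    """True when code changed but no markdown spec under specs/ came with it."""
    paths = [PurePosixPath(p) for p in changed_files]
    touches_code = any(p.suffix in CODE_SUFFIXES for p in paths)
    has_spec = any(
        len(p.parts) > 1 and p.parts[0] == SPEC_DIR and p.suffix == ".md"
        for p in paths
    )
    return touches_code and not has_spec
```

For example, `needs_spec(["src/api/limits.py"])` is true, while adding `"specs/rate-limiting.md"` to the same list makes it false.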
What changes for reviewers
The honest part of this transition is that review gets weirder before it gets better. You stop reading every line and start reading the spec, scanning the diff for surprises, and trusting the test output. That feels reckless the first few times. The teams who push through it ship faster within a sprint or two. The teams who don't push through end up either rubber-stamping (which is worse than the old way) or re-typing the AI's output (which is just slow vibe coding).
Heuristic that works
If the diff contains anything you can't map back to a sentence in the spec, ask the agent to either remove it or amend the spec to justify it. Out-of-scope creep is the single biggest source of AI-generated tech debt.
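A coarse, file-level version of this heuristic can be automated by comparing the files in the diff with the files the approved plan said it would touch. This is only a proxy, assuming the plan lists paths explicitly: the function name and inputs are hypothetical, and mapping individual changes back to sentences in the spec still needs a human reader.

```python
from typing import Iterable, List


def out_of_plan(changed_files: Iterable[str], planned_files: Iterable[str]) -> List[str]:
    """Return changed paths that the approved plan never mentioned, sorted."""
    planned = set(planned_files)
    return sorted(f for f in set(changed_files) if f not in planned)
```

An empty result does not prove the diff is in scope; a non-empty one is a cheap signal to ask the agent to remove the change or amend the spec.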
Failure modes we see often
1. Specs that are too vague: "improve the checkout flow" is a wish, not a spec. The agent will hallucinate scope and the reviewer has no fixed bar
2. Specs without a verification clause: without "how do we know it worked," the agent has no termination condition and the human has no acceptance criterion
3. Skipping the plan step: letting the agent jump from spec to diff hides the architectural choices. The plan is where the cheap iteration happens
4. Letting the agent write the spec AND the code: circular grading. The spec is the human's job; only the human knows what the product is supposed to do
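Two of these failure modes are mechanical enough to catch before an agent ever sees the spec. The sketch below is a hypothetical lint, not an existing tool: it assumes specs mark their acceptance criterion with the word "Verification:" as in the example spec earlier, and it borrows the 40-line ceiling from the loop description. Vagueness itself is a judgment call and is not checked.

```python
import re
from typing import List

MAX_SPEC_LINES = 40  # upper bound suggested earlier in the article
VERIFICATION_RE = re.compile(r"\bverification\s*:", re.IGNORECASE)  # assumed marker


def lint_spec(text: str) -> List[str]:
    """Return a list of problems found in a spec; an empty list means it passes."""
    problems: List[str] = []
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        problems.append("spec is empty")
    if len(lines) > MAX_SPEC_LINES:
        problems.append(f"spec is longer than {MAX_SPEC_LINES} lines")
    if not VERIFICATION_RE.search(text):
        problems.append("no verification clause")
    return problems
```

Run against "Improve the checkout flow." this reports the missing verification clause; it says nothing about whether the sentence is a wish or a spec, which remains the author's call.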
Where this lands by end of 2026
Our prediction, watching the tooling roadmaps: by Q4, "spec" becomes a first-class artifact in PR workflows the same way "description" is today. GitHub, Linear, Cursor and Claude Code are all converging on this: spec-in, code-out, with the spec preserved alongside the merged change. We expect teams that adopt the loop early to be reviewing roughly 3x the PRs in the same time, with lower defect rates, because the review surface area is smaller and more semantic.
TL;DR
- Vibe coding is great for demos, brittle in production codebases
- Spec-driven development = short markdown spec → plan → AI implements + verifies → human reviews diff vs spec
- Humans write intent and approve plans. Agents write code, tests and verification
- Keep specs in-repo alongside the code they shipped — future readers will thank you
- Out-of-scope diff content is the smell that tells you the loop is broken
- The productivity win is moving up an abstraction level, not typing faster
We've been rolling out spec-driven workflows for client teams since the start of the year. If you're trying to figure out how to get your engineers from vibe-coding-in-IDE to shipping-AI-code-at-team-scale, drop us a paragraph about your stack — we'll send back a candid take on the smallest change that gets you the biggest lift.