Agentic Coding in Production · 2026 Field Guide

Six months into 2026, the question we get from engineering leads has changed. It used to be "should we let AI write code?" That argument is over; everyone does. Now it's "how do we let agents write code without it blowing up in production?" That's a much harder question, and the honest answer is that most teams are improvising. They've got an agent merrily editing files and no real process around it. This post is the process: the actual loop we run on paid client work, end to end, with the guardrails that keep it from going sideways.

Who this is for

Engineering leads and founders who already use a coding agent (Claude, Codex, Gemini, Cursor — pick your poison) for prototypes and now want to put it on a real codebase with real reviewers, real tests and real users. If you're still at the "is this allowed" stage, read our vibe coding and spec-driven posts first; this one assumes you're past that.

The shift: from autocomplete to orchestration

The first wave of AI coding was autocomplete — it finished your line. The second wave was chat — you asked, it answered. What landed in 2026 is the third wave: agents that plan a task, edit multiple files, run the tests, read the failures, and fix themselves before they ever show you a diff. The model isn't a smarter autocomplete anymore; it's a junior engineer who works fast, never gets tired, and will confidently ship something subtly wrong if you let it.

That last clause is the whole game. An agent that writes 800 lines in four minutes is only an asset if the 800 lines are correct, reviewable and reversible. Everything below exists to make that true.

“Treating an agent like a faster typist is how you generate technical debt at machine speed. Treat it like a teammate who needs a spec, a sandbox and a code review.”

The loop: plan → execute → verify → review

Every reliable agentic workflow we've seen — ours and our clients' — collapses to the same four stages. The tools differ. The loop doesn't.

The agentic coding loop

  ┌───────────┐     ┌────────────┐     ┌───────────┐     ┌───────────┐
  │  1. PLAN  │ ──▶ │ 2. EXECUTE │ ──▶ │ 3. VERIFY │ ──▶ │ 4. REVIEW │
  │  (human + │     │  (agent in │     │ (CI: tests│     │  (human   │
  │   agent)  │     │  a sandbox)│     │  + scans) │     │  on-loop) │
  └───────────┘     └────────────┘     └───────────┘     └───────────┘
        ▲                                    │                 │
        └───────────── fix & retry ◀─────────┘                 │
        └───────────────── reject / re-spec ◀──────────────────┘

Stage 1 — Plan (human sets intent, agent drafts the plan)

Never let an agent jump straight to code on anything non-trivial. Have it produce a plan first: which files it'll touch, what the approach is, what it's explicitly NOT doing. You read the plan — not the code — and correct the intent there. Catching a wrong assumption in a five-line plan costs seconds; catching it in 600 lines of finished code costs an afternoon.

Stage 2 — Execute (agent works in a sandbox)

The agent edits files and runs tests in an isolated environment — a branch, a worktree, a container — never directly on main, never with production credentials in scope. It self-corrects against the test output in a loop until it's green or stuck. This is where the speed comes from, and it's safe precisely because it's sandboxed.
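The "green or stuck" loop is worth bounding explicitly. Here is a minimal sketch, assuming two hypothetical callables that your harness would supply: `run_tests` (runs the suite inside the sandbox) and `apply_fix` (hands the failure output back to the agent). Neither is a real agent API; the names and the attempt limit are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SuiteResult:
    passed: bool
    output: str  # raw test output that gets fed back to the agent


def run_until_green(
    apply_fix: Callable[[str], None],
    run_tests: Callable[[], SuiteResult],
    max_attempts: int = 5,  # hypothetical default; tune per task
) -> str:
    """Return "green" once the suite passes, or "stuck" when progress stops."""
    last_output: Optional[str] = None
    for _ in range(max_attempts):
        result = run_tests()
        if result.passed:
            return "green"
        if result.output == last_output:
            return "stuck"  # identical failure twice in a row: no progress
        last_output = result.output
        apply_fix(result.output)
    # Attempt budget used up: one final check decides the outcome.
    return "green" if run_tests().passed else "stuck"
```

A "stuck" result is the signal to hand the task back to a human instead of letting the agent keep burning tokens.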

Stage 3 — Verify (the same CI gate as humans)

Agent-written code goes through the exact same CI as human-written code — and then some. Same linters, same type checks, same test suite, same security scans. If anything, agents need stricter gates, because they'll happily delete a failing test to make the suite pass if your prompt was sloppy. Make the CI the source of truth, not the agent's claim that it's done.
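One stricter gate you can add cheaply is a check that the set of tests did not shrink. The sketch assumes your CI can list test IDs for the base branch and for the agent's branch; how you collect them depends on your test runner, and the function names here are hypothetical.

```python
from __future__ import annotations


def removed_tests(base_tests: set[str], branch_tests: set[str]) -> list[str]:
    """Test IDs present on the base branch but missing from the agent's branch."""
    return sorted(base_tests - branch_tests)


def verify_gate(
    base_tests: set[str], branch_tests: set[str], suite_passed: bool
) -> tuple[bool, str]:
    """Fail the gate if tests disappeared, even when the remaining suite is green."""
    missing = removed_tests(base_tests, branch_tests)
    if missing:
        return False, "tests removed: " + ", ".join(missing)
    if not suite_passed:
        return False, "test suite failed"
    return True, "ok"
```

Legitimate renames and removals will trip this too, which is the point: they become a decision a human makes at review rather than something the agent does silently.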

Stage 4 — Review (a human stays on the loop)

Note: on the loop, not in the loop. You're not approving every keystroke — you're reviewing the finished, CI-green diff against the original intent before it merges. The right question at review isn't "is each line correct" (CI handles a lot of that), it's "did this solve the actual problem, and did it quietly do anything I didn't ask for?"
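Part of "did it quietly do anything I didn't ask for?" can be checked mechanically if the plan from Stage 1 is kept as data. This is a rough sketch: the `Plan` shape and the glob patterns are assumptions for illustration, not a standard format.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class Plan:
    goal: str
    files_to_touch: list[str]  # glob patterns the agent declared up front
    non_goals: list[str] = field(default_factory=list)


def unplanned_changes(plan: Plan, changed_files: list[str]) -> list[str]:
    """Changed paths that match none of the patterns declared in the plan."""
    return [
        path
        for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in plan.files_to_touch)
    ]


# Hypothetical example: the plan covered billing, the diff also touched auth.
plan = Plan("Fix rounding bug", ["src/billing/*"], ["no refactors"])
extra = unplanned_changes(plan, ["src/billing/invoice.ts", "src/auth/session.ts"])
```

Anything in `extra` is not automatically wrong, but it is where the reviewer should look first.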

AGENTS.md: the file that does the most work

The single highest-leverage thing you can do this quarter is write a good AGENTS.md. It has become the de facto standard in 2026: a plain markdown file at the root of your repo (and optionally per package in a monorepo) that tells any coding agent how your codebase works, covering conventions, commands, do's and don'ts. Most major agents read it or can be pointed at it. It is the difference between an agent that fits your codebase and one that fights it.

AGENTS.md

# Project: Acme Dashboard

## Stack
- Next.js (App Router), TypeScript strict, Tailwind
- Postgres via Prisma. NEVER write raw SQL — use the Prisma client.

## Commands
- Install: `pnpm install`
- Dev: `pnpm dev`
- Test (must pass before any PR): `pnpm test`
- Typecheck (must pass): `pnpm typecheck`
- Lint + format: `pnpm lint --fix`

## Conventions
- Components in `src/components`, one per file, PascalCase.
- Server actions go in `src/app/**/actions.ts`, never in components.
- All money is stored in paise (integers). Never use floats for currency.

## Hard rules
- Do NOT edit files in `src/generated/` — they are codegen output.
- Do NOT add a dependency without flagging it in the PR description.
- Do NOT delete or skip a failing test to make CI pass. Fix the cause.

Why this beats a clever prompt

A prompt is gone after one task. AGENTS.md is read on every task, by every agent, forever. Put your conventions in the repo, not in your head — the same way a good onboarding doc makes a new human hire productive on day one instead of week three.
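AGENTS.md states the rules; it does not enforce them. A hard rule such as "do not edit `src/generated/`" can be backed by a small CI check. The sketch assumes you already have the list of changed paths from your diff tooling, and the function name is hypothetical.

```python
from __future__ import annotations

# Directories the sample AGENTS.md marks as off-limits.
PROTECTED_DIRS = ("src/generated",)


def protected_edits(
    changed_files: list[str], protected: tuple[str, ...] = PROTECTED_DIRS
) -> list[str]:
    """Return changed paths that sit inside a protected directory."""
    return [
        path
        for path in changed_files
        if any(path == d or path.startswith(d + "/") for d in protected)
    ]
```

Fail the build when the returned list is non-empty, so the rule holds even when an agent ignores the file.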

Architecture: separate reasoning, action and validation

When you move past single tasks into actual agentic features — agents running in your product, not just in your editor — the biggest mistake is cramming all the logic into one giant prompt. Brittle, untestable, impossible to debug. The pattern that holds up splits the system into three layers:

Reasoning — the model decides what to do. Keep this layer thin and swappable; this is the part you'll want to upgrade as new models ship (and in 2026 they ship monthly).
Action — the tools/APIs the agent can call. Each tool is a normal function with normal validation. The agent doesn't get raw database access; it gets a small, audited set of capabilities.
Validation — a deterministic check on every action before it commits. Not the model marking its own homework — real code that verifies the output is sane.

Split this way, you can swap the model without rewriting your tools, test each tool in isolation, and trace exactly where a bad output came from. Glue it all into one prompt and you can do none of those things.
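A compact sketch of the three layers, assuming a hypothetical refund tool and a stand-in function where the model call would go. The tool name, the amount limit and the `Action` shape are all illustrative.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    tool: str
    args: dict


# Reasoning layer: anything that turns a request into a proposed action.
# In a real system this wraps the model call; here it is a plain callable.
Reasoner = Callable[[str], Action]


# Action layer: ordinary functions, each a narrow capability.
def refund(order_id: str, amount_paise: int) -> str:
    return f"refunded {amount_paise} paise on {order_id}"


TOOLS = {"refund": refund}
MAX_REFUND_PAISE = 500_000  # hypothetical business limit


# Validation layer: deterministic code, run before anything commits.
def validate(action: Action) -> None:
    if action.tool not in TOOLS:
        raise ValueError(f"unknown tool: {action.tool}")
    if action.tool == "refund":
        amount = action.args.get("amount_paise")
        if type(amount) is not int or not 0 < amount <= MAX_REFUND_PAISE:
            raise ValueError(f"refund amount out of range: {amount!r}")


def handle(request: str, reason: Reasoner) -> str:
    action = reason(request)  # 1. the model proposes
    validate(action)          # 2. code checks the proposal
    return TOOLS[action.tool](**action.args)  # 3. the tool acts


def fake_model(request: str) -> Action:
    """Stand-in for the model so the tools and validator can be tested alone."""
    return Action("refund", {"order_id": "A1", "amount_paise": 4999})
```

Because `handle` only depends on the `Reasoner` signature, swapping the model means swapping one callable; the tools and the validator stay untouched.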

Observability is not optional anymore

An agent in production is a non-deterministic system making decisions on live data. If you can't see what it's doing, you're flying blind. Treat AI observability as a first-class DevOps function, the same way you treat APM. At minimum, log every one of these:

1. The full prompt and context sent to the model (redact secrets). You cannot debug what you can't see.
2. Every tool call the agent made, with arguments and results.
3. Token count and cost per request, tagged by user and feature, so a runaway loop shows up as a cost spike before it shows up as a bill.
4. Latency per step. Agent loops fail slowly; you want to catch the one that's spiralling.
5. A trace ID that ties the whole run together, so a support ticket maps to an exact run you can replay.
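As a rough illustration, those five fields fit in one structured log record per step. The class, the field names and the redaction pattern are assumptions made for this sketch; a real setup would ship the records to your existing logging or tracing backend and use a proper secret scanner.

```python
import json
import re
import time
import uuid
from typing import Any, Callable, Optional

# Crude illustrative pattern only; it does not catch every secret format.
SECRET = re.compile(r"(api[_-]?key|token|password)\s*[=:]\s*\S+", re.IGNORECASE)


def redact(text: str) -> str:
    return SECRET.sub(lambda m: m.group(1) + "=[REDACTED]", text)


class RunLog:
    """One instance per agent run; every step shares the same trace ID."""

    def __init__(self, user: str, feature: str,
                 sink: Callable[[str], Any] = print) -> None:
        self.trace_id = uuid.uuid4().hex
        self.user = user
        self.feature = feature
        self.sink = sink  # where the JSON lines go (stdout by default)

    def record(self, step: str, *, prompt: str = "", tool: Optional[str] = None,
               args: Optional[dict] = None, result: Any = None,
               tokens: int = 0, cost_micros: int = 0,
               latency_ms: float = 0.0) -> dict:
        entry = {
            "trace_id": self.trace_id,
            "ts": time.time(),
            "user": self.user,
            "feature": self.feature,
            "step": step,
            "prompt": redact(prompt),   # only the prompt is redacted here
            "tool": tool,
            "args": args,
            "result": result,
            "tokens": tokens,
            "cost_micros": cost_micros,  # integer micro-dollars, not floats
            "latency_ms": latency_ms,
        }
        self.sink(json.dumps(entry, default=str))
        return entry
```

Tool arguments and results can carry secrets as well, so in practice they would go through the same redaction step.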

The cost trap

Agentic loops fire many model calls per task. A bug that adds one extra reflection step to every run can quietly double your bill. We've walked into more than one client engagement where the AI feature was "working fine" and costing 4x what it should, because nobody was watching cost per run. Set hard budget caps per user and per endpoint from day one.
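A hard cap can be very small. This sketch keeps spend in memory, keyed by user and endpoint, and counts in integer micro-dollars to avoid float drift. The class name and the limit are hypothetical, and a production version would need shared storage and a reset window.

```python
from __future__ import annotations


class BudgetExceeded(Exception):
    pass


class BudgetCap:
    """Hard spend limit per (user, endpoint) pair, in integer micro-dollars."""

    def __init__(self, limit_micros: int) -> None:
        self.limit_micros = limit_micros
        self.spent: dict[tuple[str, str], int] = {}

    def charge(self, user: str, endpoint: str, cost_micros: int) -> int:
        """Record a model call's cost; raise instead of going over the cap."""
        key = (user, endpoint)
        total = self.spent.get(key, 0) + cost_micros
        if total > self.limit_micros:
            raise BudgetExceeded(f"{user} on {endpoint} would exceed the cap")
        self.spent[key] = total
        return self.limit_micros - total  # remaining budget
```

Calling `charge` before each model call, with an estimated cost, means a runaway loop stops at the cap rather than at the invoice.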

Security: the new attack surface

Agents that take actions are a genuinely new security surface, and most teams under-think it. The three we hammer on with clients:

Prompt injection — if your agent reads untrusted input (a user's file, a web page, an email), assume that input will try to hijack it. Never give an agent both untrusted input AND high-privilege tools in the same context. Separate the concerns.
Least privilege, always — the agent gets the narrowest set of tools and credentials that the task needs, scoped and time-limited. No standing admin access. If a tool can delete data, it needs a confirmation gate or a dry-run mode.
Audit everything — every action an agent takes should be attributable and reversible. "The AI did it" is not an acceptable line in an incident review.
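Least privilege, the confirmation gate and the audit trail can share one choke point. In this sketch the agent only ever sees a `ToolBox` holding an explicit allow-list; destructive tools default to a dry run unless a caller outside the model passes `confirmed=True`. All names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[..., str]
    destructive: bool = False  # destructive tools need explicit confirmation


class ToolBox:
    """The only way the agent can act: an allow-list plus an audit trail."""

    def __init__(self, allowed: list[Tool], audit: list[dict]) -> None:
        self._tools = {tool.name: tool for tool in allowed}
        self.audit = audit

    def call(self, name: str, /, *, confirmed: bool = False, **args) -> str:
        tool = self._tools.get(name)
        if tool is None:
            raise PermissionError(f"tool not granted: {name}")
        dry_run = tool.destructive and not confirmed
        # Every attempt is recorded, including dry runs.
        self.audit.append({"tool": name, "args": args, "dry_run": dry_run})
        if dry_run:
            return f"DRY RUN: would call {name} with {args}"
        return tool.run(**args)
```

The `confirmed` flag should come from a human approval step or a deterministic policy check, never from the model's own output.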

The failure modes nobody warns you about

These are the ones that actually bite, drawn from real cleanups:

Confidently wrong, beautifully formatted. The output looks polished, reads well, and is subtly incorrect. Polish is not evidence of correctness; passing tests are. This is why the CI gate is non-negotiable.
Silent scope creep. You asked it to fix a bug; it also "helpfully" refactored three other files. Always review the full diff, not just the part you expected to change.
Test deletion to go green. A sloppy "make the tests pass" prompt invites the agent to delete the failing test. Put the hard rule in AGENTS.md and watch for shrinking test counts in review.
Dependency sprawl. Agents reach for a new package rather than writing ten lines of code. Require every new dependency to be flagged in the PR, with a justification for its weight.
Context rot on long tasks. The longer the run, the more the agent forgets the early constraints. Break big tasks into smaller, verifiable chunks rather than one heroic mega-prompt.
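Some of these can be caught before a human looks at the diff. As one example, here is a sketch of a dependency-sprawl check that compares `package.json` on the base branch and on the agent's branch, then reports new packages the PR description never mentions. The function names and the sample manifests are hypothetical, and the substring match is deliberately crude.

```python
from __future__ import annotations

import json


def dependency_names(package_json: str) -> set[str]:
    pkg = json.loads(package_json)
    return set(pkg.get("dependencies", {})) | set(pkg.get("devDependencies", {}))


def unflagged_dependencies(
    base_pkg: str, branch_pkg: str, pr_description: str
) -> list[str]:
    """New packages on the branch that the PR description does not mention."""
    added = dependency_names(branch_pkg) - dependency_names(base_pkg)
    # Plain substring match: good enough for a sketch, too loose for short names.
    return sorted(name for name in added if name not in pr_description)


# Hypothetical manifests for illustration only.
base = '{"dependencies": {"next": "1.0.0"}}'
branch = '{"dependencies": {"next": "1.0.0", "left-pad": "1.0.0", "zod": "1.0.0"}}'
missing = unflagged_dependencies(base, branch, "Adds zod for input validation.")
```

A non-empty result does not block the dependency; it blocks the merge until someone has explained why it is there.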

A pragmatic rollout for a real team

If you're introducing this to a team that ships to production, don't flip everything on at once. The order we recommend:

1. Week 1: Write a real AGENTS.md and wire your existing CI as the gate. No new tools, just structure.
2. Week 2: Use agents on low-risk, well-tested areas such as refactors, test coverage, docs, and migrations with good rollback. Build trust where mistakes are cheap.
3. Weeks 3-4: Expand to feature work behind the same plan → execute → verify → review loop, with a human on the loop at review.
4. Ongoing: Add observability before, not after, you put any agent in front of users. Set budget caps. Once a month, review what the agents actually did and tighten AGENTS.md as you learn.

“The teams that look calm about AI coding in 2026 aren't using better models than you. They built the loop — the spec, the gate, the review — before they scaled the speed.”

The bottom line

Agentic coding is the most leverage we've had as builders in a decade, and it's also the easiest way yet to generate a mess at scale. The difference is entirely in the harness around the model — the plan you make it write, the sandbox you run it in, the CI that gates it, the human who reviews intent, and the observability that lets you sleep at night. The model is the engine. This is the chassis, the brakes and the seatbelt.

We do this for a living — standing up agentic workflows on real codebases, and cleaning up after the ones that went in without a harness. If your team is somewhere on this curve and wants a candid second opinion on your setup, send us what you've got. We'll send back an honest read, no pitch, free.

