It's 2026. "AI agent" has gone from a hot research term to something every founder we talk to wants in their product. Most of them have seen the same flashy demo (an agent that books a flight, an agent that codes a feature) and want one of their own. After shipping roughly a dozen production agents in the last six months, here's what we've learned about which architectures survive real users and which collapse on contact.
The TL;DR
Pure autonomy is a trap. The agents that ship and stay shipped are narrow, instrumented, recoverable, and have a clear escape hatch to a human. Five patterns below.
1. The Single-Tool Specialist
The simplest pattern: one LLM call wrapped around one well-defined tool. A pricing agent that classifies a request and calls one of three pricing endpoints. A scheduling agent that proposes a time given calendar availability. No loop, no planning step, no multi-tool orchestration.
It feels too simple to call an "agent". It is. Which is why it works. Latency is predictable, costs are bounded, and failures are debuggable.
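Here is a minimal sketch of the pattern in Python. The three endpoint functions, the tier names, the prices and the `classify` callable are all made up for illustration; `classify` stands in for the single LLM call and is injected so it can be stubbed in tests.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical pricing endpoints; in a real system these would be service calls.
def quote_standard(seats: int) -> float:
    return 10.0 * seats

def quote_volume(seats: int) -> float:
    return 8.0 * seats

def quote_enterprise(seats: int) -> float:
    return 6.0 * seats

ENDPOINTS: dict[str, Callable[[int], float]] = {
    "standard": quote_standard,
    "volume": quote_volume,
    "enterprise": quote_enterprise,
}

@dataclass
class Classification:
    tier: str   # expected to be one of the ENDPOINTS keys
    seats: int  # parameter extracted from the request text

def handle_pricing_request(
    text: str, classify: Callable[[str], Classification]
) -> float:
    """One model call, one tool call, no loop."""
    result = classify(text)  # the single LLM call
    # Reject anything outside the known tiers instead of guessing.
    if result.tier not in ENDPOINTS or result.seats <= 0:
        raise ValueError(f"unusable classification: {result}")
    return ENDPOINTS[result.tier](result.seats)

# Stub standing in for the model, for local testing only.
def fake_classify(text: str) -> Classification:
    return Classification(tier="volume", seats=50)

quote = handle_pricing_request("Quote for 50 seats please", fake_classify)
```

Because the model can only pick a key from a fixed dictionary, the worst case is a rejected request, which is a large part of why failures stay debuggable.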
When to use
- The task has a single dominant action with parameter extraction
- Latency matters more than flexibility
- You're replacing a form or a rules engine, not orchestrating a workflow
2. The Tool-Picking Router
One LLM call decides which of N tools to invoke, with structured output enforcing the choice. The selected tool runs. A second LLM call (optional) summarises the result for the user. No looping back into the planner.
This is what most production "copilots" actually are. Cursor, Linear's AI, GitHub Copilot Chat — strip away the marketing and they're mostly routers with tight tool catalogs. The router is testable, evals are easy, hallucinations are bounded.
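A router of this shape can be sketched in a few lines. Everything named here is an assumption for illustration: the two tools, the JSON shape `{"tool": ..., "args": {...}}`, and the `choose` callable that stands in for the routing LLM call. The point is that the model's output is validated against the catalog before anything runs.

```python
import json
from typing import Callable

# Hypothetical tools for illustration.
def search_issues(query: str) -> str:
    return f"3 issues match '{query}'"

def create_issue(title: str) -> str:
    return f"created issue '{title}'"

# Catalog: tool name -> (required argument names, implementation).
TOOLS: dict[str, tuple[set[str], Callable[..., str]]] = {
    "search_issues": ({"query"}, search_issues),
    "create_issue": ({"title"}, create_issue),
}

def parse_choice(raw: str) -> tuple[str, dict]:
    """Validate the router's JSON output against the catalog."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("router output must be a JSON object")
    tool, args = data.get("tool"), data.get("args", {})
    if tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    required, _ = TOOLS[tool]
    if not isinstance(args, dict) or set(args) != required:
        raise ValueError(f"bad args for {tool}: {args!r}")
    return tool, args

def route(user_input: str, choose: Callable[[str], str]) -> str:
    """One routing call, one tool run, no loop back into the planner."""
    tool, args = parse_choice(choose(user_input))
    _, impl = TOOLS[tool]
    return impl(**args)

# Stub standing in for the routing model.
def fake_choose(text: str) -> str:
    return json.dumps({"tool": "search_issues", "args": {"query": text}})

answer = route("login bug", fake_choose)
```

A hallucinated tool name or a stray argument fails at `parse_choice`, so the blast radius of a bad model output is one rejected request.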
Eval tip
Build a hand-labelled dataset of 100–300 (input, correct tool, correct args) triples. Re-run on every prompt or model change. You'll catch regressions within minutes that would otherwise hit production.
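A harness for this can be very small. The sketch below assumes the router is exposed as a function returning a `(tool, args)` pair; the `Example` type, the keyword-based stub router and the four sample rows are hypothetical stand-ins for a real router and a real labelled dataset.

```python
from typing import Callable, NamedTuple

class Example(NamedTuple):
    text: str   # user input
    tool: str   # correct tool
    args: dict  # correct arguments

def run_eval(
    router: Callable[[str], tuple[str, dict]], dataset: list[Example]
) -> tuple[float, list]:
    """Return (accuracy, failures). A case passes only if tool and args both match."""
    failures = []
    for ex in dataset:
        try:
            predicted = router(ex.text)
        except Exception as err:  # a crash is a failed case, not a failed run
            predicted = ("<error>", {"message": str(err)})
        if predicted != (ex.tool, ex.args):
            failures.append((ex, predicted))
    accuracy = 1 - len(failures) / len(dataset) if dataset else 0.0
    return accuracy, failures

# Stub router for demonstration; a real one would call the model.
def keyword_router(text: str) -> tuple[str, dict]:
    if "refund" in text.lower():
        return ("issue_refund", {"order_id": text.split()[-1]})
    return ("search_docs", {"query": text})

DATASET = [
    Example("refund order A17", "issue_refund", {"order_id": "A17"}),
    Example("Refund for B42", "issue_refund", {"order_id": "B42"}),
    Example("reset my password", "search_docs", {"query": "reset my password"}),
    Example("cancel my subscription", "cancel_subscription", {}),
]

accuracy, failures = run_eval(keyword_router, DATASET)
for ex, predicted in failures:
    print(f"FAIL {ex.text!r}: expected {ex.tool}, got {predicted[0]}")
```

In CI, one option is to fail the build when `accuracy` drops below the last accepted value, and to print the failing rows so the regression is obvious from the log.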
3. The Bounded Loop Worker
An LLM in a loop with tools, but with three hard guards: a maximum step count (usually 6–8), a budget cap (in both tokens and dollars), and a clear stop condition (success, failure, or escalation). No infinite agentic dreaming.
Most of the "AI agents" you actually want to ship in 2026 are this pattern. A support agent that resolves a ticket end-to-end. A data analyst that answers an ad-hoc business question. A code-fix agent that iterates on a failing test. The guardrails turn an exciting research demo into a system you can put your name on.
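The three guards fit in one short function. This is a sketch under stated assumptions: `plan` stands in for the LLM call and returns a hypothetical `Step` record, the default limits are arbitrary, and the flat `usd_per_1k_tokens` price is invented so the dollar cap has something to check against.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    action: str       # "tool", "done", or "escalate"
    payload: str      # tool input, final answer, or escalation reason
    tokens_used: int  # tokens spent on the model call that produced this step

@dataclass
class Outcome:
    status: str       # "success", "escalated", or "failed"
    detail: str
    steps: int
    tokens: int

def run_bounded(
    task: str,
    plan: Callable[[list[str]], Step],  # stand-in for the LLM call
    run_tool: Callable[[str], str],
    max_steps: int = 8,
    max_tokens: int = 20_000,
    max_usd: float = 0.50,
    usd_per_1k_tokens: float = 0.01,    # assumed flat price, illustration only
) -> Outcome:
    history = [task]
    tokens = 0
    for step_no in range(1, max_steps + 1):
        step = plan(history)
        tokens += step.tokens_used
        # Guard: budget cap, checked in both tokens and dollars.
        if tokens > max_tokens or tokens / 1000 * usd_per_1k_tokens > max_usd:
            return Outcome("failed", "budget exceeded", step_no, tokens)
        # Guard: explicit stop conditions.
        if step.action == "done":
            return Outcome("success", step.payload, step_no, tokens)
        if step.action == "escalate":
            return Outcome("escalated", step.payload, step_no, tokens)
        if step.action != "tool":
            return Outcome("failed", f"unknown action {step.action!r}", step_no, tokens)
        history.append(run_tool(step.payload))
    # Guard: step limit reached without a stop condition.
    return Outcome("failed", "max steps reached", max_steps, tokens)

# Scripted planner for demonstration: two tool calls, then done.
def scripted_plan(history: list[str]) -> Step:
    if len(history) < 3:
        return Step("tool", f"lookup {len(history)}", 500)
    return Step("done", "ticket resolved", 500)

def echo_tool(tool_input: str) -> str:
    return f"result of {tool_input}"

outcome = run_bounded("resolve ticket", scripted_plan, echo_tool)
```

Every exit path returns an `Outcome` with a status, so the caller never has to guess whether the agent finished, gave up, or ran out of budget.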
4. The Human-in-the-Loop Pair
The agent does 80% of the work; a human approves, edits or rejects before anything irreversible happens. Refund approvals. Outbound emails. Database migrations. Anything that's a one-way door.
The pattern looks like a productivity win, not a thrilling autonomous demo, which is why so many teams skip it and regret it three months later. The right metric for these agents is "time saved per task" multiplied by "approval rate", not "tasks completed autonomously".
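A minimal sketch of the approval gate and the metric, with hypothetical names throughout. The `review` callable stands in for whatever UI the human uses, and `value_score` takes the approval counts as inputs because whether an edited proposal counts as an approval is a choice each team has to make.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional

class Decision(Enum):
    APPROVE = "approve"
    EDIT = "edit"
    REJECT = "reject"

@dataclass
class Proposal:
    kind: str  # e.g. "refund" or "outbound_email"
    body: str  # what the agent wants to do or send

@dataclass
class Review:
    decision: Decision
    edited_body: Optional[str] = None  # required when decision is EDIT

def execute_with_approval(
    proposal: Proposal,
    review: Callable[[Proposal], Review],  # the human step
    execute: Callable[[Proposal], str],    # the irreversible action
) -> Optional[str]:
    """Nothing irreversible runs until a human has approved or edited it."""
    verdict = review(proposal)
    if verdict.decision is Decision.REJECT:
        return None
    if verdict.decision is Decision.EDIT:
        if verdict.edited_body is None:
            raise ValueError("EDIT decision needs an edited_body")
        proposal = Proposal(proposal.kind, verdict.edited_body)
    return execute(proposal)

def value_score(minutes_saved_per_task: float, approved: int, reviewed: int) -> float:
    """Time saved per task multiplied by approval rate."""
    if reviewed == 0:
        return 0.0
    return minutes_saved_per_task * (approved / reviewed)

# Stubs for demonstration.
sent = execute_with_approval(
    Proposal("outbound_email", "Hi, your refund is on its way."),
    review=lambda p: Review(Decision.APPROVE),
    execute=lambda p: f"sent: {p.body}",
)
score = value_score(minutes_saved_per_task=12.0, approved=45, reviewed=60)
```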
5. The Multi-Agent Specialist Team
Multiple narrow agents (the patterns above) coordinated by a thin orchestrator. A research agent hands its output to a synthesis agent, which hands its output to a formatter agent. Each is independently testable; the orchestrator handles routing and failure.
This is the most overrated pattern. It feels architecturally elegant and is almost never the right first move. Build the single agent first. Add specialists only when you have measured a specific bottleneck the specialist actually fixes.
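For scale, this is roughly how thin the orchestrator can be when you do need one. The sketch assumes each specialist is a plain string-to-string function and that the stages run in a fixed order; the three stub agents and the `StageFailed` exception are hypothetical.

```python
from typing import Callable

Agent = Callable[[str], str]

class StageFailed(Exception):
    """Raised with the name of the stage that broke, so failures are attributable."""
    def __init__(self, stage: str, cause: Exception):
        super().__init__(f"{stage} failed: {cause}")
        self.stage = stage

def orchestrate(task: str, stages: list[tuple[str, Agent]]) -> str:
    """Run stages in order, feeding each output to the next; stop at the first failure."""
    current = task
    for name, agent in stages:
        try:
            current = agent(current)
        except Exception as err:
            raise StageFailed(name, err) from err
    return current

# Stub specialists; each would be its own narrow agent in practice.
def research(question: str) -> str:
    return f"notes on {question}"

def synthesize(notes: str) -> str:
    return f"summary of {notes}"

def format_report(summary: str) -> str:
    return summary.upper()

report = orchestrate(
    "churn",
    [("research", research), ("synthesis", synthesize), ("format", format_report)],
)
```

If the orchestrator needs much more logic than this, that is usually a sign the specialists are not as independent as the diagram suggests.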
Patterns that consistently fail
- Free-roam multi-agent "teams" with no clear objective: cool demo, $400 cloud bill, no shipped feature
- Agents with mutable long-term memory: the memory drifts, the agent gets weird, you spend three weeks debugging
- Self-improving agents that rewrite their own prompts: sometimes works for a week, then catastrophically diverges
- No telemetry: the agent silently degrades for a month before someone notices users hate it
What we instrument on every agent we ship
- Per-request cost in tokens and dollars, bucketed by user and feature
- Tool-call success rate — the number-one early signal of degradation
- Loop depth distribution — if 10% of requests hit max-steps, your guardrails are doing real work and your prompts need attention
- Escape-hatch rate — % of conversations that hand off to a human; trending up means the agent is losing ground
- End-task satisfaction — a single thumbs-up/down at the end of a multi-step session is worth more than fancy NPS
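As a sketch of how the list above might roll up from per-request logs: the `RequestLog` fields, the metric names and the sample rows are all invented for illustration, and a real setup would more likely emit these to a metrics system than compute them in a batch.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class RequestLog:
    user: str
    feature: str
    tokens: int
    usd: float
    tool_calls: int
    tool_failures: int
    loop_depth: int             # steps the agent took on this request
    handed_off: bool            # True if the session escalated to a human
    thumbs_up: Optional[bool]   # None when the user gave no rating

def summarize(logs: list[RequestLog], max_steps: int = 8) -> dict:
    if not logs:
        return {}
    n = len(logs)
    calls = sum(r.tool_calls for r in logs)
    failures = sum(r.tool_failures for r in logs)
    rated = [r for r in logs if r.thumbs_up is not None]
    usd_by_bucket: dict = defaultdict(float)
    tokens_by_bucket: dict = defaultdict(int)
    for r in logs:
        usd_by_bucket[(r.user, r.feature)] += r.usd
        tokens_by_bucket[(r.user, r.feature)] += r.tokens
    return {
        "usd_by_user_feature": dict(usd_by_bucket),
        "tokens_by_user_feature": dict(tokens_by_bucket),
        "tool_call_success_rate": 1 - failures / calls if calls else None,
        "share_hitting_max_steps": sum(r.loop_depth >= max_steps for r in logs) / n,
        "escape_hatch_rate": sum(r.handed_off for r in logs) / n,
        "thumbs_up_rate": sum(r.thumbs_up for r in rated) / len(rated) if rated else None,
    }

# Invented sample rows.
LOGS = [
    RequestLog("u1", "support", 1200, 0.02, 4, 0, 3, False, True),
    RequestLog("u1", "support", 3000, 0.05, 6, 3, 8, True, False),
    RequestLog("u2", "analytics", 800, 0.01, 2, 0, 2, False, None),
    RequestLog("u2", "support", 1500, 0.03, 4, 0, 4, False, True),
]

summary = summarize(LOGS)
```

Tracking each number as a trend over time matters more than any single value, since most of the signals above are only meaningful relative to last week.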
Picking the right pattern for your product
Start with pattern 1. If a single tool plus parameter extraction solves the task, ship it. Move to pattern 2 when there are clearly multiple distinct tools to choose between. Move to pattern 3 only when the task genuinely requires iteration. Pattern 4 for anything irreversible. Pattern 5 last, and only when measurements justify the complexity.
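The same decision order, written down as a function. This is one reading of the paragraph above, not a rule: the parameter names are invented, and it treats pattern 4 as an approval gate layered on top of whichever base pattern is chosen.

```python
def pick_patterns(
    distinct_tools: int,
    needs_iteration: bool,
    irreversible: bool,
    measured_bottleneck: bool = False,
) -> list[int]:
    """Return the pattern numbers to use, smallest viable pattern first."""
    if measured_bottleneck:
        base = 5  # only when measurements justify the complexity
    elif needs_iteration:
        base = 3  # the task genuinely requires a loop
    elif distinct_tools > 1:
        base = 2  # several distinct tools to choose between
    else:
        base = 1  # one tool plus parameter extraction
    # Assumption: pattern 4 is added whenever the action is a one-way door.
    return [base, 4] if irreversible else [base]

choice = pick_patterns(distinct_tools=3, needs_iteration=True, irreversible=True)
```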
If you're scoping an agent and want a second opinion before you commit two months of engineering — send us a paragraph about the task and we'll point you at the smallest pattern that can do the job.