If you've shipped any meaningful AI feature in 2025–2026, you've had the cost conversation. A demo costs cents. A real user with a real conversation costs dollars. A viral tweet costs a small mortgage. We've been called in to rescue half a dozen AI products in the last six months where the bill was eating the runway. Here's the playbook we walk every team through.
Up front
Cost optimisation is not premature optimisation when you're paying for tokens. Build the guardrails on day one. Refactoring an AI app to be cheap, after shipping, is far harder than building it cheap from the start.
1. Instrument cost from the first line of code
You cannot manage what you cannot measure. The major AI providers return per-request token counts in the response. Capture them. Tag every call with: user id, feature, model, prompt template id, output length. Pipe the data to your existing analytics (Mixpanel, PostHog, BigQuery) or a dedicated tool like Helicone.
Within a week of going live, this dataset will tell you the truth about where your money is going: usually 5–10% of users account for 50–80% of the cost.
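A minimal sketch of what that capture could look like, assuming you copy the token counts out of each provider response yourself. The `CallRecord` fields mirror the tags listed above; the prices in `PRICES_PER_MTOK` and the model labels are placeholders, not real rates.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute your provider's real rates.
PRICES_PER_MTOK = {
    "small": {"input": 0.25, "output": 1.25},
    "frontier": {"input": 5.00, "output": 25.00},
}


@dataclass(frozen=True)
class CallRecord:
    user_id: str
    feature: str
    model: str
    template_id: str
    input_tokens: int
    output_tokens: int

    @property
    def cost_usd(self) -> float:
        price = PRICES_PER_MTOK[self.model]
        return (self.input_tokens * price["input"]
                + self.output_tokens * price["output"]) / 1_000_000


def cost_by(records, key):
    """Sum cost over records, grouped by an attribute name such as 'user_id'."""
    totals = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record.cost_usd
    return dict(totals)


def top_spender_share(records, fraction=0.1):
    """Share of total cost that comes from the top `fraction` of users."""
    per_user = sorted(cost_by(records, "user_id").values(), reverse=True)
    if not per_user:
        return 0.0
    n = max(1, int(len(per_user) * fraction))  # always count at least one user
    total = sum(per_user)
    return sum(per_user[:n]) / total if total else 0.0
```

Calling `cost_by(records, "feature")` or `top_spender_share(records)` on a week of records is one way to check how concentrated your spend really is.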
2. Cache aggressively
The cheapest token is the one you don't send. Three layers:
- Provider prompt caching: OpenAI, Anthropic and Google all support caching the static portion of a prompt (system prompt + retrieval context). Savings of 50–90% on the cached prefix, and cache hits typically reduce latency too
- Response caching: deterministic queries (same input, same output expected) should never re-hit the model. Use semantic-similarity caching for near-duplicates
- Embedding caching: hash the text, store the embedding. Embeddings are cheap individually and devastating in aggregate when recomputed
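The response and embedding layers can share one exact-match cache. The sketch below keeps entries in an in-process dict for illustration; in production you would more likely back it with Redis and a TTL. `HashCache` and `fake_embed` are hypothetical names, and semantic-similarity matching for near-duplicates is not covered here.

```python
import hashlib
import json


class HashCache:
    """Exact-match cache keyed by the SHA-256 of the JSON-encoded arguments."""

    def __init__(self, compute):
        self._compute = compute  # the expensive call, e.g. a model or embedding request
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(*parts):
        payload = json.dumps(parts, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, *parts):
        key = self._key(*parts)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._compute(*parts)
        return self._store[key]


def fake_embed(text):
    """Stand-in for a real embedding API call."""
    return [float(len(text))]


embeddings = HashCache(fake_embed)
embeddings.get("hello")
embeddings.get("hello")  # served from the cache, no second call
print(embeddings.hits, embeddings.misses)  # prints: 1 1
```

For response caching, pass every input that affects the output (model name, prompt template id, user text) to `get`, so a change in any of them produces a different key.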
3. Route by complexity, not by "the best model"
Don't send a one-line classification to your $$$$$ frontier model. Don't send your hardest reasoning task to your cheapest model either. Build a tiny router that picks model by task complexity:
- Classification, formatting, simple extraction → small fast model (Haiku, GPT-mini, Gemini Flash)
- Standard chat, summarisation, drafting → mid-tier (Sonnet, GPT, Gemini)
- Complex reasoning, code, long-context multi-step → frontier (Opus, GPT Pro, Gemini Ultra)
We routinely see 30–50% cost reductions from this one move alone, with no perceptible drop in user-facing quality.
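A router like this can start as a lookup table. The sketch below assumes each call site already knows its task type; the task labels, the `Tier` names and the choice to default unknown tasks to the mid tier are illustrative assumptions, not a prescribed scheme.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"
    MID = "mid"
    FRONTIER = "frontier"


# Hypothetical task labels, following the three tiers described above.
TASK_TIERS = {
    "classification": Tier.SMALL,
    "formatting": Tier.SMALL,
    "extraction": Tier.SMALL,
    "chat": Tier.MID,
    "summarisation": Tier.MID,
    "drafting": Tier.MID,
    "reasoning": Tier.FRONTIER,
    "code": Tier.FRONTIER,
    "long_context": Tier.FRONTIER,
}


def pick_tier(task_type: str) -> Tier:
    """Map a task type to a model tier; unknown tasks fall back to the mid tier."""
    return TASK_TIERS.get(task_type, Tier.MID)
```

Keeping the mapping in one table means a pricing change or a new model is a one-line edit rather than a hunt through call sites.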
4. Set hard budget caps, per user and per endpoint
Real story
A seed-stage SaaS we helped had an AI summarisation feature with no per-user cap. One user wrote a script that hit it 40,000 times overnight (a genuine use case: they wanted summaries of every paper on arXiv). That came to $3,800 in a single evening. A per-user cap would have stopped it at $20.
Soft limits warn users approaching their cap. Hard limits return a friendly error with an upgrade path. Both go in on day one. The math is too painful otherwise.
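A sketch of the soft/hard check as it might sit in middleware, assuming you can estimate a call's cost before making it. `BudgetGuard`, the 80% soft threshold and the key format are hypothetical; a real deployment would keep the counters in shared storage and reset them per billing period.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetGuard:
    hard_cap_usd: float
    soft_fraction: float = 0.8  # warn once 80% of the cap would be used
    spent: dict = field(default_factory=dict)

    def check(self, key, estimated_cost_usd):
        """Return 'ok', 'warn' or 'blocked' for a call that is about to be made."""
        projected = self.spent.get(key, 0.0) + estimated_cost_usd
        if projected > self.hard_cap_usd:
            return "blocked"
        if projected >= self.hard_cap_usd * self.soft_fraction:
            return "warn"
        return "ok"

    def record(self, key, actual_cost_usd):
        """Add the real cost of a completed call to the running total."""
        self.spent[key] = self.spent.get(key, 0.0) + actual_cost_usd


per_user = BudgetGuard(hard_cap_usd=20.0)
key = ("user", "u_123")  # use ("endpoint", name) keys in a second guard for endpoint caps
if per_user.check(key, 0.095) != "blocked":
    per_user.record(key, 0.095)
```

With the numbers from the story above ($3,800 over 40,000 calls, so about $0.095 per call), a $20 hard cap blocks the script after 210 calls.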
5. Prefer RAG to long context windows (almost always)
Modern frontier models offer 1M+ token context windows. It is tempting to just stuff the entire knowledge base in. Wrong move: you pay for every token on every request, and models tend to use information buried in the middle of a very long context less reliably than information near the start or end.
Retrieval-augmented generation pulls only the relevant 2–8 chunks per query. Cost goes down by an order of magnitude. Quality goes up because the model is focused. Latency goes down because there are fewer tokens to process.
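The retrieval step and the token arithmetic can be sketched in a few lines. This assumes chunk embeddings have already been computed and stored; a real system would use a vector index rather than a linear scan, and the function names and numbers below are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, chunks, k=4):
    """chunks is a list of (text, vector); return the k texts closest to the query."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def input_token_ratio(kb_tokens, chunk_tokens, k, overhead_tokens=0):
    """How many times more input tokens full-context stuffing sends than RAG."""
    return (kb_tokens + overhead_tokens) / (k * chunk_tokens + overhead_tokens)
```

As a hypothetical example, a 200,000-token knowledge base against four 500-token chunks gives `input_token_ratio(200_000, 500, 4) == 100.0`, before any provider caching discount is applied.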
6. Stream everything, render progressively
Streaming doesn't change your token spend, but it changes the experience users tolerate. Users will wait around 8 seconds for streaming output but tend to abandon after about 3 seconds of spinner. Spinner-on-spinner UX gets you support tickets that cost more to resolve than the inference itself.
7. Pre-compute and warm-cache for predictable workloads
If your nightly job summarises every new ticket from the day, run it at 3am when the queue is empty, not at 9am when your support team logs in. If your "weekly digest" email needs an AI summary, generate it once and send that one summary to all N users rather than generating it N times.
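The digest case comes down to hoisting the model call out of the per-recipient loop. In this sketch `summarise` and `send` are hypothetical callables you supply (your model call and your email sender).

```python
def send_digest(recipients, source_text, summarise, send):
    """Summarise once, then deliver the same summary to every recipient."""
    summary = summarise(source_text)  # one model call, regardless of len(recipients)
    for recipient in recipients:
        send(recipient, summary)
    return summary
```

If parts of the digest are personalised, the same idea still applies: generate the shared section once and only call the model per user for the part that actually differs.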
8. Distill (when, and only when, it pays off)
For very high-volume narrow tasks (classifying 10M emails/day; extracting fields from 100K invoices/week), fine-tuning a small model on outputs from a big one can drop per-call cost 10–40x. Not worth it for most teams. Hugely worth it for the right shape of problem.
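Whether distillation pays off is mostly arithmetic. The sketch below assumes you can estimate a one-off cost (data generation, fine-tuning, evaluation, engineering time) and that the small model's per-call cost already includes its hosting; all parameter names and the example numbers are hypothetical.

```python
def monthly_saving_usd(calls_per_month, teacher_cost_per_call, cost_reduction):
    """Monthly saving if the distilled model is `cost_reduction` times cheaper per call."""
    student_cost_per_call = teacher_cost_per_call / cost_reduction
    return calls_per_month * (teacher_cost_per_call - student_cost_per_call)


def months_to_break_even(one_off_cost_usd, calls_per_month,
                         teacher_cost_per_call, cost_reduction):
    """Months until the one-off distillation cost is recovered by per-call savings."""
    saving = monthly_saving_usd(calls_per_month, teacher_cost_per_call, cost_reduction)
    if saving <= 0:
        return float("inf")  # no saving means the project never pays back
    return one_off_cost_usd / saving
```

For example, at 1,000,000 calls a month, $0.01 per call on the large model and a 10x reduction, the saving is about $9,000 a month, so an $18,000 project breaks even in roughly two months. At a tenth of that volume the same project takes about twenty months, which is why volume decides this one.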
Patterns that don't actually save money
- Switching to open-source models prematurely — hosting overhead and engineering time often dominate token savings
- Compressing prompts beyond the point of clarity — you save 20% in input tokens and double your error rate
- Building "cheap mode" toggles that users never select — most users default to whatever's easiest
- Off-loading to a cheaper model for everything — quality drops, retention drops, you save money on a product nobody uses
Our 2026 default architecture for cost-aware AI features
- Provider prompt caching on every static system prompt
- Per-user + per-endpoint budget caps as middleware
- Three-tier model routing by task complexity
- Response and embedding caches in Redis with sensible TTLs
- RAG by default; large context only when measurably better
- Full per-request observability (cost, latency, model, tokens, success) from day one
- Cost dashboards reviewed weekly, not when the invoice shocks someone
TL;DR
- Measure cost per request from day one, tagged by user and feature
- Cache: prompt, response, embedding
- Route by complexity — don't send simple tasks to expensive models
- Hard budget caps per user and per endpoint
- RAG beats stuffing the full knowledge base into the context window for almost everyone
- Stream output; perceived latency is half the battle
- Open-source / fine-tuning are the LAST levers, not the first
We rescue AI features that have outgrown their budget for a living. If your AI bill is climbing faster than your usage — or you're scoping a new feature and want to ship it cheap from day one — drop us a line. We'll send back a candid take, free.