AI Feature Cost Optimisation Playbook · 2026

If you've shipped any meaningful AI feature in 2025–2026, you've had the cost conversation. A demo costs cents. A real user with a real conversation costs dollars. A viral tweet costs a small mortgage. We've been called in to rescue half a dozen AI products in the last six months where the bill was eating the runway. Here's the playbook we walk every team through.

Up front

Cost optimisation is not premature optimisation when you're paying for tokens. Build the guardrails on day one. Refactoring an AI app to be cheap, after shipping, is far harder than building it cheap from the start.

1. Instrument cost from the first line of code

You cannot manage what you cannot measure. The major providers return per-request token counts (input and output) in the response. Capture them. Tag every call with: user id, feature, model, prompt template id, input and output token counts. Pipe to your existing analytics (Mixpanel, PostHog, BigQuery) or a dedicated tool like Helicone.

Within a week of going live, this dataset will tell you the truth about where your money is going: usually 5–10% of users account for 50–80% of the cost.
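A minimal sketch of that tagging, in Python. The price table, the `CallRecord` fields and the `record_call` helper are illustrative assumptions, not any provider's API; plug in the token counts your provider returns and your own current prices.

```python
from collections import defaultdict
from dataclasses import asdict, dataclass

# Hypothetical prices in USD per million tokens; replace with real rates.
PRICES_PER_MTOK = {
    "small": {"input": 0.25, "output": 1.25},
    "frontier": {"input": 15.00, "output": 75.00},
}


@dataclass
class CallRecord:
    user_id: str
    feature: str
    model: str
    template_id: str
    input_tokens: int
    output_tokens: int
    cost_usd: float = 0.0


def record_call(user_id, feature, model, template_id, input_tokens, output_tokens, sink):
    """Compute the cost of one call and append a tagged record to `sink`."""
    price = PRICES_PER_MTOK[model]
    cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    rec = CallRecord(user_id, feature, model, template_id, input_tokens, output_tokens, cost)
    sink.append(asdict(rec))  # in production, send to your analytics pipeline instead
    return rec


def cost_by_user(records):
    """Total spend per user, most expensive first."""
    totals = defaultdict(float)
    for r in records:
        totals[r["user_id"]] += r["cost_usd"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


log = []
record_call("u1", "summarise", "frontier", "sum-v2", 2_000, 500, log)
record_call("u2", "classify", "small", "cls-v1", 400, 10, log)
top_user, top_cost = cost_by_user(log)[0]
```

Grouping the same records by feature or template id instead of user id answers the other two questions you will be asked in the first cost review.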

2. Cache aggressively

The cheapest token is the one you don't send. Three layers:

Provider prompt caching: OpenAI, Anthropic and Google all support caching the static prefix of a prompt (system prompt plus any retrieval context that stays fixed across requests). Savings of 50–90% on the cached prefix, plus lower latency on cache hits. The exact discount, minimum prefix length and cache lifetime vary by provider
Response caching: deterministic queries (same input, same output expected) should never re-hit the model. Use semantic-similarity caching for near-duplicates
Embedding caching: hash the text, store the embedding. Embeddings are cheap individually and devastating in aggregate when recomputed
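The response and embedding layers can be sketched with one hash-keyed cache. This is a minimal sketch under stated assumptions: the in-memory dict stands in for a real store, `call_model` and `embed` are hypothetical stand-ins for provider calls, and only exact-match caching is shown (semantic-similarity matching is not).

```python
import hashlib
import json


class HashCache:
    """In-memory cache keyed by a SHA-256 hash of the request (stand-in for a real store)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(**parts):
        # Canonical JSON so identical inputs always produce the same key.
        payload = json.dumps(parts, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, compute, **parts):
        k = self.key(**parts)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        value = compute(**parts)
        self._store[k] = value
        return value


# Hypothetical stand-ins for real provider calls.
def call_model(model, prompt):
    return f"[{model}] answer to: {prompt}"


def embed(model, text):
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


responses = HashCache()
embeddings = HashCache()

a = responses.get_or_compute(call_model, model="small", prompt="Classify: refund request")
b = responses.get_or_compute(call_model, model="small", prompt="Classify: refund request")
v1 = embeddings.get_or_compute(embed, model="embed-v1", text="hello world")
v2 = embeddings.get_or_compute(embed, model="embed-v1", text="hello world")
```

Including the model name in the key means a model upgrade naturally invalidates old entries instead of serving stale answers or mismatched embeddings.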

3. Route by complexity, not by "the best model"

Don't send a one-line classification to your $$$$$ frontier model. Don't send your hardest reasoning task to your cheapest model either. Build a tiny router that picks model by task complexity:

1. Classification, formatting, simple extraction → small fast model (Haiku, GPT-mini, Gemini Flash)
2. Standard chat, summarisation, drafting → mid-tier (Sonnet, GPT, Gemini)
3. Complex reasoning, code, long-context multi-step → frontier (Opus, GPT Pro, Gemini Ultra)

We routinely see 30–50% cost reductions from this one move alone, with no perceptible drop in user-facing quality.
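A tiny router along these lines might look like the sketch below. The task labels, the `LONG_CONTEXT_TOKENS` threshold and the choice to default unknown tasks to the mid tier are all assumptions for illustration; a real router should be tuned against your own quality evals.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"
    MID = "mid"
    FRONTIER = "frontier"


# Hypothetical task labels mapped onto the three tiers.
TASK_TIERS = {
    "classification": Tier.SMALL,
    "formatting": Tier.SMALL,
    "extraction": Tier.SMALL,
    "chat": Tier.MID,
    "summarisation": Tier.MID,
    "drafting": Tier.MID,
    "reasoning": Tier.FRONTIER,
    "code": Tier.FRONTIER,
}

LONG_CONTEXT_TOKENS = 50_000  # assumed threshold, not a recommendation


def route(task: str, input_tokens: int = 0) -> Tier:
    """Pick a tier from the task label, escalating very long inputs to frontier."""
    if input_tokens >= LONG_CONTEXT_TOKENS:
        return Tier.FRONTIER
    # Unknown tasks fall back to the mid tier rather than the most expensive one.
    return TASK_TIERS.get(task, Tier.MID)
```

For example, `route("classification")` returns `Tier.SMALL`, while `route("chat", input_tokens=80_000)` escalates to `Tier.FRONTIER`.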

4. Set hard budget caps, per user and per endpoint

Real story

A seed-stage SaaS we helped had an AI summarisation feature with no per-user cap. One user wrote a script that hit it 40,000 times overnight (genuine — they wanted summaries of every paper on arXiv). $3,800 in a single evening. Caps would have caught it at $20.

Soft limits warn users approaching their cap. Hard limits return a friendly error with an upgrade path. Both go in on day one. The math is too painful otherwise.
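As a rough sketch of soft and hard limits, the guard below replays the overnight script from the story. The `BudgetGuard` class is hypothetical, the in-memory dict would be a shared store in production, and the flat $0.095 per call is simply $3,800 divided by 40,000, which assumes every call cost the same.

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised when a call would push a user past the hard cap."""


class BudgetGuard:
    def __init__(self, hard_cap_usd: float, soft_ratio: float = 0.8):
        self.hard_cap = hard_cap_usd
        self.soft_cap = hard_cap_usd * soft_ratio
        self.spent = defaultdict(float)  # keyed by (user_id, endpoint)

    def charge(self, user_id: str, endpoint: str, cost_usd: float) -> str:
        """Record spend and return 'ok' or 'warn'; raise BudgetExceeded at the hard cap."""
        key = (user_id, endpoint)
        projected = self.spent[key] + cost_usd
        if projected > self.hard_cap:
            raise BudgetExceeded(
                f"{user_id} reached the limit for {endpoint}; upgrade to continue"
            )
        self.spent[key] = projected
        return "warn" if projected >= self.soft_cap else "ok"


guard = BudgetGuard(hard_cap_usd=20.00)
statuses = []
blocked_at = None
for i in range(40_000):
    try:
        statuses.append(guard.charge("u1", "summarise", 0.095))
    except BudgetExceeded:
        blocked_at = i  # index of the first rejected call
        break
```

Under these assumptions the script is stopped after 210 successful calls, with the warning state kicking in once spend passes $16.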

5. RAG over long context windows (almost always)

Several frontier models now offer context windows of 1M tokens or more. It's tempting to just stuff the entire knowledge base in. Wrong move: you pay for every token on every request, and models tend to use information less reliably the more of it there is, particularly when the relevant passage is buried in the middle of a long context.

Retrieval-augmented generation pulls only the relevant 2–8 chunks per query. Cost goes down by an order of magnitude. Quality goes up because the model is focused. Latency goes down because there are fewer tokens to process.
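The cost side of that claim is simple arithmetic. Every figure below (price, request volume, knowledge base size, chunk size) is an illustrative assumption, not a measurement, and the comparison ignores the cost of embedding and retrieval as well as any prompt-caching discount on the stuffed prompt.

```python
def monthly_input_cost(tokens_per_request, requests_per_month, usd_per_mtok):
    """Input-token cost per month in USD."""
    return tokens_per_request * requests_per_month * usd_per_mtok / 1_000_000


# Illustrative assumptions only.
USD_PER_MTOK = 3.00
REQUESTS = 100_000
KNOWLEDGE_BASE_TOKENS = 200_000
CHUNK_TOKENS = 500
CHUNKS_PER_QUERY = 8
QUESTION_TOKENS = 200

stuffed = monthly_input_cost(KNOWLEDGE_BASE_TOKENS + QUESTION_TOKENS, REQUESTS, USD_PER_MTOK)
rag = monthly_input_cost(CHUNK_TOKENS * CHUNKS_PER_QUERY + QUESTION_TOKENS, REQUESTS, USD_PER_MTOK)
ratio = stuffed / rag
```

With these numbers the stuffed prompt costs about $60,060 a month in input tokens against about $1,260 for retrieval, a gap of well over an order of magnitude.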

6. Stream everything, render progressively

Streaming doesn't change your token spend, but it changes the experience users tolerate. In our experience, users will wait 8 seconds for streaming output but abandon after 3 seconds of spinner. Spinner-on-spinner UX gets you support tickets that cost more to resolve than the inference itself.
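A minimal sketch of progressive rendering that also records time to first chunk, the number that tracks perceived latency. `fake_token_stream` is a hypothetical stand-in for a provider's streaming response, and `write` is whatever pushes text to your UI.

```python
import time
from typing import Callable, Iterable, Iterator


def fake_token_stream(text: str, delay_s: float = 0.0) -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming response."""
    for word in text.split(" "):
        time.sleep(delay_s)
        yield word + " "


def render_progressively(chunks: Iterable[str], write: Callable[[str], None]) -> dict:
    """Write each chunk as it arrives; report time to first chunk and total time."""
    start = time.perf_counter()
    first = None
    parts = []
    for chunk in chunks:
        if first is None:
            first = time.perf_counter() - start
        parts.append(chunk)
        write(chunk)
    total = time.perf_counter() - start
    return {"text": "".join(parts), "time_to_first_s": first, "total_s": total}


shown = []
stats = render_progressively(fake_token_stream("Streaming keeps users engaged"), shown.append)
```

Logging `time_to_first_s` alongside cost per request makes it easy to see whether a cheaper model or a longer prompt is quietly hurting the experience.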

7. Pre-compute and warm-cache for predictable workloads

If a job summarises every new ticket from the day, run it at 3am when the queue is empty, not at 9am when your support team logs in. If your "weekly digest" email needs an AI summary, generate it once and send the same summary to all N users rather than generating it N times.
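The "generate once, send to N" pattern can be sketched with a cache around the expensive call. `summarise_week` is a hypothetical stand-in for the model call, and the counter exists only to show how many times it runs.

```python
from functools import lru_cache

calls = {"model": 0}


def summarise_week(week_id: str) -> str:
    """Hypothetical stand-in for one expensive model call."""
    calls["model"] += 1
    return f"Digest for {week_id}"


@lru_cache(maxsize=None)
def weekly_digest(week_id: str) -> str:
    # Cached per week, so the model runs once however many recipients there are.
    return summarise_week(week_id)


def build_emails(week_id: str, recipients: list[str]) -> list[tuple[str, str]]:
    """Pair every recipient with the same pre-computed digest."""
    return [(address, weekly_digest(week_id)) for address in recipients]


emails = build_emails("2026-W07", ["a@example.com", "b@example.com", "c@example.com"])
```

Three emails go out and the model is called once. In a multi-process deployment the in-process cache would need to be a shared store, but the shape is the same.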

8. Distill (when, and only when, it pays off)

For very high-volume narrow tasks (classifying 10M emails/day; extracting fields from 100K invoices/week), fine-tuning a small model on outputs from a big one can drop per-call cost 10–40x. Not worth it for most teams. Hugely worth it for the right shape of problem.
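Whether distillation pays off is a break-even calculation. The figures below are illustrative assumptions ($20,000 of one-off engineering, labelling and training cost; $0.002 per call on the large model; 20x cheaper afterwards), and the sketch ignores ongoing hosting and retraining costs.

```python
import math


def breakeven_calls(one_off_cost_usd, big_cost_per_call, small_cost_per_call):
    """Calls needed before per-call savings repay the one-off distillation cost."""
    saving = big_cost_per_call - small_cost_per_call
    if saving <= 0:
        raise ValueError("the distilled model must be cheaper per call")
    return math.ceil(one_off_cost_usd / saving)


# Illustrative assumptions only.
calls_needed = breakeven_calls(20_000, 0.002, 0.0001)
days_at_10m_per_day = calls_needed / 10_000_000
days_at_10k_per_day = calls_needed / 10_000
```

Under these assumptions the investment repays itself in about a day at 10M calls/day and takes nearly three years at 10K calls/day, which is why volume decides the question.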

Patterns that don't actually save money

Switching to open-source models prematurely — hosting overhead and engineering time often dominate token savings
Compressing prompts beyond the point of clarity — you save 20% in input tokens and double your error rate
Building "cheap mode" toggles that users never select — most users default to whatever's easiest
Off-loading to a cheaper model for everything — quality drops, retention drops, you save money on a product nobody uses

Our 2026 default architecture for cost-aware AI features

1. Provider prompt caching on every static system prompt
2. Per-user + per-endpoint budget caps as middleware
3. Three-tier model routing by task complexity
4. Response and embedding caches in Redis with sensible TTLs
5. RAG by default; large context only when measurably better
6. Full per-request observability (cost, latency, model, tokens, success) from day one
7. Cost dashboards reviewed weekly, not when the invoice shocks someone

TL;DR

Measure cost per request from day one, tagged by user and feature
Cache: prompt, response, embedding
Route by complexity — don't send simple tasks to expensive models
Hard budget caps per user and per endpoint
RAG beats stuffing the full knowledge base into the context window for almost everyone
Stream output; perceived latency is half the battle
Open-source / fine-tuning are the LAST levers, not the first

We rescue AI features that have outgrown their budget for a living. If your AI bill is climbing faster than your usage — or you're scoping a new feature and want to ship it cheap from day one — drop us a line. We'll send back a candid take, free.
