Cutting LLM spend: caching, batching, and smaller models
Semantic vs exact cache, prompt compression, routing to smaller models, batch APIs, and metrics—with OpenAI and Azure references.
LLM cost scales roughly with tokens × price per million tokens. The levers below are the ones I try first before redesigning architecture—together they often cut spend 30–60% without a “big bang” rewrite.
1. Measure before optimizing
Track per route or per feature:
- Prompt + completion tokens per successful user task (not per HTTP call if you chain calls).
- p95 latency and error rate.
If you use Azure OpenAI, the monitoring and diagnostics settings expose token usage; OpenAI’s dashboard does the same for direct API usage. OpenAI publishes pricing—compare gpt-4o vs gpt-4o-mini on your own tasks.
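As a first pass, per-request spend is just the token formula above. A minimal sketch—the prices in the table are illustrative placeholders, not current list prices; check the provider’s pricing page before using them:

```python
# (input, output) USD per 1M tokens -- PLACEHOLDER values for illustration.
PRICE_PER_MTOK = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in USD for one call: tokens x price per million tokens."""
    in_price, out_price = PRICE_PER_MTOK[model]
    return (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000
```

Aggregating this per route (rather than per HTTP call) is what makes the “per successful user task” metric above meaningful.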
2. Exact vs semantic caching
| Approach | When it helps | Caveat |
|---|---|---|
| Exact cache (hash of full prompt) | Repeated identical prompts (templates, FAQs) | Misses on whitespace or dynamic dates |
| Cache of retrieved chunks (RAG) | Same documents asked many ways | Invalidate when source doc changes |
| Summary cache (hash of document content) | Long context reused across sessions | Summaries go stale when the doc is edited |
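An exact cache is a few lines; normalizing whitespace before hashing addresses the first caveat in the table. A sketch, assuming an in-process dict stands in for a real cache store:

```python
import hashlib

_cache: dict[str, str] = {}  # in production: Redis or similar, with a TTL

def cache_key(prompt: str) -> str:
    # Collapse whitespace so trivial formatting differences still hit;
    # dynamic fields (dates, user IDs) must be stripped by the caller.
    normalized = " ".join(prompt.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def cached_complete(prompt: str, call_model) -> str:
    key = cache_key(prompt)
    if key not in _cache:
        _cache[key] = call_model(prompt)  # only pay for a model call on a miss
    return _cache[key]
```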
For RAG, cache embedding vectors keyed by hash(document_version) so you do not re-embed unchanged PDFs—see OpenAI embeddings guide.
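The same idea for embeddings—keyed by a version identifier rather than run metadata. A sketch; `embed_fn` is a stand-in for whatever embeddings call you use:

```python
import hashlib

_embedding_cache: dict[str, list[float]] = {}

def embed_document(text: str, doc_version: str, embed_fn) -> list[float]:
    # Key on a stable version identifier (content hash, ETag, etc.) so
    # unchanged PDFs are never re-embedded across ingestion runs.
    key = hashlib.sha256(doc_version.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(text)
    return _embedding_cache[key]
```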
Prompt caching (provider-side): OpenAI documents prompt caching for supported models—large static prefixes can see discounted cached input tokens. Verify eligibility and behavior for your region and model.
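Provider-side prompt caching typically matches on a common prefix, so message ordering matters: static instructions first, volatile values last. A sketch of that ordering (the function name is mine):

```python
def build_messages(static_policy: str, user_query: str, today: str) -> list[dict]:
    # Static, reusable prefix first so a prefix-matching cache can hit it;
    # anything volatile (dates, per-user data) goes at the end.
    return [
        {"role": "system", "content": static_policy},
        {"role": "user", "content": f"Today: {today}\n\n{user_query}"},
    ]
```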
3. Batch when latency allows
Interactive chat rarely batches well. Offline jobs (tagging, summarization backfills) can use the Batch API where supported—often cheaper and more tolerant of rate limits.
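A batch job is usually a JSONL file of requests with a `custom_id` per line for joining results back. A sketch for an offline tagging backfill—the upload/submit calls assume the OpenAI Python SDK and a `client` instance, and are shown as comments since exact parameters may differ for your account; verify against the Batch API docs:

```python
import json

def build_batch_lines(texts: list[str], model: str = "gpt-4o-mini") -> list[str]:
    """One JSONL line per request; custom_id lets you join results later."""
    lines = []
    for i, text in enumerate(texts):
        lines.append(json.dumps({
            "custom_id": f"tag-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": f"Tag this text: {text}"}],
            },
        }))
    return lines

# Upload and submit (network calls, not run here):
# f = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
# client.batches.create(input_file_id=f.id, endpoint="/v1/chat/completions",
#                       completion_window="24h")
```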
4. Model routing
Use the smallest model that passes your eval bar for each subtask:
- Router pattern: a lightweight classifier or heuristic sends “easy” queries to gpt-4o-mini and “hard” ones to a larger model.
- Structured extraction often works on smaller models if the schema is strict and you validate outputs.
Pair routing with the evaluation loop in evaluating Copilot-style features—routing is useless if quality regresses silently.
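A heuristic router can be this simple to start. The signals, thresholds, and model names below are illustrative assumptions—replace them with whatever your eval set actually supports:

```python
def route_model(query: str, schema_required: bool) -> str:
    # Illustrative heuristic only: real signals and thresholds should come
    # from your own eval results, and model names from your deployments.
    hard_signals = ("explain why", "compare", "step by step", "prove")
    is_hard = len(query) > 400 or any(s in query.lower() for s in hard_signals)
    if schema_required or not is_hard:
        return "gpt-4o-mini"  # strict-schema extraction and easy queries
    return "gpt-4o"           # escalate long or reasoning-heavy queries
```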
5. Prompt discipline = token discipline
- Put instructions once in a system message; avoid repeating boilerplate per turn.
- Prefer JSON or typed outputs with a strict schema—fewer repair retries. OpenAI documents structured outputs for supported models.
- Trim irrelevant history in chat—sliding window or summarization of older turns.
Example (conceptual token saver):

```python
# Avoid: pasting the same 2k-token policy in every user message
# Prefer: system message with policy + short user turns
```
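For the sliding-window point above, a minimal trimmer. Counting turns is the simple form; a token budget (e.g. via a tokenizer like tiktoken) is tighter but this shows the shape:

```python
def trim_history(messages: list[dict], max_recent: int = 6) -> list[dict]:
    # Keep system message(s) intact; drop all but the last N other turns.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_recent:]
```

Summarizing the dropped turns into a single short message, instead of discarding them, is the natural next step when older context still matters.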
6. Azure-specific: quotas and deployment SKUs
Throttling (429) causes retries and duplicated work—right-size deployment TPM/RPM against the documented quotas and limits. Two smaller deployments sometimes outperform one overloaded deployment.
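When 429s do happen, exponential backoff with jitter avoids the duplicated-work spiral. A sketch—`RateLimitError` here is a local stand-in for your SDK’s throttling exception (`openai.RateLimitError` in the OpenAI SDK), and honoring a `Retry-After` header, when present, beats guessing:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the SDK's throttling error (e.g. openai.RateLimitError)."""

def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    # Exponential backoff with jitter; re-raise once retries are exhausted.
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * 2 ** attempt * (1 + random.random())
            time.sleep(min(delay, 30.0))
```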
References
- OpenAI: Rate limits
- Azure OpenAI: Model summary — regional availability and lifecycle
- Anthropic prompt engineering overview — many principles transfer across vendors
Related: evaluation methodology, customer data considerations.