Cutting LLM spend: caching, batching, and smaller models
Semantic vs exact cache, prompt compression, routing to smaller models, batch APIs, and metrics—with OpenAI and Azure references.
LLM cost scales roughly with tokens × price per million tokens. The levers below are the ones I try first before redesigning architecture—together they often cut spend 30–60% without a “big bang” rewrite.
1. Measure before optimizing
Track per route or per feature:
- Prompt + completion tokens per successful user task (not per HTTP call if you chain calls).
- p95 latency and error rate.
If you use Azure OpenAI, the monitoring and diagnostics settings expose token usage; OpenAI’s dashboard does the same for direct API usage. OpenAI publishes pricing—compare gpt-4o vs gpt-4o-mini on your own tasks.
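As a first pass, per-request spend is just the token formula above. A minimal sketch—the prices in the table are illustrative placeholders, not current list prices; check the provider’s pricing page before using them:

```python
# (input, output) USD per 1M tokens -- PLACEHOLDER values for illustration.
PRICE_PER_MTOK = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in USD for one call: tokens x price per million tokens."""
    in_price, out_price = PRICE_PER_MTOK[model]
    return (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000
```

Aggregating this per route (rather than per HTTP call) is what makes the “per successful user task” metric above meaningful.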
2. Exact vs semantic caching
| Approach | When it helps | Caveat |
|---|---|---|
| Exact cache (hash of full prompt) | Repeated identical prompts (templates, FAQs) | Misses on whitespace or dynamic dates |
| Cache of retrieved chunks (RAG) | Same documents asked many ways | Invalidate when source doc changes |
| Summary cache (hash of document content) | Long context reused across sessions | Summaries go stale when the doc is edited |
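An exact cache is a few lines; normalizing whitespace before hashing addresses the first caveat in the table. A sketch, assuming an in-process dict stands in for a real cache store:

```python
import hashlib

_cache: dict[str, str] = {}  # in production: Redis or similar, with a TTL

def cache_key(prompt: str) -> str:
    # Collapse whitespace so trivial formatting differences still hit;
    # dynamic fields (dates, user IDs) must be stripped by the caller.
    normalized = " ".join(prompt.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def cached_complete(prompt: str, call_model) -> str:
    key = cache_key(prompt)
    if key not in _cache:
        _cache[key] = call_model(prompt)  # only pay for a model call on a miss
    return _cache[key]
```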
For RAG, cache embedding vectors keyed by hash(document_version) so you do not re-embed unchanged PDFs—see OpenAI embeddings guide.
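The same idea for embeddings—keyed by a version identifier rather than run metadata. A sketch; `embed_fn` is a stand-in for whatever embeddings call you use:

```python
import hashlib

_embedding_cache: dict[str, list[float]] = {}

def embed_document(text: str, doc_version: str, embed_fn) -> list[float]:
    # Key on a stable version identifier (content hash, ETag, etc.) so
    # unchanged PDFs are never re-embedded across ingestion runs.
    key = hashlib.sha256(doc_version.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(text)
    return _embedding_cache[key]
```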
Prompt caching (provider-side): OpenAI documents prompt caching for supported models—large static prefixes can see discounted cached input tokens. Verify eligibility and behavior for your region and model.
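Provider-side prompt caching typically matches on a common prefix, so message ordering matters: static instructions first, volatile values last. A sketch of that ordering (the function name is mine):

```python
def build_messages(static_policy: str, user_query: str, today: str) -> list[dict]:
    # Static, reusable prefix first so a prefix-matching cache can hit it;
    # anything volatile (dates, per-user data) goes at the end.
    return [
        {"role": "system", "content": static_policy},
        {"role": "user", "content": f"Today: {today}\n\n{user_query}"},
    ]
```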
3. Batch when latency allows
Interactive chat rarely batches well. Offline jobs (tagging, summarization backfills) can use the Batch API where supported—often cheaper and more tolerant of rate limits.
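A batch job is usually a JSONL file of requests with a `custom_id` per line for joining results back. A sketch for an offline tagging backfill—the upload/submit calls assume the OpenAI Python SDK and a `client` instance, and are shown as comments since exact parameters may differ for your account; verify against the Batch API docs:

```python
import json

def build_batch_lines(texts: list[str], model: str = "gpt-4o-mini") -> list[str]:
    """One JSONL line per request; custom_id lets you join results later."""
    lines = []
    for i, text in enumerate(texts):
        lines.append(json.dumps({
            "custom_id": f"tag-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": f"Tag this text: {text}"}],
            },
        }))
    return lines

# Upload and submit (network calls, not run here):
# f = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
# client.batches.create(input_file_id=f.id, endpoint="/v1/chat/completions",
#                       completion_window="24h")
```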
4. Model routing
Use the smallest model that passes your eval bar for each subtask:
- Router pattern: a lightweight classifier or heuristic sends “easy” queries to gpt-4o-mini and “hard” ones to a larger model.
- Structured extraction often works on smaller models if the schema is strict and you validate outputs.
Pair routing with the evaluation loop in evaluating Copilot-style features—routing is useless if quality regresses silently.
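A heuristic router can be this simple to start. The signals, thresholds, and model names below are illustrative assumptions—replace them with whatever your eval set actually supports:

```python
def route_model(query: str, schema_required: bool) -> str:
    # Illustrative heuristic only: real signals and thresholds should come
    # from your own eval results, and model names from your deployments.
    hard_signals = ("explain why", "compare", "step by step", "prove")
    is_hard = len(query) > 400 or any(s in query.lower() for s in hard_signals)
    if schema_required or not is_hard:
        return "gpt-4o-mini"  # strict-schema extraction and easy queries
    return "gpt-4o"           # escalate long or reasoning-heavy queries
```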
5. Prompt discipline = token discipline
- Put instructions once in a system message; avoid repeating boilerplate per turn.
- Prefer JSON or typed outputs with a strict schema—fewer repair retries. OpenAI documents structured outputs for supported models.
- Trim irrelevant history in chat—sliding window or summarization of older turns.
Example (conceptual token saver):

```python
# Avoid: pasting the same 2k-token policy in every user message
# Prefer: system message with policy + short user turns
```
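For the sliding-window point above, a minimal trimmer. Counting turns is the simple form; a token budget (e.g. via a tokenizer like tiktoken) is tighter but this shows the shape:

```python
def trim_history(messages: list[dict], max_recent: int = 6) -> list[dict]:
    # Keep system message(s) intact; drop all but the last N other turns.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_recent:]
```

Summarizing the dropped turns into a single short message, instead of discarding them, is the natural next step when older context still matters.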
6. Azure-specific: quotas and deployment SKUs
Throttling (429) causes retries and duplicated work—right-size deployment TPM/RPM against the documented quotas and limits. Two smaller deployments sometimes outperform one overloaded deployment.
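When 429s do happen, exponential backoff with jitter avoids the duplicated-work spiral. A sketch—`RateLimitError` here is a local stand-in for your SDK’s throttling exception (`openai.RateLimitError` in the OpenAI SDK), and honoring a `Retry-After` header, when present, beats guessing:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the SDK's throttling error (e.g. openai.RateLimitError)."""

def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    # Exponential backoff with jitter; re-raise once retries are exhausted.
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * 2 ** attempt * (1 + random.random())
            time.sleep(min(delay, 30.0))
```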
References
- OpenAI: Rate limits
- Azure OpenAI: Model summary — regional availability and lifecycle
- Anthropic prompt engineering overview — many principles transfer across vendors
Related: evaluation methodology, customer data considerations.