LLM costs can spiral fast. Here are three strategies we've used to cut inference costs by 40% while maintaining, and sometimes improving, output quality.
1. Semantic caching
Not every query needs a trip to the LLM. By caching responses to semantically similar queries, we reduced API calls by 25%.
```python
import hashlib
from numpy import dot
from numpy.linalg import norm

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors
    return dot(a, b) / (norm(a) * norm(b))

class SemanticCache:
    def __init__(self, threshold=0.92):
        self.store = {}             # key -> (embedding, response)
        self.threshold = threshold  # minimum similarity for a cache hit

    def get(self, embedding):
        # Return the first cached response that is similar enough to the query
        for key, (cached_emb, response) in self.store.items():
            if cosine_sim(embedding, cached_emb) > self.threshold:
                return response
        return None
```
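The snippet above only covers the read path. Here is a minimal sketch of how the write path and a lookup-before-call wrapper might fit together. It rests on a few assumptions that are not in the original: `WritableSemanticCache`, `put`, and `answer` are hypothetical names, entries are keyed by a SHA-256 hash of the query text, and `embed` and `call_llm` are placeholders for whatever embedding and completion functions you use. It uses plain Python lists rather than NumPy to stay dependency-free, and it returns the closest match rather than the first one above the threshold.

```python
import hashlib
import math


def cosine_sim(a, b):
    """Cosine similarity for two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as matching nothing
    return dot / (norm_a * norm_b)


class WritableSemanticCache:
    """Hypothetical cache variant with both a read and a write path."""

    def __init__(self, threshold=0.92):
        self.store = {}  # key -> (embedding, response)
        self.threshold = threshold

    def put(self, query, embedding, response):
        # Assumption: entries are keyed by a hash of the query text.
        key = hashlib.sha256(query.encode("utf-8")).hexdigest()
        self.store[key] = (list(embedding), response)

    def get(self, embedding):
        # Return the closest cached response above the threshold, if any.
        best_score, best_response = self.threshold, None
        for cached_emb, response in self.store.values():
            score = cosine_sim(embedding, cached_emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response


def answer(query, cache, embed, call_llm):
    """Serve from the cache when possible; otherwise call the model and store."""
    embedding = embed(query)
    cached = cache.get(embedding)
    if cached is not None:
        return cached
    response = call_llm(query)
    cache.put(query, embedding, response)
    return response
```

A linear scan like this is fine for a few thousand entries. Beyond that, a vector index is the usual next step.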
2. Model routing
Not every task needs GPT-4. A classifier routes simple queries (summarization, classification) to cheaper models and reserves the big models for complex reasoning.
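As a rough illustration of the routing idea, the sketch below swaps the classifier for a keyword heuristic. Everything named here is an assumption for the example: `route_query`, the hint list, and the model names `small-model` and `large-model` are placeholders, and a production router would more likely use a trained classifier.

```python
from dataclasses import dataclass

# Hypothetical hints that a query is a "simple" task (summarization, classification).
SIMPLE_TASK_HINTS = (
    "summarize", "summarise", "summary",
    "classify", "classification", "categorize",
)


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


def route_query(query, cheap_model="small-model", strong_model="large-model"):
    """Pick a model tier for a query; model names are placeholders."""
    text = query.lower()
    for hint in SIMPLE_TASK_HINTS:
        if hint in text:
            return Route(cheap_model, f"matched simple-task hint '{hint}'")
    # When unsure, fall back to the stronger model to protect quality.
    return Route(strong_model, "no simple-task hint matched")
```

Defaulting to the stronger model when nothing matches trades some savings for safety: a misrouted hard query costs more in quality than a misrouted easy one costs in tokens.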
3. Prompt distillation
Long prompts are expensive. By distilling your system prompts (removing redundant examples, tightening instructions), you can shave 15-20% off token usage with no quality impact.
- Audit your prompts monthly for cruft
- Use few-shot examples sparingly; three is often enough
- Move static content to external retrieval
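One way to make the audit concrete is to cap few-shot examples when building the prompt and measure the difference. The sketch below is illustrative only: `build_prompt`, `approx_tokens`, and `token_savings` are hypothetical helpers, and whitespace word counts stand in for a real tokenizer, so treat the numbers as a rough proxy.

```python
def approx_tokens(text):
    """Rough token estimate: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())


def build_prompt(instructions, examples, max_examples=3):
    """Assemble a prompt from instructions plus at most max_examples (question, answer) pairs."""
    parts = [instructions.strip()]
    for question, reply in examples[:max_examples]:
        parts.append(f"Q: {question}\nA: {reply}")
    return "\n\n".join(parts)


def token_savings(before, after):
    """Fraction of approximate tokens removed going from before to after."""
    before_count = approx_tokens(before)
    if before_count == 0:
        return 0.0
    return (before_count - approx_tokens(after)) / before_count
```

Running `token_savings` on the old and new prompt during each audit gives you a number to track over time. Quality still has to be checked separately against your own evaluation set.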
The most expensive token is the one you didn't need to send.