WEBHARMONIX
Cost

Cutting inference cost by 40% without losing quality

Routing, caching and distillation strategies that paid off.

By Team Syntheon

LLM costs can spiral fast. Here are three strategies we've used to cut inference costs by 40% while maintaining, and sometimes improving, output quality.

1. Semantic caching

Not every query needs a trip to the LLM. By caching responses to semantically similar queries, we reduced API calls by 25%.

python
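import hashlib

from numpy import dot
from numpy.linalg import norm


def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return dot(a, b) / (norm(a) * norm(b))


class SemanticCache:
    def __init__(self, threshold=0.92):
        self.store = {}  # key -> (embedding, response)
        self.threshold = threshold

    def get(self, embedding):
        # Linear scan: return the first cached response above the threshold.
        for cached_emb, response in self.store.values():
            if cosine_sim(embedding, cached_emb) > self.threshold:
                return response
        return None

    def put(self, query, embedding, response):
        # Assumed writer: key each entry by a hash of the query text.
        key = hashlib.sha256(query.encode("utf-8")).hexdigest()
        self.store[key] = (embedding, response)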
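The class above only covers lookups. Here is a minimal sketch of how such a cache might sit in front of the model, assuming a cache-aside flow: check the cache, call the LLM on a miss, then store the result. `BestMatchCache`, `cached_completion`, `embed` and `call_llm` are hypothetical names for illustration, not part of any library. This variant returns the closest match rather than the first one above the threshold, and uses only the standard library so it runs on its own.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_sim(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class BestMatchCache:
    """Toy in-memory cache (hypothetical) returning the closest entry above a threshold."""

    def __init__(self, threshold: float = 0.92) -> None:
        self.entries: List[Tuple[Vector, str]] = []
        self.threshold = threshold

    def get(self, embedding: Vector) -> Optional[str]:
        # Track the best score seen so far, starting at the threshold itself.
        best_score, best_response = self.threshold, None
        for cached_emb, response in self.entries:
            score = cosine_sim(embedding, cached_emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response

    def put(self, embedding: Vector, response: str) -> None:
        self.entries.append((embedding, response))


def cached_completion(
    query: str,
    embed: Callable[[str], Vector],   # assumed: your embedding function
    call_llm: Callable[[str], str],   # assumed: your model client
    cache: BestMatchCache,
) -> str:
    """Return a cached response when one is close enough, else call the model."""
    embedding = embed(query)
    hit = cache.get(embedding)
    if hit is not None:
        return hit
    response = call_llm(query)
    cache.put(embedding, response)
    return response
```

A linear scan like this is fine for a few thousand entries. Beyond that, a vector index is the usual next step. The threshold is worth tuning on real traffic: too low and users get answers to a different question, too high and the hit rate collapses.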

2. Model routing

Not every task needs GPT-4. A classifier routes simple queries (summarization, classification) to cheaper models and reserves the big models for complex reasoning.
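A rough sketch of what that routing layer could look like. The keyword classifier here is only a stand-in, since a production setup would more likely use a small trained model. The model identifiers `small-model` and `large-model` are placeholders, not real model names.

```python
from dataclasses import dataclass
from typing import Callable

# Placeholder model identifiers (assumptions for illustration).
CHEAP_MODEL = "small-model"
LARGE_MODEL = "large-model"

# Task types treated as simple, following the examples in the text.
SIMPLE_TASKS = {"summarization", "classification"}


@dataclass(frozen=True)
class Route:
    model: str
    task: str


def keyword_classifier(query: str) -> str:
    """Stand-in classifier: guesses the task type from the opening words."""
    text = query.strip().lower()
    if text.startswith(("summarize", "tl;dr")):
        return "summarization"
    if text.startswith(("classify", "label", "categorize")):
        return "classification"
    return "reasoning"


def route(query: str, classify: Callable[[str], str] = keyword_classifier) -> Route:
    """Send simple tasks to the cheap model and everything else to the large one."""
    task = classify(query)
    model = CHEAP_MODEL if task in SIMPLE_TASKS else LARGE_MODEL
    return Route(model=model, task=task)
```

Defaulting unknown tasks to the large model is the conservative choice: a misrouted query costs a little extra money rather than a bad answer.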

3. Prompt distillation

Long prompts are expensive. By distilling your system prompts (removing redundant examples and tightening instructions), you can shave 15-20% off token usage with no quality impact.

  • Audit your prompts monthly for cruft
  • Use few-shot examples sparingly; three is often enough
  • Move static content to external retrieval
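One way to make the audit concrete is a small script that dedupes few-shot examples, caps them at three, and reports the estimated saving. This is a sketch only: the helper names are hypothetical, and `estimate_tokens` uses a crude 4-characters-per-token approximation, so swap in your model's real tokenizer before trusting the numbers.

```python
from typing import List

MAX_FEW_SHOT = 3  # the "three is often enough" guideline


def estimate_tokens(text: str) -> int:
    """Very rough estimate (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def distill_examples(examples: List[str], limit: int = MAX_FEW_SHOT) -> List[str]:
    """Drop duplicates (ignoring case and whitespace), then keep the first `limit`."""
    seen = set()
    kept = []
    for example in examples:
        normalized = " ".join(example.split()).lower()
        if normalized in seen:
            continue
        seen.add(normalized)
        kept.append(example.strip())
    return kept[:limit]


def build_prompt(instructions: str, examples: List[str]) -> str:
    """Join instructions and examples with blank lines between them."""
    return "\n\n".join([instructions.strip(), *examples])


def savings(before: str, after: str) -> float:
    """Fraction of estimated tokens removed by distillation."""
    tokens_before = estimate_tokens(before)
    return (tokens_before - estimate_tokens(after)) / tokens_before
```

Running this against each production prompt during the monthly audit gives a quick before/after figure. Whether quality holds is a separate check, so pair it with an evaluation set.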

The most expensive token is the one you didn't need to send.