Cutting inference cost by 40%, Syntheon

LLM costs can spiral fast. Here are three strategies we've used to cut inference costs by 40% while maintaining, and sometimes improving, output quality.

1. Semantic caching

Not every query needs a trip to the LLM. By caching responses to semantically similar queries, we reduced API calls by 25%.

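```python
import hashlib
from numpy import dot
from numpy.linalg import norm

def cosine_sim(a, b):
    # Cosine similarity of two embedding vectors (assumes neither is all zeros).
    return dot(a, b) / (norm(a) * norm(b))

class SemanticCache:
    def __init__(self, threshold=0.92):
        self.store = {}  # query hash -> (embedding, response)
        self.threshold = threshold

    def put(self, query, embedding, response):
        # Key on a hash of the query text so a repeated query overwrites its old entry.
        key = hashlib.sha256(query.encode("utf-8")).hexdigest()
        self.store[key] = (embedding, response)

    def get(self, embedding):
        # Linear scan; return the most similar cached response above the threshold.
        best_score, best_response = self.threshold, None
        for cached_emb, response in self.store.values():
            score = cosine_sim(embedding, cached_emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response
```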
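
To show where the cache sits in a request path, here is a minimal read-through sketch. It assumes a cache object with a `get(embedding)` method and a `store` dict holding `(embedding, response)` tuples, as in the class above. The `embed` and `call_llm` callables are hypothetical stand-ins for your embedding model and LLM client, not real library functions.

```python
def answer_with_cache(query, cache, embed, call_llm):
    """Return (response, was_cache_hit), calling the LLM only on a cache miss."""
    embedding = embed(query)        # hypothetical: text -> embedding vector
    cached = cache.get(embedding)
    if cached is not None:
        return cached, True
    response = call_llm(query)      # hypothetical: text -> model response
    # Store in the (embedding, response) layout that get() reads back.
    cache.store[query] = (embedding, response)
    return response, False
```

The threshold does the real work here: set it too low and users get answers to questions they did not ask, so it is worth tuning against a sample of real queries.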

2. Model routing

Not every task needs GPT-4. A classifier routes simple queries (summarization, classification) to cheaper models and reserves the big models for complex reasoning.
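
As a rough illustration of the routing idea, the sketch below uses a keyword-and-length heuristic in place of a trained classifier. The model names, the hint list, and the word limit are all assumptions made up for the example; a production router would more likely be a small classifier trained on labeled traffic.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute whatever your provider offers.
CHEAP_MODEL = "small-model"
STRONG_MODEL = "large-model"

# Illustrative keyword list standing in for a trained classifier.
SIMPLE_TASK_HINTS = ("summarize", "summarise", "classify", "categorize", "tl;dr")

@dataclass
class Route:
    model: str
    reason: str

def route_query(query: str, max_simple_words: int = 400) -> Route:
    """Pick a model tier using a crude keyword-and-length heuristic."""
    text = query.lower()
    looks_simple = any(hint in text for hint in SIMPLE_TASK_HINTS)
    is_short = len(text.split()) <= max_simple_words
    if looks_simple and is_short:
        return Route(CHEAP_MODEL, "simple task")
    # Default to the stronger model when unsure, trading cost for quality.
    return Route(STRONG_MODEL, "complex or unrecognized task")
```

Defaulting to the stronger model on anything unrecognized keeps misroutes on the expensive side rather than the low-quality side.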

3. Prompt distillation

Long prompts are expensive. By distilling your system prompts (removing redundant examples and tightening instructions), you can shave 15-20% off token usage with no quality impact.

Audit your prompts monthly for cruft
Use few-shot examples sparingly; 3 is often enough
Move static content to external retrieval
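
A small sketch of what the few-shot trimming could look like in code. The function names are hypothetical, and `estimate_tokens` counts whitespace-separated words as a rough proxy; a real tokenizer will give different numbers, so treat the savings figure as directional only.

```python
def estimate_tokens(text: str) -> int:
    # Rough proxy: whitespace-separated words, not a real tokenizer.
    return len(text.split())

def build_prompt(instructions: str, examples: list[str], max_examples: int = 3) -> str:
    """Assemble a system prompt, keeping at most max_examples unique few-shot examples."""
    unique = list(dict.fromkeys(examples))  # drop exact duplicates, preserve order
    kept = unique[:max_examples]
    return "\n\n".join([instructions.strip(), *kept])

def savings(before: str, after: str) -> float:
    """Fraction of estimated tokens removed by distillation."""
    b = estimate_tokens(before)
    return 0.0 if b == 0 else (b - estimate_tokens(after)) / b
```

Running `savings` on the old and new prompt during a monthly audit gives a quick number to track, though quality should still be checked on a held-out set of queries.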

The most expensive token is the one you didn't need to send.
