
Why your LLM bill is exploding — and how semantic caching can cut it by 73%

Our LLM API bill was growing 30% month-over-month. Traffic was increasing, but not that fast. When I analyzed our query logs, I found the real problem: Users ask the same questions in different ways.

"What's your return policy?", "How do I return something?", and "Can I get a refund?" were all hitting our LLM separately, generating nearly identical responses, each incurring full API costs.

Exact-match caching, the obvious first solution, captured only 18% of these redundant calls. The same semantic question, phrased differently, bypassed the cache entirely.

So I implemented semantic caching, which matches queries by what they mean rather than how they're worded. Our cache hit rate rose to 67%, and LLM API costs fell by 73%. But getting there required solving problems that naive implementations miss.

Why exact-match caching falls short

Traditional caching uses query text as the cache key. This works when queries are identical:

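    # Exact-match caching: only identical query text produces a hit
    def lookup_exact(cache: dict, query_text: str):
        # Note: hash() of a str varies between processes; use hashlib for a shared cache
        cache_key = hash(query_text)
        if cache_key in cache:
            return cache[cache_key]
        return None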

But users don't phrase questions identically. My analysis of 100,000 production queries found:

Only 18% were exact duplicates of previous queries
47% were semantically similar to previous queries (same intent, different wording)
35% were genuinely novel queries

That 47% represented massive cost savings we were missing. Each semantically-similar query triggered a full LLM call, generating a response nearly identical to one we'd already computed.
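To make that breakdown concrete, here is a minimal sketch of how the exact-duplicate share of a query log could be measured. The helper names (`exact_key`, `exact_duplicate_rate`) and the toy log are illustrative assumptions, not the production analysis. Even with light normalization, the three return-policy phrasings from earlier all count as new queries.

```python
import hashlib


def exact_key(query_text: str) -> str:
    """Stable exact-match cache key: SHA-256 of the lightly normalized text."""
    normalized = " ".join(query_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def exact_duplicate_rate(queries: list[str]) -> float:
    """Fraction of queries whose key was already seen earlier in the log."""
    seen: set[str] = set()
    duplicates = 0
    for query in queries:
        key = exact_key(query)
        if key in seen:
            duplicates += 1
        seen.add(key)
    return duplicates / len(queries) if queries else 0.0


# Toy log (illustrative only, not production data).
log = [
    "What's your return policy?",
    "How do I return something?",
    "Can I get a refund?",
    "what's your   return policy?",  # differs only in case and spacing
]
print(exact_duplicate_rate(log))  # 0.25
```

Only the last entry is caught, because it differs from the first in case and whitespace alone. The other three share an intent but produce three different keys.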

Semantic caching architecture

Semantic caching replaces text-based keys with embedding-based similarity lookup:

    from datetime import datetime, timezone
    from typing import Optional

    class SemanticCache:
        def __init__(self, embedding_model, similarity_threshold=0.92):
            self.embedding_model = embedding_model
            self.threshold = similarity_threshold
            self.vector_store = VectorStore()      # FAISS, Pinecone, etc.
            self.response_store = ResponseStore()  # Redis, DynamoDB, etc.

        def get(self, query: str) -> Optional[str]:
            """Return cached response if semantically similar query exists."""
            query_embedding = self.embedding_model.encode(query)
            # Find most similar cached query
            matches = self.vector_store.search(query_embedding, top_k=1)
            if matches and matches[0].similarity >= self.threshold:
                cache_id = matches[0].id
                entry = self.response_store.get(cache_id)
                # set() stores a dict, so unwrap the response text
                return entry['response'] if entry else None
            return None

        def set(self, query: str, response: str):
            """Cache query-response pair."""
            query_embedding = self.embedding_model.encode(query)
            cache_id = generate_id()
            self.vector_store.add(cache_id, query_embedding)
            self.response_store.set(cache_id, {
                'query': query,
                'response': response,
                # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
                'timestamp': datetime.now(timezone.utc)
            })

The key insight: Instead of hashing query text, I embed queries into vector space and find cached queries within a similarity threshold.
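The lookup itself is easy to picture with a brute-force version. The sketch below is a hypothetical in-memory stand-in for `VectorStore` (the class `InMemoryVectorStore`, the `Match` record, and the two-dimensional toy vectors are all assumptions for illustration). It scores every stored embedding by cosine similarity and returns the best matches. A real deployment would use an approximate nearest-neighbor index instead of a linear scan.

```python
import math
from dataclasses import dataclass


@dataclass
class Match:
    id: str
    similarity: float


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Hypothetical brute-force stand-in for VectorStore (O(n) per search)."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def add(self, cache_id: str, embedding: list[float]) -> None:
        self._vectors[cache_id] = list(embedding)

    def search(self, embedding: list[float], top_k: int = 1) -> list[Match]:
        scored = [
            Match(cache_id, cosine_similarity(embedding, vector))
            for cache_id, vector in self._vectors.items()
        ]
        scored.sort(key=lambda m: m.similarity, reverse=True)
        return scored[:top_k]


# Toy two-dimensional "embeddings" (illustrative only).
store = InMemoryVectorStore()
store.add("returns", [1.0, 0.0])
store.add("shipping", [0.0, 1.0])
best = store.search([0.9, 0.1], top_k=1)[0]
print(best.id)  # returns
```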

The threshold problem

The similarity threshold is the critical parameter. Set it too high, and you miss valid cache hits. Set it too low, and you return wrong responses.

Our initial threshold of 0.85 seemed reasonable; 85% similar should be "the same question," right?

Wrong. At 0.85, we got cache hits like:

Query: "How do I cancel my subscription?"
Cached: "How do I cancel my order?"
Similarity: 0.87

These are different questions with different answers. Returning the cached response would be incorrect.

I discovered that optimal thresholds vary by query type:

Query type	Optimal threshold	Rationale
FAQ-style questions	0.94	High precision needed; wrong answers damage trust
Product searches	0.88	More tolerance for near-matches
Support queries	0.92	Balance between coverage and accuracy
Transactional queries	0.97	Very low tolerance for errors

I implemented query-type-specific thresholds:

    class AdaptiveSemanticCache(SemanticCache):
        def __init__(self, embedding_model):
            # Reuse the embedding model and stores set up by SemanticCache
            super().__init__(embedding_model)
            self.thresholds = {
                'faq': 0.94,
                'search': 0.88,
                'support': 0.92,
                'transactional': 0.97,
                'default': 0.92
            }
            self.query_classifier = QueryClassifier()

        def get_threshold(self, query: str) -> float:
            query_type = self.query_classifier.classify(query)
            return self.thresholds.get(query_type, self.thresholds['default'])

        def get(self, query: str) -> Optional[str]:
            threshold = self.get_threshold(query)
            query_embedding = self.embedding_model.encode(query)
            matches = self.vector_store.search(query_embedding, top_k=1)
            if matches and matches[0].similarity >= threshold:
                entry = self.response_store.get(matches[0].id)
                # The store holds a dict per entry; return only the response text
                return entry['response'] if entry else None
            return None
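The `QueryClassifier` is not shown here. As a rough illustration of the interface it needs, the sketch below uses a hypothetical keyword-based classifier. The class name `KeywordQueryClassifier` and its keyword lists are assumptions made up for this example, and a production classifier would more likely be a trained model.

```python
class KeywordQueryClassifier:
    """Hypothetical rule-based stand-in for QueryClassifier."""

    # Checked in order; keyword lists are illustrative assumptions.
    RULES = [
        ("transactional", ("cancel", "refund", "charge", "payment")),
        ("search", ("find", "show me", "looking for")),
        ("faq", ("policy", "hours", "do you")),
        ("support", ("error", "not working", "help")),
    ]

    def classify(self, query: str) -> str:
        text = query.lower()
        for query_type, keywords in self.RULES:
            if any(keyword in text for keyword in keywords):
                return query_type
        return "default"


THRESHOLDS = {"faq": 0.94, "search": 0.88, "support": 0.92,
              "transactional": 0.97, "default": 0.92}

classifier = KeywordQueryClassifier()
query_type = classifier.classify("How do I cancel my subscription?")
print(query_type, THRESHOLDS[query_type])  # transactional 0.97
```

With this routing, the subscription/order pair from earlier (similarity 0.87) falls well short of the 0.97 bar and is treated as a miss.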

Threshold tuning methodology

I couldn't tune thresholds blindly. I needed ground truth on which query pairs were actually "the same."

Our methodology:

Step 1: Sample query pairs. I sampled 5,000 query pairs at various similarity levels (0.80-0.99).

Step 2: Human labeling. Annotators labeled each pair as "same intent" or "different intent." I used three annotators per pair and took a majority vote.
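The majority vote can be sketched in a few lines. The function names and the label strings ("same", "different") are assumptions for illustration. With binary labels and an odd number of annotators, a tie cannot occur.

```python
from collections import Counter


def majority_label(votes: list[str]) -> str:
    """Return the label chosen by most annotators (expects an odd number of votes)."""
    if len(votes) % 2 == 0:
        raise ValueError("use an odd number of annotators to avoid ties")
    label, _count = Counter(votes).most_common(1)[0]
    return label


def to_binary(label: str) -> int:
    """Map a label to 1 (same intent) or 0 (different intent)."""
    return 1 if label == "same" else 0


votes = ["same", "different", "same"]
print(to_binary(majority_label(votes)))  # 1
```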

Step 3: Compute precision/recall curves. For each threshold, we computed:

Precision: Of cache hits, what fraction had the same intent?
Recall: Of same-intent pairs, what fraction did we cache-hit?

    def compute_precision_recall(pairs, labels, threshold):
        """Compute precision and recall at given similarity threshold."""
        predictions = [1 if pair.similarity >= threshold else 0 for pair in pairs]
        true_positives = sum(1 for p, label in zip(predictions, labels) if p == 1 and label == 1)
        false_positives = sum(1 for p, label in zip(predictions, labels) if p == 1 and label == 0)
        false_negatives = sum(1 for p, label in zip(predictions, labels) if p == 0 and label == 1)
        predicted_positive = true_positives + false_positives
        actual_positive = true_positives + false_negatives
        precision = true_positives / predicted_positive if predicted_positive > 0 else 0
        recall = true_positives / actual_positive if actual_positive > 0 else 0
        return precision, recall

Step 4: Select threshold based on cost of errors. For FAQ queries where wrong answers damage trust, I optimized for precision (0.94 threshold gave 98% precision). For search queries where missing a cache hit just costs money, I optimized for recall (0.88 threshold).
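One way to turn Step 4 into a procedure, sketched under assumptions: sweep a list of candidate thresholds in ascending order and keep the lowest one whose precision meets a target. Because recall can only fall as the threshold rises, the lowest passing candidate also has the highest recall among those that pass. The `LabeledPair` type, the helper names, and the toy data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LabeledPair:
    similarity: float
    same_intent: bool  # majority-vote label


def precision_recall(pairs, threshold):
    """Precision and recall when every pair at or above threshold is a cache hit."""
    hits = [p for p in pairs if p.similarity >= threshold]
    positives = [p for p in pairs if p.same_intent]
    true_positives = sum(1 for p in hits if p.same_intent)
    precision = true_positives / len(hits) if hits else 0.0
    recall = true_positives / len(positives) if positives else 0.0
    return precision, recall


def lowest_threshold_meeting(pairs, target_precision, candidates):
    """Lowest candidate threshold whose precision meets the target, else None."""
    for threshold in sorted(candidates):
        precision, _recall = precision_recall(pairs, threshold)
        if precision >= target_precision:
            return threshold
    return None


# Toy labeled pairs (illustrative only).
pairs = [
    LabeledPair(0.86, False),
    LabeledPair(0.89, True),
    LabeledPair(0.91, False),
    LabeledPair(0.93, True),
    LabeledPair(0.95, True),
    LabeledPair(0.98, True),
]
candidates = [0.85, 0.88, 0.92, 0.94, 0.97]
print(lowest_threshold_meeting(pairs, 0.95, candidates))  # 0.92
```

For a recall-oriented category such as search, the same sweep can be run with a lower precision target.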

Latency overhead

Semantic caching adds latency: You must embed the query and search the vector store before knowing whether to call the LLM.

Our measurements:

Operation	Latency (p50)	Latency (p99)
Query embedding	12ms	28ms
Vector search	8ms	19ms
Total cache lookup	20ms	47ms
LLM API call	850ms	2400ms

The 20ms overhead is negligible compared to the 850ms LLM call we avoid on cache hits. Even at p99, the 47ms overhead is acceptable.

However, cache misses now take 20ms longer than before (embedding + search + LLM call). At our 67% hit rate, the math works out favorably:

Before: 100% of queries × 850ms = 850ms average
After: (33% × 870ms) + (67% × 20ms) = 287ms + 13ms = 300ms average

Net latency improvement of 65% alongside the cost reduction.
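The same arithmetic as a small function, using the p50 figures from the table. The function name is an assumption for illustration. Unrounded, the average comes to 300.5ms and the improvement to 64.6%, which rounds to the 65% quoted above.

```python
def expected_latency_ms(hit_rate: float, lookup_ms: float, llm_ms: float) -> float:
    """Average latency: hits pay only the lookup, misses pay lookup plus LLM call."""
    return hit_rate * lookup_ms + (1.0 - hit_rate) * (lookup_ms + llm_ms)


before = 850.0
after = expected_latency_ms(hit_rate=0.67, lookup_ms=20.0, llm_ms=850.0)
print(round(after, 1))                       # 300.5
print(round(100 * (1 - after / before), 1))  # 64.6
```

The formula also gives the break-even point: average latency improves whenever the hit rate exceeds lookup_ms / llm_ms, which is 20 / 850, or about 2.4%, with these numbers.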

Cache invalidation

Cached responses go stale. Product information changes, policies update and yesterday's correct answer becomes today's wrong answer.

I implemented three invalidation strategies:

Time-based TTL

Simple expiration based on content type:

    from datetime import timedelta

    TTL_BY_CONTENT_TYPE = {
        'pricing': timedelta(hours=4),      # Changes frequently
        'policy': timedelta(days=7),        # Changes rarely
        'product_info': timedelta(days=1),  # Daily refresh
        'general_faq': timedelta(days=14),  # Very stable
    }
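A minimal sketch of how such a table might be applied at read time, assuming each cache entry records a timezone-aware timestamp and a content type. The `is_expired` helper and the `DEFAULT_TTL` fallback are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

TTL_BY_CONTENT_TYPE = {
    "pricing": timedelta(hours=4),
    "policy": timedelta(days=7),
    "product_info": timedelta(days=1),
    "general_faq": timedelta(days=14),
}
DEFAULT_TTL = timedelta(days=1)  # assumption: fallback for unlisted content types


def is_expired(cached_at: datetime, content_type: str,
               now: Optional[datetime] = None) -> bool:
    """True once the entry is older than the TTL for its content type."""
    now = now or datetime.now(timezone.utc)
    ttl = TTL_BY_CONTENT_TYPE.get(content_type, DEFAULT_TTL)
    return now - cached_at > ttl


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
cached_at = now - timedelta(hours=5)
print(is_expired(cached_at, "pricing", now))  # True
print(is_expired(cached_at, "policy", now))   # False
```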

Event-based invalidation

When underlying data changes, invalidate related cache entries:

    class CacheInvalidator:
        def on_content_update(self, content_id: str, content_type: str):
            """Invalidate cache entries related to updated content."""
            # Find cached queries that referenced this content
            affected_queries = self.find_queries_referencing(content_id)
            for query_id in affected_queries:
                self.cache.invalidate(query_id)
            self.log_invalidation(content_id, len(affected_queries))
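Finding the affected queries requires knowing which content each cached response drew on. One possible approach, sketched here as an assumption rather than the implementation behind `find_queries_referencing`, is a reverse index that is populated at cache-write time. The `ContentIndex` class and the sample IDs are hypothetical.

```python
from collections import defaultdict


class ContentIndex:
    """Hypothetical reverse index: content_id -> cache entry ids built from it."""

    def __init__(self) -> None:
        self._by_content: dict[str, set[str]] = defaultdict(set)

    def record(self, cache_id: str, content_ids: list[str]) -> None:
        """Call at cache-write time with the content the response drew on."""
        for content_id in content_ids:
            self._by_content[content_id].add(cache_id)

    def pop_affected(self, content_id: str) -> set[str]:
        """Return and forget the cache ids tied to an updated content item."""
        return self._by_content.pop(content_id, set())


index = ContentIndex()
index.record("q1", ["returns-policy"])
index.record("q2", ["returns-policy", "shipping-rates"])
print(sorted(index.pop_affected("returns-policy")))  # ['q1', 'q2']
```

An entry such as `q2` stays listed under its other content IDs, so a later update to those would try to invalidate it again. That is harmless as long as invalidating a missing entry is a no-op.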

Staleness detection

For responses that might become stale without explicit events, I implemented periodic freshness checks:

    def check_freshness(self, cached_response: dict) -> bool:
        """Verify cached response is still valid."""
        # Re-run the query against current data
        fresh_response = self.generate_response(cached_response['query'])
        # Compare semantic similarity of responses
        cached_embedding = self.embed(cached_response['response'])
        fresh_embedding = self.embed(fresh_response)
        similarity = cosine_similarity(cached_embedding, fresh_embedding)
        # If responses diverged significantly, invalidate
        if similarity < 0.90:
            self.cache.invalidate(cached_response['id'])
            return False
        return True

We run freshness checks on a sample of cached entries daily, catching staleness that TTL and event-based invalidation miss.
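Each freshness check regenerates a response, so it costs one LLM call. Sampling keeps that cost bounded. The sketch below shows one way the daily sample could be drawn. The function name, the 5% default sample rate, and the injected `is_fresh` callback are assumptions for illustration.

```python
import random
from typing import Callable, Optional


def run_freshness_sample(
    entries: list[dict],
    is_fresh: Callable[[dict], bool],
    sample_rate: float = 0.05,  # assumption: share of entries checked per run
    seed: Optional[int] = None,
) -> list[str]:
    """Check a random sample of cached entries; return the ids found stale."""
    if not entries:
        return []
    rng = random.Random(seed)
    sample_size = max(1, round(len(entries) * sample_rate))
    sample = rng.sample(entries, min(sample_size, len(entries)))
    return [entry["id"] for entry in sample if not is_fresh(entry)]


entries = [{"id": str(i)} for i in range(100)]
stale = run_freshness_sample(entries, is_fresh=lambda entry: False, sample_rate=0.1)
print(len(stale))  # 10
```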

Production results

After three months in production:

Metric	Before	After	Change
Cache hit rate	18%	67%	+272%
LLM API costs	$47K/month	$12.7K/month	-73%
Average latency	850ms	300ms	-65%
False-positive rate	N/A	0.8%	—
Customer complaints (wrong answers)	Baseline	+0.3%	Minimal increase

The 0.8% false-positive rate (queries where we returned a cached response that was semantically incorrect) was within acceptable bounds. These cases occurred primarily at the boundaries of our threshold, where similarity was just above the cutoff but intent differed slightly.
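Since the errors cluster just above the cutoff, one option is to flag those borderline hits for human review or extra logging. This is a sketch of that idea, not something described above as deployed. The function name and the 0.02 margin are assumptions.

```python
def is_borderline_hit(similarity: float, threshold: float, margin: float = 0.02) -> bool:
    """True for cache hits whose similarity clears the threshold by less than margin."""
    return threshold <= similarity < threshold + margin


print(is_borderline_hit(0.93, 0.92))  # True
print(is_borderline_hit(0.96, 0.92))  # False
```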

Pitfalls to avoid

Don't use a single global threshold. Different query types have different tolerance for errors. Tune thresholds per category.

Don't try to skip the embedding step on cache hits. It's tempting to avoid the embedding overhead when a response is already cached, but the embedding is the lookup key: you can't know a query is a hit until you've embedded it and searched the vector store. The overhead is unavoidable.

Don't forget invalidation. Semantic caching without an invalidation strategy leads to stale responses that erode user trust. Build invalidation from day one.

Don't cache everything. Some queries shouldn't be cached: Personalized responses, time-sensitive information, transactional confirmations. Build exclusion rules.

    def should_cache(self, query: str, response: str) -> bool:
        """Determine if response should be cached."""
        # Don't cache personalized responses
        if self.contains_personal_info(response):
            return False
        # Don't cache time-sensitive information
        if self.is_time_sensitive(query):
            return False
        # Don't cache transactional confirmations
        if self.is_transactional(query):
            return False
        return True
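To show where the exclusion rule sits in the request path, here is a minimal sketch of the read-through flow. Everything in it is a hypothetical stand-in: `DictCache` is an exact-match placeholder exposing the same `get`/`set` interface as the semantic cache, `fake_llm` replaces the real API call, and the order-status rule is an invented example of an exclusion.

```python
from typing import Callable, Optional


class DictCache:
    """Hypothetical exact-match stand-in exposing the same get/set interface."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, query: str) -> Optional[str]:
        return self._data.get(query)

    def set(self, query: str, response: str) -> None:
        self._data[query] = response


def answer(query: str, cache, generate: Callable[[str], str],
           should_cache: Callable[[str, str], bool]) -> str:
    """Serve from cache when possible; otherwise generate and maybe store."""
    cached = cache.get(query)
    if cached is not None:
        return cached            # cache hit: no LLM call
    response = generate(query)   # cache miss: pay for the LLM call
    if should_cache(query, response):
        cache.set(query, response)
    return response


calls: list[str] = []


def fake_llm(query: str) -> str:
    calls.append(query)
    return f"answer to: {query}"


def allow_unless_order_status(query: str, response: str) -> bool:
    return "order status" not in query.lower()


cache = DictCache()
for q in ["What's your return policy?", "What's your return policy?",
          "What's my order status?", "What's my order status?"]:
    answer(q, cache, fake_llm, allow_unless_order_status)
print(len(calls))  # 3
```

The repeated policy question is served from the cache once, while the excluded order-status question reaches the model both times.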

Key takeaways

Semantic caching is a practical pattern for LLM cost control that captures redundancy exact-match caching misses. The key challenges are threshold tuning (use query-type-specific thresholds based on precision/recall analysis) and cache invalidation (combine TTL, event-based invalidation, and staleness detection).

At 73% cost reduction, this was our highest-ROI optimization for production LLM systems. The implementation complexity is moderate, but the threshold tuning requires careful attention to avoid quality degradation.

Sreenivasa Reddy Hulebeedu Reddy is a lead software engineer.
