A single conversation can fan out into a dozen calls — embedding, retrieval, model inference, post-processing, analytics. Without a clear caching strategy, AI bills become unpredictable, and unpredictable AI bills break SaaS pricing.
We run six layers, all backed by Redis: semantic cache (similarity over vectorized prompts), prompt-level cache (provider-side when supported), KB retrieval cache, widget-config edge cache, analytics dashboard cache, and ingestion-status cache.
The semantic layer is the most opinionated of the six. We hold the similarity threshold deliberately high — false positives in support are worse than misses, because a wrong cached answer destroys trust.
Prompt cache is the easiest win when you switch to a model that supports it. We get it for free when we run on Anthropic models on Bedrock; OpenAI on Bedrock does not yet.
The end result: most active projects hit cache for over half their conversations, and the pieces that do not cache are bounded in cost because retrieval is bounded.