← All field notesEngineering · March 18, 2026 · 7 min read

The six caches that keep our AI bill predictable

Token costs add up fast on a multi-tenant SaaS. Here is the layered cache strategy we use to keep margins healthy without hurting freshness.

Yusuf Al-RashidEngineering

A single conversation can fan out into a dozen calls — embedding, retrieval, model inference, post-processing, analytics. Without a clear caching strategy, AI bills become unpredictable, and unpredictable AI bills break SaaS pricing.

We run six layers, all backed by Redis: semantic cache (similarity over vectorized prompts), prompt-level cache (provider-side when supported), KB retrieval cache, widget-config edge cache, analytics dashboard cache, and ingestion-status cache.

The semantic layer is the most opinionated of the six. We hold the similarity threshold deliberately high — false positives in support are worse than misses, because a wrong cached answer destroys trust.

Prompt cache is the easiest win when you switch to a model that supports it. We get it for free when we run on Anthropic models on Bedrock; OpenAI on Bedrock does not yet.

The end result: most active projects hit cache for over half their conversations, and the pieces that do not cache are bounded in cost because retrieval is bounded.