← All field notesEngineering · March 18, 2026 · 7 min read

The six caches that keep our AI bill predictable

Token costs add up fast on a multi-tenant SaaS. Here is the layered cache strategy we use to keep margins healthy without hurting freshness.

Yusuf Al-RashidEngineering

A single conversation can fan out into a dozen calls — embedding, retrieval, model inference, post-processing, analytics. Without a clear caching strategy, AI bills become unpredictable, and unpredictable AI bills break SaaS pricing.

We run six layers, all backed by Redis: semantic cache (similarity over vectorized prompts), prompt-level cache (provider-side when supported), KB retrieval cache, widget-config edge cache, analytics dashboard cache, and ingestion-status cache.

The semantic layer is the most opinionated of the six. We hold the similarity threshold deliberately high — false positives in support are worse than misses, because a wrong cached answer destroys trust.

Prompt cache is the easiest win when you switch to a model that supports it. We get it for free when we run on Anthropic models on Bedrock; OpenAI on Bedrock does not yet.

The end result: most active projects hit cache for over half their conversations, and the pieces that do not cache are bounded in cost because retrieval is bounded.

More from Nexus

Browse all posts