Reduce LLM Costs with Semantic Caching: The Gateway Approach
LLM API costs scale linearly with request volume. Applications with repeated or semantically similar queries pay for inference on every request, even when the responses would be effectively identical. Semantic caching at the gateway layer intercepts these requests before they reach a provider and returns cached responses for queries that are semantically close to previously seen inputs. Bifrost, the open-source AI gateway written in Go by Maxim AI, implements semantic caching at the infrastructure layer with a configurable vector store backend and similarity threshold controls. No application code changes are required: caching applies to any application that routes through the gateway.
Why the Gateway Is the Right Place for Semantic Caching
Gateway-level semantic caching outperforms application-level caching implementations on four dimensions:
Single cache serves all applications. Application-level caching creates isolated cache stores per service. If two services ask the same question, each pays for inference and maintains its own cache. A gateway cache is shared: one cache lookup handles all applications routing through the gateway, and a cache hit for one application benefits every other application asking the same question.
Cross-provider cache applicability. A response cached from an OpenAI call can be served to a request routed to Anthropic or any other provider. The cache operates on the query semantics, not the provider. If your routing configuration shifts traffic between providers (for example, due to a failover or a cost-optimization routing rule), cached responses remain valid and continue to serve hits regardless of which provider would have handled the request.
No application code changes required. Application-level caching requires each development team to implement cache logic, manage TTLs, handle cache invalidation, and maintain embedding infrastructure. Gateway-level caching centralizes all of that: enable semantic caching in the gateway configuration, and every application that sends requests through the gateway benefits without any code changes.
Cache metrics centralized in observability. With application-level caching, each application tracks its own cache hit rate independently. Gateway-level caching surfaces cache hit rates, tokens saved, and API calls avoided through the gateway's observability layer, giving teams a unified view of caching effectiveness across all applications and all providers.
How Semantic Caching Works at the Gateway Layer
When a request arrives at the gateway with caching enabled, the processing sequence is:
- Normalize the request: the query text is extracted from the request body.
- Direct (hash) lookup: the normalized request is hashed and checked against the cache store. An exact match returns the cached response immediately, with no embedding call required.
- Semantic lookup (on direct miss): the query is sent to an embedding provider to generate a vector representation. That vector is compared against stored vectors in the vector store using similarity search. If the highest-scoring match exceeds the configured similarity threshold, the associated cached response is returned.
- Provider call (on cache miss): if no cached response meets the threshold, the request proceeds to the provider. The response is stored asynchronously after delivery, so the first request is never blocked by a cache write.
The key difference from exact-match caching is the semantic lookup step. Exact-match caching requires the query text to be identical character-for-character. It misses paraphrases, different orderings of the same question, and minor wording variations. Semantic caching catches these by comparing meaning rather than text, which produces substantially higher cache hit rates for natural language queries.
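The lookup sequence above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Bifrost's implementation: `TwoTierCache`, `normalize`, and `toy_embed` are hypothetical names, the bag-of-words embedding stands in for a real embedding provider, and a linear scan stands in for a vector store. A score equal to the threshold is treated as a hit, which is also an assumption.

```python
import hashlib
import math
import re
from typing import Callable, Optional

Vector = list[float]


def normalize(query: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", query.lower())
    return " ".join(cleaned.split())


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TwoTierCache:
    """Hypothetical cache: direct (hash) lookup first, semantic lookup on a miss."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float) -> None:
        self.embed = embed
        self.threshold = threshold
        self.by_hash: dict[str, str] = {}
        self.by_vector: list[tuple[Vector, str]] = []

    @staticmethod
    def _hash(query: str) -> str:
        return hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()

    def lookup(self, query: str) -> tuple[Optional[str], str]:
        """Return (response, tier); tier is 'direct', 'semantic', or 'miss'."""
        key = self._hash(query)
        if key in self.by_hash:
            # Exact match on the normalized text: no embedding call needed.
            return self.by_hash[key], "direct"
        vector = self.embed(normalize(query))
        best_score, best_response = 0.0, None
        for stored, response in self.by_vector:
            score = cosine_similarity(vector, stored)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return best_response, "semantic"
        return None, "miss"  # caller would now call the provider, then store()

    def store(self, query: str, response: str) -> None:
        self.by_hash[self._hash(query)] = response
        self.by_vector.append((self.embed(normalize(query)), response))


# Toy embedding: word counts over a tiny fixed vocabulary (illustration only).
VOCAB = ["reset", "password", "forgot", "invoice", "download"]


def toy_embed(text: str) -> Vector:
    words = text.split()
    return [float(words.count(term)) for term in VOCAB]


cache = TwoTierCache(toy_embed, threshold=0.7)
cache.store("How do I reset my password?", "Use the reset link on the login page.")
print(cache.lookup("how do i reset my password"))  # direct tier
print(cache.lookup("reset password please"))       # semantic tier
print(cache.lookup("download invoice"))            # miss
```

A real deployment would replace `toy_embed` with a call to an embedding model and the linear scan with a nearest-neighbor query; the control flow stays the same.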
Threshold trade-offs. A higher similarity threshold requires the incoming query to be very close to a cached query before a hit is returned. This minimizes the risk of returning a slightly wrong cached answer but reduces the cache hit rate. A lower threshold catches more paraphrases but increases the risk of returning a cached response that does not precisely address the incoming query. The right threshold depends on the application: for FAQ bots where questions cluster tightly, a lower threshold works well; for document analysis where queries vary more widely, a higher threshold prevents incorrect cache hits.
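One way to make this trade-off concrete is to replay a labeled sample of queries against candidate thresholds. The sketch below assumes you have, for each incoming query, the best-match similarity score and a human judgment of whether the cached answer would have been correct. The sample values are invented for illustration, and `ScoredPair` and `evaluate_threshold` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredPair:
    similarity: float  # best-match similarity for an incoming query
    same_intent: bool  # label: would the cached answer have been correct?


def evaluate_threshold(pairs: list[ScoredPair], threshold: float) -> tuple[float, float]:
    """Return (hit_rate, wrong_hit_rate) for a given similarity threshold.

    hit_rate: share of all queries that would be served from cache.
    wrong_hit_rate: share of those hits where the cached answer was not correct.
    """
    hits = [p for p in pairs if p.similarity >= threshold]
    wrong = [p for p in hits if not p.same_intent]
    hit_rate = len(hits) / len(pairs) if pairs else 0.0
    wrong_hit_rate = len(wrong) / len(hits) if hits else 0.0
    return hit_rate, wrong_hit_rate


# Invented sample: eight queries with labeled outcomes.
SAMPLE = [
    ScoredPair(0.97, True), ScoredPair(0.93, True), ScoredPair(0.91, True),
    ScoredPair(0.88, False), ScoredPair(0.86, True), ScoredPair(0.82, False),
    ScoredPair(0.78, True), ScoredPair(0.70, False),
]

for threshold in (0.80, 0.85, 0.90):
    hit_rate, wrong_rate = evaluate_threshold(SAMPLE, threshold)
    print(f"threshold={threshold:.2f} hit_rate={hit_rate:.3f} wrong_hit_rate={wrong_rate:.3f}")
```

On this sample, raising the threshold from 0.80 to 0.90 halves the hit rate but removes every incorrect hit, which is the shape of the trade-off described above.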
TTL configuration. Cache entries expire after a configured time-to-live. TTL should reflect how quickly the underlying data or LLM behavior changes. For static FAQ content, a long TTL (days or weeks) is appropriate. For queries against frequently updated data, a short TTL prevents serving stale cached responses. Bifrost's semantic caching configuration allows TTL to be set per cache key.
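The expiry behavior can be illustrated with a small in-memory cache that accepts a default TTL and an optional per-entry override. This is a generic sketch, not Bifrost's configuration surface: `TTLCache` and its parameters are hypothetical, and the clock is injectable only so the example can be exercised without waiting.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Hypothetical cache where each entry carries its own expiry time."""

    def __init__(
        self,
        default_ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._default_ttl = default_ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}

    def set(self, key: str, value: str, ttl_seconds: Optional[float] = None) -> None:
        ttl = self._default_ttl if ttl_seconds is None else ttl_seconds
        self._entries[key] = (value, self._clock() + ttl)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: evict lazily and report a miss.
            del self._entries[key]
            return None
        return value


# Fake clock so expiry can be demonstrated deterministically.
now = [0.0]
ttl_cache = TTLCache(default_ttl_seconds=60, clock=lambda: now[0])
ttl_cache.set("faq:reset-password", "Use the reset link.", ttl_seconds=7 * 24 * 3600)
ttl_cache.set("price:widget", "The widget costs 10 credits.")  # default 60 s TTL

now[0] = 61.0  # advance the fake clock past the short TTL
print(ttl_cache.get("price:widget"))        # expired entry
print(ttl_cache.get("faq:reset-password"))  # still cached
```

The static FAQ entry gets a week-long TTL while the frequently changing entry falls back to the short default, mirroring the guidance above.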
Which Workloads See the Highest Cache Hit Rates
Not all workloads benefit equally from semantic caching. The following categories typically produce the highest cache hit rates:
Support bots and FAQ agents. Users asking for help with a product or service ask the same questions in different words: "how do I reset my password," "I forgot my password," "can't log in, need to reset." Semantic caching collapses these into a single cached response with minimal threshold adjustment needed, because the intent is identical.
Document analysis pipelines. When the same documents are analyzed repeatedly with similar prompts (for example, a contract review workflow that re-runs the same extraction prompt, or a compliance check that runs the same policy questions against a recurring set of documents), requests repeat closely. A cached analysis is only correct for the document it was produced from, so the document content has to be part of what the cache matches on; with that in place, repeated runs of the same prompt over the same document are highly cacheable.
Code review workflows. Automated code review tools generate similar prompts for common code patterns: "review this function for security vulnerabilities," "check this SQL query for injection risks." The prompt structure is nearly identical across different files of the same type. Semantic caching serves cached reviews for patterns that are functionally equivalent.
RAG-based Q&A systems. In a retrieval-augmented generation system, users query the same knowledge base with paraphrased versions of the same questions. The retrieved context and the underlying question are often semantically close across queries, producing strong cache hit rates when similarity thresholds are calibrated to the knowledge base vocabulary.
Summarization services. When multiple users request summaries of the same source material, each request is semantically identical regardless of how the summary request is phrased. Gateway caching serves the same summary to all users asking about the same content without repeated provider calls.
How Bifrost Implements Semantic Caching
Bifrost's semantic caching operates in two modes: direct (hash-based exact match) and semantic (embedding-based similarity). Both modes run against the same vector store backend; direct mode uses hash lookups while semantic mode uses vector similarity search. The two modes can run together (direct first, semantic on direct miss) or independently.
Supported vector store backends are Redis/Valkey (recommended for direct-only mode), Weaviate, Qdrant, and Pinecone. The vector store must be configured before semantic caching can be enabled. Embedding providers are configured separately; any embedding-capable provider in Bifrost's provider list can generate the vectors used for similarity lookup.
The cross-provider cache behavior is a consequence of how the cache key is structured: the cache key is based on the query content, not on the provider or model. A response cached from an OpenAI call is retrievable by a request that would have been routed to Anthropic, as long as the query similarity exceeds the threshold.
Virtual key governance integrates with caching at the per-consumer level. Cache keys can be scoped to a specific virtual key, allowing different consumers to have isolated cache spaces. This is useful when different teams or customers need cache isolation for privacy or correctness reasons, while still benefiting from the shared gateway caching infrastructure.
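The two properties described above (provider-independent keys and optional per-consumer scoping) can be sketched for the direct, hash-based mode as follows. The key layout here is an assumption made for illustration, not Bifrost's actual key format, and `cache_key` is a hypothetical helper. For semantic mode, the equivalent would be filtering the vector search by scope rather than hashing it into the key.

```python
import hashlib
from typing import Optional


def cache_key(query: str, scope: Optional[str] = None) -> str:
    """Build a direct-lookup key from query content and an optional scope.

    Provider and model are deliberately absent from the key material, so the
    same query maps to the same entry whichever provider would have served it.
    """
    normalized = " ".join(query.lower().split())
    # Distinct prefixes keep the shared space and scoped spaces from colliding.
    namespace = "shared" if scope is None else f"scoped:{scope}"
    material = f"{namespace}\x00{normalized}"
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


shared = cache_key("What is our refund policy?")
team_a = cache_key("What is our refund policy?", scope="vk-team-a")
team_b = cache_key("What is our refund policy?", scope="vk-team-b")
print(shared != team_a, team_a != team_b)
```

With a scope supplied, two consumers asking the same question get separate cache entries; without one, they share a single entry.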
For MCP agentic workloads, Code Mode provides a complementary cost-reduction mechanism: rather than serving cached responses, it reduces the number of tokens consumed per request by replacing large tool catalogs with four meta-tools and a Python execution sandbox. At 508 tools across 16 MCP servers, Code Mode reduces input tokens by 92.8%. The MCP gateway resource page covers this in detail.
Measuring Cost Reduction from Semantic Caching
The relevant metrics for quantifying semantic caching impact are:
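- Cache hit rate: the percentage of requests served from cache rather than from a provider. A 30% cache hit rate means roughly 30% of provider calls are avoided for the cached workload; the cost saving matches that figure only when cached and uncached requests have similar token counts, and it is offset slightly by the embedding calls made for semantic lookups.
- Tokens saved per cache hit: the number of input and output tokens that the provider call would have consumed. Multiplying these by the provider's per-token input and output rates gives the cost saved per hit.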
- API calls avoided: the total number of provider calls eliminated by caching. This translates directly to cost and latency savings.
- Cost per request with and without caching: comparing average cost before and after enabling caching provides the clearest picture of the financial impact.
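The arithmetic behind these metrics is simple enough to sketch. The example below assumes uniform average token counts across requests and, as a worst case, charges an embedding lookup on every request. The prices and volumes are placeholders rather than real provider rates, and `Workload` and `caching_summary` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    requests: int
    avg_input_tokens: float
    avg_output_tokens: float
    input_price_per_million: float   # currency units per 1M input tokens
    output_price_per_million: float  # currency units per 1M output tokens


def cost_per_request(w: Workload) -> float:
    """Average provider cost of one uncached request."""
    return (
        w.avg_input_tokens * w.input_price_per_million
        + w.avg_output_tokens * w.output_price_per_million
    ) / 1_000_000


def caching_summary(
    w: Workload, hit_rate: float, embedding_cost_per_lookup: float = 0.0
) -> dict[str, float]:
    """Estimate the four metrics for a given cache hit rate (0.0 to 1.0)."""
    per_request = cost_per_request(w)
    calls_avoided = round(w.requests * hit_rate)
    baseline_cost = w.requests * per_request
    provider_cost = (w.requests - calls_avoided) * per_request
    # Worst case: every request pays for one embedding call.
    embedding_cost = w.requests * embedding_cost_per_lookup
    cached_cost = provider_cost + embedding_cost
    return {
        "calls_avoided": calls_avoided,
        "tokens_saved": calls_avoided * (w.avg_input_tokens + w.avg_output_tokens),
        "cost_per_request_without_cache": per_request,
        "cost_per_request_with_cache": cached_cost / w.requests if w.requests else 0.0,
        "total_savings": baseline_cost - cached_cost,
    }


# Placeholder workload: 100,000 requests, 500 input and 200 output tokens each.
workload = Workload(100_000, 500, 200, 1.0, 4.0)
summary = caching_summary(workload, hit_rate=0.30)
for name, value in summary.items():
    print(f"{name}: {value:,.6f}")
```

Passing a non-zero `embedding_cost_per_lookup` shows how much of the saving is consumed by semantic lookups on a given workload.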
Bifrost's observability layer surfaces these metrics per virtual key, so teams can see caching effectiveness broken down by consumer, project, or application. Budget limits can be tracked alongside caching metrics to give a complete picture of AI spend per consumer.
Deploying Semantic Caching with Bifrost
The deployment sequence for semantic caching in Bifrost:
- Configure a vector store: add the vector store connection details (Redis/Valkey, Weaviate, Qdrant, or Pinecone) to the Bifrost configuration file.
- Enable caching: turn on the semantic cache plugin in the Bifrost settings.
- Set the similarity threshold: start with a threshold of 0.85-0.90 for most applications; adjust based on observed cache hit rates and response quality.
- Set TTL: configure the time-to-live for cache entries based on how quickly the underlying data changes.
- Deploy: the quickstart guide and provider configuration docs cover the full setup process.
Applications route through Bifrost without any code changes. Cache hits are transparent to the calling application: the response format is identical to a provider response, and latency is that of a cache lookup (plus an embedding call for semantic hits) rather than a provider round trip.
For teams evaluating the full scope of LLM cost optimization options, the LLM Gateway Buyer's Guide covers semantic caching alongside load balancing, virtual key governance, and provider fallback configurations.
Start Reducing LLM Costs Today
Semantic caching at the gateway layer is the most operationally efficient way to reduce LLM API costs: it requires no application code changes, applies across all providers, and surfaces centralized metrics for tracking cost reduction over time.
To see how Bifrost's semantic caching and the full gateway feature set apply to your AI infrastructure, book a demo with the Bifrost team, or explore the governance resource page for the full picture of cost control capabilities.