Optimizing Token Consumption: Semantic Caching and Dynamic Routing
Token consumption is the primary driver of LLM API costs at scale, and most production applications accumulate that consumption from three separate sources: repeated queries that could be served from cache, workloads routed to expensive frontier models when a smaller model would produce equivalent output, and agentic tool-use loops that generate more tokens per interaction than necessary. Bifrost, an open-source AI gateway built in Go that routes traffic across 1,000+ models and 20+ providers, provides three mechanisms, one aimed at each of these sources. This article explains how each mechanism works, how to configure it, and how to sequence adoption in a production environment.
The Three Sources of Unnecessary Token Consumption
Token inefficiency in production AI systems almost always falls into one of three categories, and identifying which category your system suffers from most is the first step toward reducing consumption.
Repeated and paraphrased queries are the most common source. Users asking a support chatbot "how do I reset my password?" and "what are the steps to change my password?" are expressing the same intent with different phrasing. Without semantic caching, each variant triggers a fresh inference call, consuming tokens proportional to the prompt length plus the generated response. In high-traffic systems, a small set of semantically similar intents can account for a large share of total inference calls.
Workload-model mismatch is the second major source. Teams often default to a single frontier model (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) for all tasks regardless of complexity. Batch summarization jobs, classification tasks, and structured extraction workloads rarely require frontier-model capability. Routing these to GPT-4o-mini or Claude 3 Haiku can often produce comparable results at a fraction of the per-token price.
Tool-use token overhead in agentic workflows is the third source, and it compounds quickly. Each MCP tool call in an agentic loop carries context representation overhead. When an agent makes dozens of sequential tool calls per session, the per-call overhead accumulates into a significant fraction of total session token consumption. This is qualitatively different from the first two sources because it is structural: it occurs even when each individual query is unique and correctly routed.
Semantic Caching: Eliminating Repeated Inference
Semantic caching works by converting incoming prompts into vector embeddings and comparing those embeddings against a cache of previously answered queries using cosine similarity. This is fundamentally different from exact-match caching, which only returns a cached result when the input string matches exactly. Cosine similarity measures the angle between two vectors in embedding space: a score of 1.0 means the vectors point in the same direction (their lengths may differ), while scores closer to 0 indicate unrelated meaning.
The practical consequence is that paraphrased queries (for example, "how do I reset my password?" and "what are the steps to change my password?") produce embeddings that sit close together in vector space and therefore score a high cosine similarity, typically 0.90 or above depending on the embedding model. Exact-match caching would miss this entirely and trigger two separate inference calls.
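To make the similarity score concrete, here is a minimal sketch of the cosine similarity calculation in plain Python. The three-dimensional vectors are made-up stand-ins for real embeddings, which usually have hundreds or thousands of dimensions; the variable names are illustrative and not part of any Bifrost API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, chosen only to illustrate the idea.
reset_password = [0.9, 0.1, 0.2]
change_password = [0.8, 0.2, 0.3]
refund_policy = [0.1, 0.9, 0.1]

paraphrase_score = cosine_similarity(reset_password, change_password)  # roughly 0.98
unrelated_score = cosine_similarity(reset_password, refund_policy)     # roughly 0.24
```

With these toy values the paraphrase pair clears a 0.95 threshold while the unrelated pair falls far below it.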
Configuring the similarity threshold is the key operational decision. A threshold of 0.95 is conservative: it will only serve a cached result when the incoming query is very close semantically to a cached query, which minimizes the risk of returning an incorrect response to a subtly different question. A threshold of 0.80 is aggressive: it catches more paraphrases but increases the probability of a false positive, where a query that is merely topically related (but not semantically equivalent) gets served a cached response.
For most production deployments, the recommended approach is to start at 0.95 and lower the threshold incrementally, watching both the measured cache hit rate and the rate of incorrect cached responses. TTL (time-to-live) configuration is equally important: caching responses to queries about dynamic data (current inventory, live pricing, real-time status) requires shorter TTLs than caching responses to stable reference content.
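The interaction between threshold and TTL can be sketched with a small in-memory cache. This is an illustrative model only, not Bifrost's implementation: the `SemanticCache` class, its linear scan, and the explicit `now` argument (used here so expiry is easy to reason about) are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


@dataclass
class Entry:
    embedding: list[float]
    response: str
    expires_at: float


class SemanticCache:
    """Hypothetical cache: serves a stored response when similarity >= threshold."""

    def __init__(self, threshold: float = 0.95, ttl_seconds: float = 3600.0):
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.entries: list[Entry] = []

    def put(self, embedding: list[float], response: str, now: float) -> None:
        self.entries.append(Entry(embedding, response, now + self.ttl_seconds))

    def get(self, embedding: list[float], now: float) -> Optional[str]:
        # Drop expired entries first, then look for the closest remaining match.
        self.entries = [e for e in self.entries if e.expires_at > now]
        best = max(self.entries, key=lambda e: cosine(embedding, e.embedding), default=None)
        if best is not None and cosine(embedding, best.embedding) >= self.threshold:
            return best.response
        return None  # cache miss: the caller would run inference and then put()


cache = SemanticCache(threshold=0.95, ttl_seconds=60.0)
cache.put([0.9, 0.1, 0.2], "Use the reset link on the sign-in page.", now=0.0)
```

A real deployment would use a vector index instead of a linear scan, but the two decisions the text describes (how close is close enough, and how long an answer stays valid) are the same.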
One advantage of placing semantic caching inside the gateway rather than at the application layer is cross-provider cache benefit. A cached response generated by Claude 3.5 Sonnet can be returned for a semantically similar query that would otherwise route to GPT-4o. The cache operates on semantic intent, not on which provider or model was used for the original inference.
Dynamic Routing: Matching Workloads to the Right Model
Routing rules in Bifrost let you define structured conditions that direct requests to specific providers or models based on metadata, time of day, request properties, or virtual key identity. For token optimization, routing rules serve one primary purpose: ensuring that expensive frontier models are reserved for workloads that actually require their capability.
A concrete routing configuration for a team running both interactive chat (requiring low latency and high reasoning quality) and batch document summarization (where latency is irrelevant and quality on well-structured text is similar across model tiers) would look like this:
{
  "routing_rules": [
    {
      "condition": {
        "metadata_key": "workload_type",
        "metadata_value": "batch_summarization"
      },
      "target": {
        "provider": "openai",
        "model": "gpt-4o-mini"
      }
    },
    {
      "condition": {
        "metadata_key": "workload_type",
        "metadata_value": "interactive_chat"
      },
      "target": {
        "provider": "anthropic",
        "model": "claude-3-5-sonnet-20241022"
      }
    }
  ]
}
The application sets workload_type as metadata on each request. Bifrost evaluates the routing rules in order and directs the request to the matching provider-model pair. The provider routing layer handles the actual dispatch across Bifrost's 20+ supported providers.
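The in-order evaluation described above amounts to a first-match lookup. The sketch below mirrors the rule shape from the configuration in this article; the `resolve_target` function and its `default` parameter are hypothetical and only illustrate the matching logic, not Bifrost's internal code.

```python
from typing import Optional

ROUTING_RULES = [
    {
        "condition": {"metadata_key": "workload_type", "metadata_value": "batch_summarization"},
        "target": {"provider": "openai", "model": "gpt-4o-mini"},
    },
    {
        "condition": {"metadata_key": "workload_type", "metadata_value": "interactive_chat"},
        "target": {"provider": "anthropic", "model": "claude-3-5-sonnet-20241022"},
    },
]


def resolve_target(metadata: dict, rules: list, default: Optional[dict] = None) -> Optional[dict]:
    """Return the target of the first rule whose condition matches the request metadata."""
    for rule in rules:
        condition = rule["condition"]
        if metadata.get(condition["metadata_key"]) == condition["metadata_value"]:
            return rule["target"]
    return default  # no rule matched: fall through to whatever default the caller supplies


batch_target = resolve_target({"workload_type": "batch_summarization"}, ROUTING_RULES)
```

Because the first match wins, rule order matters whenever two conditions could apply to the same request.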
Time-based routing is another practical pattern: directing requests during off-peak hours to providers with lower pricing tiers, or to models that have more available capacity. This does not require application-side changes once the routing rules are defined in the gateway configuration.
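As a rough illustration of the time-based pattern, the sketch below picks a target from the request time. The 22:00 to 06:00 off-peak window and the two targets are invented for the example; how a time condition is actually written in Bifrost's routing rule schema is not shown in this article.

```python
from datetime import datetime, time

# Hypothetical off-peak window that wraps past midnight.
OFF_PEAK_START = time(22, 0)
OFF_PEAK_END = time(6, 0)


def is_off_peak(t: time) -> bool:
    # The window spans midnight, so a time qualifies if it is late evening OR early morning.
    return t >= OFF_PEAK_START or t < OFF_PEAK_END


def pick_target(now: datetime) -> dict:
    """Choose an illustrative provider-model pair based on time of day."""
    if is_off_peak(now.time()):
        return {"provider": "openai", "model": "gpt-4o-mini"}
    return {"provider": "anthropic", "model": "claude-3-5-sonnet-20241022"}
```

The main detail to get right is a window that crosses midnight, which needs an "or" rather than an "and" when comparing against the start and end times.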
Per-virtual-key routing is particularly useful in multi-tenant systems. Each tenant or internal team gets a virtual key, and routing rules can be scoped to that key so that one team's batch jobs do not compete with another team's interactive workloads for the same model endpoint. This also makes token usage attribution straightforward: the observability layer tracks consumption per virtual key, so you can see exactly which consumers are driving which share of token spend.
Automatic fallbacks complement routing rules by handling provider availability issues without requiring manual intervention. If a target provider is unavailable or returns an error, Bifrost falls back to the next configured provider, preserving the workload-to-model intent of the original routing rule where possible.
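The fallback behavior can be pictured as walking an ordered list of targets until one succeeds. This is a simplified sketch under stated assumptions: `ProviderError`, `call_with_fallbacks`, and the `send` callable are hypothetical names, and a real gateway would also distinguish retryable from non-retryable errors.

```python
from typing import Callable


class ProviderError(Exception):
    """Raised by the (hypothetical) send function when a provider call fails."""


def call_with_fallbacks(targets: list[dict], send: Callable[[dict], str]) -> tuple[dict, str]:
    """Try each target in order; return the first (target, response) that succeeds."""
    last_error = None
    for target in targets:
        try:
            return target, send(target)
        except ProviderError as err:
            last_error = err  # remember the failure and move on to the next target
    raise ProviderError(f"all {len(targets)} targets failed") from last_error


def flaky_send(target: dict) -> str:
    # Stand-in for a provider call: the first provider is simulated as unavailable.
    if target["provider"] == "openai":
        raise ProviderError("simulated outage")
    return f"response from {target['model']}"


chain = [
    {"provider": "openai", "model": "gpt-4o-mini"},
    {"provider": "anthropic", "model": "claude-3-haiku"},
]
used_target, response = call_with_fallbacks(chain, flaky_send)
```

Ordering the chain so that each fallback sits in the same price tier as the primary target is one way to preserve the cost intent of the original rule.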
Code Mode: Reducing Per-Tool-Call Token Overhead in Agentic Workloads
Code Mode is a distinct token optimization mechanism that applies specifically to MCP-based agent workflows. It reduces token consumption at the individual tool-call level by compressing the context representation of each tool interaction, resulting in 50% fewer tokens and 40% lower latency per tool-use interaction.
The mechanism matters because agentic workloads have a different cost structure than single-turn inference. A coding agent or a research agent might make 30 to 50 sequential tool calls in a single session. If each tool call carries 500 tokens of overhead in its context representation, that is 15,000 to 25,000 tokens of overhead per session before any actual reasoning or output generation. Code Mode targets this overhead directly.
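The arithmetic above is easy to reproduce. The sketch below uses the article's illustrative figure of 500 overhead tokens per call; applying a 50% reduction directly to that overhead is an assumption made for the example, since the quoted figure describes the tool-use interaction as a whole.

```python
def session_overhead_tokens(tool_calls: int, overhead_per_call: int, reduction: float = 0.0) -> int:
    """Total tool-call overhead for one session, optionally reduced by a fraction."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return round(tool_calls * overhead_per_call * (1.0 - reduction))


baseline_low = session_overhead_tokens(30, 500)    # 15000
baseline_high = session_overhead_tokens(50, 500)   # 25000
reduced_low = session_overhead_tokens(30, 500, reduction=0.5)   # 7500
reduced_high = session_overhead_tokens(50, 500, reduction=0.5)  # 12500
```

Multiplying the per-session figure by daily session volume gives a quick estimate of how much of a token budget is structural overhead rather than reasoning or output.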
For teams running Bifrost's MCP gateway, Code Mode is applied at the gateway layer. The application does not need to change its tool call format; the gateway handles the compression before the request reaches the provider. The MCP resource page covers the full architecture, and the MCP blog post includes detailed measurements of token reduction at scale.
For teams where agentic workflows are a significant fraction of total token consumption, Code Mode can be the single highest-impact optimization available, because it compounds across every tool call in every session.
Combining All Three Mechanisms
Semantic caching, routing rules, and Code Mode address different parts of the token consumption problem and can be applied simultaneously without conflicts.
Semantic caching operates before routing: if a request matches a cached result above the similarity threshold, it never reaches the routing layer or the provider. This means caching reduces the total number of requests that consume tokens at all.
Routing rules operate between the gateway and the provider: for requests that do reach a provider (because they were not cache hits), routing rules ensure the cheapest appropriate model handles the request. This reduces the per-token cost of inference for the requests that do go out.
Code Mode operates at the tool-call level within individual requests: for agentic workloads that are making tool calls, Code Mode reduces the token overhead of each call regardless of which provider or model was selected by the routing rules. The three mechanisms target different points in the request lifecycle, so their benefits stack rather than compete: each one applies to whatever consumption the earlier stages left in place.
A team using all three in combination sees cache hits eliminate a fraction of all requests, routing rules reduce the cost of remaining requests, and Code Mode reduce the overhead of agentic sessions among those remaining requests.
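One rough way to reason about the combined effect is a back-of-the-envelope model in which each mechanism scales the cost left over by the previous stage. Everything here is a simplifying assumption: the function name, the parameters, the input values, and the premise that routing share and agentic share are independent of each other.

```python
def relative_cost(
    cache_hit_rate: float,
    routed_share: float,
    cheap_price_ratio: float,
    agentic_share: float,
    tool_token_reduction: float,
) -> float:
    """Estimated spend as a fraction of the unoptimized baseline (1.0 = no savings)."""
    after_cache = 1.0 - cache_hit_rate                                   # requests that still reach a provider
    after_routing = 1.0 - routed_share + routed_share * cheap_price_ratio  # blended price per token
    after_code_mode = 1.0 - agentic_share * tool_token_reduction         # tokens left after overhead reduction
    return after_cache * after_routing * after_code_mode


# Hypothetical inputs: 20% cache hits, half of traffic routed to a model at 10% of
# the frontier price, 30% of tokens in agentic sessions with a 50% reduction.
estimate = relative_cost(0.20, 0.50, 0.10, 0.30, 0.50)  # about 0.374
```

Under these made-up inputs the estimate lands at roughly 37% of baseline spend. The value of the model is less the number itself than the reminder that each stage only acts on what the earlier stages did not already remove.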
Where to Start: A Token Optimization Sequence
For teams new to systematic token optimization, a sequenced adoption approach reduces risk and makes it easier to attribute improvements to specific changes.
Step 1: Enable semantic caching at a conservative threshold. Start with a cosine similarity threshold of 0.95 and monitor cache hit rate over one week. A hit rate below 5% suggests queries are highly diverse; a hit rate above 30% suggests significant opportunity. Enable observability at the same time so you have a baseline for total token consumption before any other changes.
Step 2: Identify workloads by token volume and add routing rules. Review per-virtual-key token usage in the observability dashboard to find which consumers are spending the most tokens. For each high-volume consumer, evaluate whether the workload genuinely requires a frontier model or could route to a cheaper alternative. Add routing rules for workloads where a smaller model is appropriate.
Step 3: Enable Code Mode if running MCP-based agents. If any of your high-volume consumers are agentic workflows using MCP tools, enable Code Mode and measure per-session token counts before and after. The reduction is typically measurable within a single day of traffic.
Step 4: Tune the caching threshold. After one to two weeks of data, adjust the similarity threshold based on observed cache hit rate and any user-reported accuracy issues. If users are occasionally receiving cached responses that are slightly off-topic, raise the threshold. If hit rate is low and your queries are genuinely paraphrase-heavy, lower it incrementally.
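The tuning guidance in Step 4 can be written down as a simple decision rule. This is a hypothetical helper, not a Bifrost feature: the 0.01 step, the 0.80 and 0.99 bounds, and the use of the 5% hit rate from Step 1 as the "low" cutoff are assumptions, and deciding whether traffic is genuinely paraphrase-heavy is left to the operator.

```python
def next_threshold(
    current: float,
    hit_rate: float,
    false_positive_reports: int,
    step: float = 0.01,
    floor: float = 0.80,
    ceiling: float = 0.99,
    low_hit_rate: float = 0.05,
) -> float:
    """Suggest the next similarity threshold from one review period of data."""
    if false_positive_reports > 0:
        # Accuracy problems take priority: tighten the match requirement.
        return min(ceiling, round(current + step, 2))
    if hit_rate < low_hit_rate:
        # Few hits and no complaints: loosen slightly, but never below the floor.
        return max(floor, round(current - step, 2))
    return current
```

Checking accuracy reports before hit rate encodes the priority in the text: a wrong cached answer costs more than a missed cache hit.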
Step 5: Review load balancing and provider key configuration. Confirm that routing rules have adequate capacity at their target providers and that no single provider key is a bottleneck.
Measuring Token Optimization Results
Bifrost's observability layer provides the metrics needed to evaluate each optimization mechanism independently.
For semantic caching: cache hit rate (the percentage of requests served from cache), tokens saved by caching (the estimated inference tokens avoided), and cache miss latency (to confirm that cache lookups are not adding meaningful overhead to uncached requests).
For routing rules: model distribution across requests (what share of requests are routing to each model tier), per-virtual-key token usage trends over time, and provider error rates (to catch cases where routing rules are sending traffic to providers with availability issues).
For Code Mode: per-session token counts in agentic workflows, tool-call count per session, and latency per tool call before and after enabling Code Mode. The benchmarks resource page provides reference measurements for Bifrost's overhead at various traffic levels, including the 11-microsecond per-request overhead at 5,000 requests per second that establishes a baseline for gateway-introduced latency.
Reviewing these metrics together gives a complete picture of where token consumption is occurring and which mechanisms are contributing the most to reduction.
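If request logs are exported from the gateway, the headline metrics above can be recomputed offline. The record fields used below (`cache_hit`, `model`, `virtual_key`, `tokens`) are a hypothetical log shape chosen for the example, not Bifrost's actual export format.

```python
from collections import Counter, defaultdict


def summarize(log: list[dict]) -> dict:
    """Compute cache hit rate, model distribution, and per-virtual-key token usage."""
    total = len(log)
    hits = sum(1 for record in log if record["cache_hit"])
    misses = [record for record in log if not record["cache_hit"]]

    model_counts = Counter(record["model"] for record in misses)
    tokens_by_key: dict = defaultdict(int)
    for record in misses:
        # Only cache misses consume provider tokens in this simplified model.
        tokens_by_key[record["virtual_key"]] += record["tokens"]

    return {
        "cache_hit_rate": hits / total if total else 0.0,
        "model_distribution": dict(model_counts),
        "tokens_by_virtual_key": dict(tokens_by_key),
    }


sample_log = [
    {"cache_hit": True, "model": "gpt-4o-mini", "virtual_key": "team-a", "tokens": 0},
    {"cache_hit": False, "model": "gpt-4o-mini", "virtual_key": "team-a", "tokens": 800},
    {"cache_hit": False, "model": "claude-3-5-sonnet-20241022", "virtual_key": "team-b", "tokens": 1500},
    {"cache_hit": False, "model": "gpt-4o-mini", "virtual_key": "team-b", "tokens": 700},
]
summary = summarize(sample_log)
```

Running the same summary before and after each change in the adoption sequence makes it easier to attribute a drop in token usage to a specific mechanism.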
Start Optimizing Token Consumption with Bifrost
Token consumption in production AI systems is reducible through mechanisms that address its three structural sources: semantic caching for repeated inference, routing rules for workload-model mismatch, and Code Mode for agentic tool-call overhead. Each mechanism is independently deployable and provides measurable results within days of adoption.
Bifrost is open source and can be deployed in front of any combination of providers without changes to application code. For teams managing token consumption at enterprise scale, the enterprise page covers dedicated support and custom deployment options. To discuss your specific token optimization requirements, book a demo.