Managing LLM Spend Across Providers With an AI Gateway
Enterprises now run AI on multiple model providers at once, and the bill arrives as separate invoices from OpenAI, Anthropic, Google, and Amazon Bedrock with no shared view of what each team, feature, or customer actually consumed. Bifrost, the open-source AI gateway built in Go by Maxim AI, sits between applications and every provider so that managing LLM spend across providers happens at one layer, with budgets, rate limits, and usage tracking enforced before a request ever reaches a provider. The FinOps Foundation's State of FinOps 2026 report, drawn from practitioners responsible for more than $83 billion in cloud spend, found that 98% of respondents now manage AI spend, up from 63% a year earlier and 31% the year before. This post covers why multi-provider spend is hard to control and how an AI gateway approach makes it governable.
Why Managing LLM Spend Across Providers Is Hard
Managing LLM spend across providers is hard because token-priced APIs do not fit the cost controls that existing cloud tooling was built for. Each provider bills independently, usage is metered per token rather than per instance, and spend cannot be attributed to a team or customer without instrumentation at the request layer.
The structural problems most teams run into:
- Fragmented billing: OpenAI, Anthropic, Google Vertex AI, and AWS Bedrock each produce a separate invoice, so there is no single number for total AI spend.
- No cost attribution: Provider dashboards report spend by API key, not by the team, feature, or customer that generated it. A 2025 analysis from Menlo Ventures found that 78% of companies now use two or more LLM families, which multiplies the number of bills to reconcile.
- No pre-spend enforcement: Provider rate limits prevent overload, not overspend. A misconfigured job or a runaway agent can exhaust a monthly budget before anyone reviews a dashboard.
- Redundant inference: Repeated and near-identical prompts are re-sent to providers and paid for again, even when the answer has not changed.
Spend pressure is rising alongside multi-provider adoption. Flexera's 2026 State of the Cloud Report reported that estimated cloud waste rose to 29%, driven in part by AI workloads. Controlling that waste requires a control point that sees every request across every provider.
The AI Gateway Approach to Cost Control
An AI gateway is a unified entry point that routes, authenticates, observes, and governs traffic to multiple LLM providers from a single API. For cost control, the gateway is the one place that sees every request before it reaches a provider, which makes it the natural enforcement point for budgets, rate limits, and usage tracking.
Bifrost unifies access to 1000+ models through a single OpenAI-compatible API, and adopting it requires changing only the base URL in existing code through its drop-in replacement support for the OpenAI, Anthropic, and other provider SDKs. Once traffic flows through one gateway, spend management moves from four provider dashboards to a single layer that applies the same policies to every request, regardless of which provider serves it. For teams evaluating this model, the LLM Gateway Buyer's Guide breaks down the capabilities to compare.
The gateway approach changes cost management in three ways:
- Centralized enforcement: Budgets and rate limits are applied at the gateway, so spend is capped before a provider is called rather than reconciled after the invoice.
- Unified attribution: Every request carries an identity, so spend can be reported per team, per customer, or per project across all providers at once.
- Spend reduction at the request layer: Caching, routing, and token-efficient tool execution remove cost from the request path without changing application logic.
How Bifrost Governs Spend With Hierarchical Budgets
Bifrost governs spend through hierarchical budgets that map to how organizations are structured. Budgets are set independently at the customer, team, virtual key, and provider-config levels, and every applicable budget in the hierarchy is checked before a request proceeds.
Virtual keys are the primary governance entity. Each virtual key carries its own access permissions, budget, and rate limits, and can be attached to a team or a customer. The budget hierarchy works as follows:
- Customer budget: Organization-wide or business-unit cost ceiling, the highest level in the hierarchy.
- Team budget: Department-level allocation, separate from the customer budget.
- Virtual key budget: Per-project or per-application limit, checked alongside any team and customer budgets attached to it.
- Provider-config budget: A granular cap per provider within a single virtual key.
When a request arrives with a virtual key, Bifrost checks all applicable budgets independently, and any single budget without sufficient remaining balance blocks the request with a budget-exceeded error. Budgets reset on a configurable duration (minute, hour, day, week, month, or year), and can align to calendar boundaries. This governance model means a runaway agent under one team cannot consume another team's allocation, and no virtual key can spend past the organization-wide ceiling.
Cost figures are not estimated after the fact. Bifrost calculates the cost of every request from real-time provider pricing and the input and output tokens returned by the provider, so budget checks reflect actual spend per model.
What is the difference between budgets and rate limits in Bifrost?
Budgets cap spend in dollars over a reset period, while rate limits cap throughput in tokens or requests over a reset period. Budgets live at the customer, team, virtual key, and provider-config levels; rate limits apply at the virtual key and provider-config levels only. A request must pass both checks to proceed.
How does Bifrost attribute spend to a team or customer?
Spend is attributed through the virtual key on each request. Because a virtual key is attached to a team or customer, every request it makes is counted against that entity's budget and usage, giving per-team and per-customer spend reporting across all providers without parsing four separate provider invoices.
Reducing Spend With Routing and Semantic Caching
Beyond enforcing limits, Bifrost reduces the spend itself through provider routing and caching at the request layer. These mechanisms remove cost from the request path without requiring application changes.
Provider routing lets teams direct traffic to specific models and providers through governance rules configured on virtual keys. A virtual key can restrict an application to lower-cost models for routine work while reserving higher-cost models for tasks that require them. Providers that have exceeded their budget or rate limits are excluded from routing automatically, so spend stays within configured ceilings even as traffic shifts between providers.
Semantic caching cuts spend on repeated work by serving a stored response instead of paying for a new provider call. Bifrost offers two complementary lookup paths:
- Direct (hash) matching: Deterministic, exact-match replay of an identical request, with no embedding cost.
- Semantic (similarity) matching: Embedding-based lookup that serves a cached answer when a new request is close enough to a previous one, even if the wording differs.
Both paths replay the answer rather than re-invoking the provider, which removes the paid LLM call for repeated or near-identical prompts. Caching is scoped per cache key, so a response cached for one tenant or session is never served to another.
For teams running large MCP deployments, Code Mode reduces token spend by changing how tools are exposed to the model. Instead of including every tool definition in every request, Code Mode exposes four generic tools and lets the model write code to orchestrate the rest in a sandbox. Bifrost's published benchmarks show that this approach reduced input tokens by up to 92.8% and estimated cost by up to 92.2% across large MCP deployments. The methodology and per-round results are documented in the MCP gateway cost-governance writeup.
Tracking LLM Spend With Built-In Observability
Bifrost tracks LLM spend through built-in observability that captures tokens, cost, and latency for every request across every provider, without changes to application code. The logging operates asynchronously, so it adds no latency to the request path.
Each request log records:
- Token usage: Input and output tokens for the request and response.
- Cost: The calculated cost of the request, derived from real-time provider pricing.
- Provider and model: Which provider and model served the request.
- Cache status: Whether the response was served from cache, including hit type and similarity score.
- Latency and status: Request duration and success or error details.
Because every request is logged with cost and identity, spend can be analyzed by provider, model, team, or cache outcome from one place. For deeper integration, Bifrost exposes native Prometheus metrics and OpenTelemetry traces, so AI spend data feeds the same Grafana, Datadog, or New Relic dashboards a platform team already runs. This connects LLM cost data to existing FinOps and observability workflows rather than isolating it in a separate tool.
Spend Governance at Enterprise Scale
Bifrost is built for enterprises running mission-critical AI workloads, and its spend-governance model extends to the access control, compliance, and deployment requirements that regulated organizations carry. Cost controls do not stand alone; they sit inside a broader governance layer.
Enterprise capabilities relevant to spend governance include:
- Role-based access control: Fine-grained permissions over who can create virtual keys, set budgets, and view spend, documented in the RBAC docs.
- Audit logs: Immutable trails for budget changes and access, supporting SOC 2, GDPR, HIPAA, and ISO 27001 requirements through audit logging.
- In-VPC and on-prem deployment: For organizations that require AI traffic to stay inside their own infrastructure, covered on the Bifrost Enterprise page.
This combination matters because spend governance and access governance are the same problem viewed from two angles: both depend on a single control point that sees and authorizes every request. Routing all provider traffic through one gateway gives finance and platform teams a shared, enforceable view of AI spend, and gives security teams the access boundaries they need at the same layer. The Bifrost resources hub collects the buyer's guide, governance, and benchmark material for teams scoping a rollout.
Start Managing LLM Spend Across Providers With Bifrost
Managing LLM spend across providers is a control-point problem: without a single layer that sees every request, budgets are reconciled after the invoice and spend cannot be attributed to the team or customer that generated it. Bifrost, the open-source AI gateway by Maxim AI, makes spend governable by enforcing hierarchical budgets and rate limits before requests reach a provider, reducing cost through routing and semantic caching, and tracking tokens and cost for every request in one place. To see how Bifrost can centralize and control your AI spend across providers, book a demo with the Bifrost team.