
See Every Request, Token, and Cost: Bifrost AI Gateway

TLDR: Production AI systems running across multiple providers accumulate costs that are nearly impossible to attribute without a centralized gateway. Bifrost provides real-time visibility into every request, token count, cost, and error across all providers and consumers, so engineering and finance teams always know where AI spend is going.

When organizations run AI workloads across OpenAI, Anthropic, AWS Bedrock, and Google Vertex simultaneously, understanding total spend, per-consumer usage, and provider error rates requires data from every request passing through the system. Without a centralized gateway, teams fall back on per-provider billing dashboards that cannot be correlated across vendors or attributed to individual teams and applications. Bifrost, an open-source AI gateway built in Go, solves this by routing all LLM traffic through a single layer that captures structured telemetry for every call, regardless of which provider handles it.

What AI Gateway Observability Covers

AI gateway observability is visibility into the full lifecycle of every LLM request passing through the gateway: the provider called, the model used, the token count, the latency, the cost, the virtual key (consumer identity), and any errors or fallbacks triggered. It allows teams to attribute AI spend, detect anomalies, and trace failures without manually aggregating per-provider dashboards.

The Observability Gap in Direct Provider API Access

When application code calls provider APIs directly, each provider returns its own billing data in its own format on its own schedule. OpenAI usage appears in the OpenAI dashboard, Anthropic usage in the Anthropic console, Bedrock usage in AWS Cost Explorer. None of these sources share a common schema, so correlating a spike in total AI spend to a specific provider, model, or application requires manual cross-referencing that rarely happens in real time.

The attribution problem compounds when multiple teams share the same provider account. Without request-level tagging enforced at the gateway, there is no reliable way to determine which team consumed which tokens. Ad hoc solutions, like adding custom headers or logging request metadata in application code, are inconsistent across codebases and break whenever a new model or provider is added.

Error correlation is equally fragmented. A 503 from AWS Bedrock and a 429 from OpenAI both result in a failed request for the end user, but diagnosing the root cause means checking two separate dashboards with different retention windows and log formats. Real-time error rate visibility across all providers simply does not exist without a shared routing layer sitting in front of all provider traffic.

Bifrost's Built-In Observability Layer

Bifrost captures structured telemetry for every request at the gateway level, adding only 11 microseconds of overhead at 5,000 requests per second, so the observability layer does not become a bottleneck in high-throughput production systems.

Every request log includes the provider called, the model selected, the virtual key that authenticated the request, the prompt and completion token counts, the end-to-end latency, the calculated cost, and the response status. This gives a unified record across all providers in a consistent schema without any instrumentation in application code.
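As a rough illustration of what such a unified record might look like, the sketch below models one log entry and derives its cost from the token counts. The field names, the `PRICES_PER_MILLION` table, and the `example-provider` entry are assumptions made for this example, not Bifrost's actual schema or pricing data.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per one million tokens: (input, output).
# Placeholder values for illustration only, not real provider pricing.
PRICES_PER_MILLION = {
    ("example-provider", "example-model"): (2.00, 8.00),
}


@dataclass(frozen=True)
class RequestLog:
    """One gateway log entry; field names are assumed, not Bifrost's schema."""
    provider: str
    model: str
    virtual_key: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    cost_usd: float
    status: int


def calculate_cost(provider: str, model: str,
                   prompt_tokens: int, completion_tokens: int) -> float:
    """Price prompt and completion tokens separately, then sum."""
    input_price, output_price = PRICES_PER_MILLION[(provider, model)]
    return (prompt_tokens * input_price
            + completion_tokens * output_price) / 1_000_000


def build_log(provider: str, model: str, virtual_key: str,
              prompt_tokens: int, completion_tokens: int,
              latency_ms: float, status: int) -> RequestLog:
    """Assemble a record with the cost filled in from the token counts."""
    cost = calculate_cost(provider, model, prompt_tokens, completion_tokens)
    return RequestLog(provider, model, virtual_key, prompt_tokens,
                      completion_tokens, latency_ms, cost, status)
```

Because every provider's traffic lands in the same record shape, downstream aggregation never needs provider-specific parsing.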

Per-virtual-key cost and token usage is available from the moment a key is created. Because every request carries the virtual key identity, the dashboard can break down total spend, daily token consumption, and request volume by consumer without any post-processing. A team that owns a specific virtual key can see exactly what it has spent across all providers it is authorized to reach.
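The per-consumer breakdown amounts to a group-by on the virtual key. A minimal sketch, assuming each log entry is a dictionary with hypothetical `virtual_key`, `prompt_tokens`, `completion_tokens`, and `cost_usd` fields:

```python
from collections import defaultdict


def spend_by_virtual_key(logs):
    """Sum request count, tokens, and cost for each virtual key."""
    totals = defaultdict(lambda: {"requests": 0, "tokens": 0, "cost_usd": 0.0})
    for log in logs:
        entry = totals[log["virtual_key"]]
        entry["requests"] += 1
        entry["tokens"] += log["prompt_tokens"] + log["completion_tokens"]
        entry["cost_usd"] += log["cost_usd"]
    return dict(totals)


# Hypothetical sample entries spanning two providers.
sample_logs = [
    {"virtual_key": "team-search", "provider": "openai",
     "prompt_tokens": 1200, "completion_tokens": 300, "cost_usd": 0.25},
    {"virtual_key": "team-search", "provider": "anthropic",
     "prompt_tokens": 800, "completion_tokens": 200, "cost_usd": 0.5},
    {"virtual_key": "team-support", "provider": "openai",
     "prompt_tokens": 500, "completion_tokens": 100, "cost_usd": 0.125},
]

summary = spend_by_virtual_key(sample_logs)
```

The same function works unchanged when a key's traffic spans several providers, since the grouping ignores the provider field entirely.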

Provider error rates are tracked continuously. When a provider returns a 5xx or a rate limit error and automatic fallback kicks in, that event is recorded against the originating virtual key and the failing provider, so teams can see both the user-facing impact and the provider-level instability over time. Semantic cache hit rates are surfaced alongside request counts, showing exactly how much token spend cache hits are avoiding. Rate limit consumption per key is tracked in real time against the configured rate limits, so teams can see how close any consumer is to hitting its ceiling before requests start failing.
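The cache and rate limit figures described here reduce to simple ratios. The sketch below shows the arithmetic under assumed inputs; the function and parameter names are hypothetical and do not correspond to Bifrost metric names.

```python
def cache_hit_rate(cache_hits: int, total_requests: int) -> float:
    """Fraction of requests answered from the semantic cache."""
    return cache_hits / total_requests if total_requests else 0.0


def avoided_spend_usd(cache_hits: int, avg_uncached_cost_usd: float) -> float:
    """Estimate spend avoided, assuming each hit would have cost the
    average price of an uncached request."""
    return cache_hits * avg_uncached_cost_usd


def rate_limit_utilization(requests_in_window: int, limit_per_window: int) -> float:
    """Share of the configured limit already used in the current window."""
    if limit_per_window <= 0:
        raise ValueError("limit_per_window must be positive")
    return requests_in_window / limit_per_window
```

A key at 0.9 utilization, for example, has used 90% of its ceiling for the window and is a candidate for a warning before requests start failing.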

Exporting Metrics to Existing Observability Stacks

Bifrost does not require teams to adopt a new monitoring platform. It exposes metrics in the formats that production observability stacks already consume.

The Prometheus metrics endpoint is available out of the box. Configure your Prometheus scrape interval (every 15 or 30 seconds is typical) and all request counts, token totals, latency histograms, error rates, and cache hit rates flow into existing Grafana dashboards or any other Prometheus-compatible visualization layer. Alert rules can be written against these metrics using standard PromQL without any vendor-specific query language.
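An error rate alert on scraped counters boils down to dividing the increase in an error counter by the increase in a request counter over the same window. The sketch below reproduces that arithmetic for two consecutive scrapes, a simplified version of what dividing two PromQL `rate()` results computes. The counter values are hypothetical inputs, not actual Bifrost metric names.

```python
def counter_increase(previous: float, current: float) -> float:
    """Increase of a monotonic counter between two scrapes.

    A counter that went down was reset (for example by a restart),
    so everything counted since the reset is treated as the increase.
    """
    return current if current < previous else current - previous


def error_ratio(prev_errors: float, cur_errors: float,
                prev_total: float, cur_total: float) -> float:
    """Share of requests in the scrape window that ended in an error."""
    total = counter_increase(prev_total, cur_total)
    if total == 0:
        return 0.0
    return counter_increase(prev_errors, cur_errors) / total
```

With a 15-second scrape interval, a ratio computed this way reflects only the most recent 15 seconds of traffic, which is why alert rules usually evaluate it over a longer window to avoid firing on a single bad scrape.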

OpenTelemetry / OTLP export sends traces and metrics to Grafana Tempo, Honeycomb, New Relic, Jaeger, or any OTLP-compatible backend. Each LLM request becomes a trace span with all relevant attributes attached, so AI calls appear in the same distributed tracing view as the rest of the application stack. A slow LLM response shows up in the same waterfall as the database queries and API calls that belong to the same request.

The Datadog connector maps Bifrost's request telemetry to Datadog's LLM Observability schema and APM trace format. Token usage and cost attribution land in the LLM Observability dashboard without any custom mapping, and the connector links LLM spans to the broader APM traces they belong to so teams can see the full request context.

Log exports ship raw request logs to S3, GCS, BigQuery, or other data lakes for long-term retention, custom analytics, and compliance archiving. Teams that need 12-month retention for audit purposes or want to run cost attribution queries against historical data can route logs to their preferred storage layer without any additional middleware.

Tracing AI Spend Per Team, Application, and Model

Virtual keys are the mechanism that makes spend attribution work without any changes to application code. Each consumer, whether a team, a specific application, or an individual user, gets its own virtual key. All requests made with that key are tagged with the key identity at the gateway layer, so every token count and cost figure in the observability data carries a consumer identifier.

The result is that per-team and per-application spend dashboards are available without requiring each team to implement its own logging or tagging logic. When a new model is added or a new provider is onboarded, the attribution still works because the virtual key is validated and recorded by the gateway before the request is forwarded to any provider.

Model-level breakdowns follow from the same data. Because the gateway records the exact model used on every call, it is straightforward to see which models drive the most token spend, which are called most frequently, and which carry the highest per-request cost. Comparisons across providers for equivalent models, such as GPT-4o versus Claude Sonnet, are available in a single view without cross-referencing two billing consoles.

Alerting on AI Spend and Error Rates

Collecting observability data is only useful if it drives action before problems compound. The metrics Bifrost exports map directly to alert conditions that matter for production AI systems.

Per-key budget limits prevent overspend before it happens. When a virtual key exhausts its monthly or daily token budget, the gateway enforces the limit by rejecting the request rather than passing it to the provider. Pair budget limits with a Prometheus alert on the budget consumption metric and the on-call team gets a notification before the limit is hit, with enough lead time to adjust the budget or investigate the usage spike.
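One way to get that lead time is to alert on either the fraction of budget consumed or the projected time until exhaustion at the current burn rate. The following is a minimal sketch of that decision logic; the thresholds (80% consumed, six hours remaining) and all function names are illustrative assumptions, not Bifrost defaults.

```python
import math


def budget_consumed_fraction(spent: float, budget: float) -> float:
    """Fraction of the budget already used."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return spent / budget


def hours_until_exhausted(spent: float, budget: float,
                          spend_last_hour: float) -> float:
    """Project time to exhaustion, assuming the last hour's burn rate holds."""
    remaining = max(budget - spent, 0.0)
    if spend_last_hour <= 0:
        return math.inf
    return remaining / spend_last_hour


def should_alert(spent: float, budget: float, spend_last_hour: float,
                 fraction_threshold: float = 0.8,
                 min_hours_left: float = 6.0) -> bool:
    """Alert when most of the budget is gone or it will run out soon."""
    fraction = budget_consumed_fraction(spent, budget)
    hours_left = hours_until_exhausted(spent, budget, spend_last_hour)
    return fraction >= fraction_threshold or hours_left < min_hours_left
```

The burn-rate condition catches a sudden usage spike early in the budget period, when the consumed fraction alone would still look healthy.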

Rate limit error spikes (HTTP 429s from providers) indicate provider-side throttling that may not yet be visible in provider dashboards. Alerting on the rate limit error rate from Bifrost's Prometheus metrics catches throttling events in real time. Fallback activation frequency is a related signal: a sudden increase in the number of requests that trigger automatic fallback indicates provider instability, even if the provider's own status page has not been updated.

P99 latency anomalies are detectable from Bifrost's latency histograms. A provider-side degradation that doubles response times will appear in the latency percentile metrics, often before it surfaces in user-facing error rates. Datadog monitors, Prometheus alerting rules, and OTLP-based alerting in Honeycomb or New Relic all consume these metrics natively, so teams can set up latency and error rate alerts using the same tooling they already use for the rest of the stack.
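Percentiles are not stored directly in a histogram; they are estimated from cumulative bucket counts. The sketch below shows the usual approach, linear interpolation inside the bucket that contains the target rank, in the spirit of PromQL's `histogram_quantile`. The bucket boundaries and counts are made-up sample data, and this simplified version is an illustration rather than a reimplementation of Prometheus.

```python
import math


def histogram_quantile(q: float, buckets) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs
    sorted by upper bound, ending with a math.inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the overflow bucket: the best available
                # estimate is the highest finite bound.
                return lower_bound
            # Interpolate linearly within the bucket holding the rank.
            share = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * share
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical latency buckets in seconds: 90 requests finished within
# 0.5 s, 98 within 1 s, and all 100 within 2 s.
latency_buckets = [(0.5, 90), (1.0, 98), (2.0, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, latency_buckets)
```

Because the estimate is interpolated, its accuracy depends on how finely the buckets are spaced around the latencies that matter for the alert.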

Compliance Audit Logging for AI Requests

Audit logs in Bifrost are immutable records of every request that passed through the gateway: which virtual key made the request, which provider and model handled it, the timestamp, the token counts, and the response status. These records support SOC 2 Type II evidence collection, HIPAA access logging requirements, ISO 27001 operational records, and GDPR data processing records without requiring any additional instrumentation in application code.

For organizations with long-term retention requirements, log exports ship audit logs to S3, GCS, BigQuery, or other data lakes where they can be retained for the duration required by the applicable compliance program. The export schema is consistent and documented, so compliance and security teams can write queries against historical request data without needing to understand provider-specific log formats. Bifrost's enterprise capabilities cover the full audit and governance surface that security-conscious organizations require.

Get Full Visibility with Bifrost

Organizations running AI in production across multiple providers need request-level observability, per-consumer attribution, and real-time alerting. Bifrost provides all of this from a single gateway, with export to Prometheus, OpenTelemetry, Datadog, and data lakes such as S3, GCS, and BigQuery, so teams can see exactly what their AI systems are doing without building custom aggregation pipelines.

To see how Bifrost fits into your observability stack, schedule a demo with the team, or explore the benchmarks and governance resources to understand the full scope of what the gateway covers.