Built-In Observability for Every LLM Request
Most LLM observability problems start with a gap in coverage: traces exist for some services, some providers, or some code paths, but not for every request that leaves the application. When a team runs three or more LLM providers, instruments them with different SDKs, and ships changes weekly, the result is partial visibility into spend, latency, and failures. Bifrost, the open-source LLM gateway built in Go by Maxim AI, captures structured observability for every request that passes through it, with no per-service instrumentation and no logging work added to the request path. This post covers how the LLM gateway approach makes observability the default rather than a follow-up project, and how to wire Bifrost into existing monitoring stacks.
What Is Built-In LLM Observability
Built-in LLM observability is request-level tracing, metrics, and logging that the LLM gateway captures automatically for every model call, without requiring changes to application code. Instead of instrumenting each service that talks to a provider, teams route traffic through a single control point that records inputs, outputs, tokens, cost, and latency for every request and exports that data to existing tools.
LLM observability differs from traditional application monitoring in what it has to record. A single request carries a prompt, model parameters, a provider and model selection, token counts on both sides, a USD cost, and a latency profile that includes time to first token for streaming responses. Capturing this consistently across providers is the core challenge, and it is the reason gateway-level capture is more reliable than scattered per-service instrumentation.
The shift toward a shared standard is underway across the industry. The OpenTelemetry GenAI semantic conventions define a consistent vocabulary for LLM spans, including request model, input and output token usage, and finish reasons, so traces are comparable across frameworks and vendors. Bifrost emits traces in this format, which means observability data from the gateway lands in your backend already aligned with the standard.
Why Per-Request Observability Matters for AI Teams
Per-request observability matters because LLM spend, latency, and failure modes are difficult to predict and easy to miss in aggregate dashboards. A feature that costs a fraction of a cent per call becomes a five-figure monthly line item at scale, and the cost is invisible until someone breaks it down by user, feature, model, or provider. The same is true for latency regressions and silent error patterns that only appear when you can drill into individual traces.
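To make the scale effect concrete, here is a small arithmetic sketch. The per-call cost and daily volume are hypothetical figures chosen for illustration, not measurements from any deployment.

```go
package main

import "fmt"

// monthlyCostUSD projects spend from a per-call cost and a daily call volume.
func monthlyCostUSD(costPerCallUSD float64, callsPerDay, days int) float64 {
	return costPerCallUSD * float64(callsPerDay) * float64(days)
}

func main() {
	// Hypothetical inputs: 0.4 cents per call, 150,000 calls per day, 30 days.
	fmt.Printf("$%.2f per month\n", monthlyCostUSD(0.004, 150_000, 30))
}
```

With these assumed inputs the projection comes to about $18,000 per month, which is the kind of line item that stays hidden until it is broken down per feature or per user.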
The operational case for treating cost as a first-class metric is well documented. Engineering teams running inference in production report that latency, tokens per second, and cost are the three metrics that most directly drive infrastructure and budget decisions. When these are not captured for every request, teams optimize against averages and miss the requests that actually drive spend.
Gateway-level capture solves three problems that per-service instrumentation tends to leave open:
- Coverage gaps: every request through the gateway is recorded, so there is no provider or code path that quietly goes untraced.
- Inconsistent schemas: each provider SDK reports tokens and errors differently; the gateway normalizes them into one structure.
- Instrumentation drift: new services do not need to add tracing code, because observability is a property of the routing layer, not of each application.
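The schema point can be illustrated with a minimal sketch. The `Usage` type and `normalize` function are hypothetical names for this example and are not Bifrost's internal types; the JSON field names follow the usage objects that the OpenAI and Anthropic APIs return.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Usage is a hypothetical normalized record shared by all providers.
type Usage struct {
	Provider     string
	InputTokens  int
	OutputTokens int
}

// normalize maps a provider-specific usage payload onto the shared shape.
func normalize(provider string, raw []byte) (Usage, error) {
	switch provider {
	case "openai":
		var u struct {
			Prompt     int `json:"prompt_tokens"`
			Completion int `json:"completion_tokens"`
		}
		if err := json.Unmarshal(raw, &u); err != nil {
			return Usage{}, err
		}
		return Usage{Provider: provider, InputTokens: u.Prompt, OutputTokens: u.Completion}, nil
	case "anthropic":
		var u struct {
			Input  int `json:"input_tokens"`
			Output int `json:"output_tokens"`
		}
		if err := json.Unmarshal(raw, &u); err != nil {
			return Usage{}, err
		}
		return Usage{Provider: provider, InputTokens: u.Input, OutputTokens: u.Output}, nil
	default:
		return Usage{}, fmt.Errorf("unsupported provider %q", provider)
	}
}

func main() {
	a, _ := normalize("openai", []byte(`{"prompt_tokens":12,"completion_tokens":30}`))
	b, _ := normalize("anthropic", []byte(`{"input_tokens":12,"output_tokens":30}`))
	// Both records now expose the same fields regardless of provider.
	fmt.Println(a.InputTokens == b.InputTokens, a.OutputTokens == b.OutputTokens)
}
```

Doing this mapping once at the routing layer means downstream dashboards only ever see one field layout.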
Centralizing visibility this way is the same architectural argument that makes Bifrost a strong fit as a unified AI gateway for enterprises: one control point for routing, governance, and the observability data that comes with it.
How Bifrost Captures Observability for Every LLM Request
Bifrost includes built-in observability that automatically captures detailed information about every request and response flowing through the gateway. The logging plugin records input messages, model parameters, the provider and model that served the request, output messages and tool calls, token usage, cost, latency, and success or error status. This happens without changes to application code, because the capture occurs at the routing layer rather than inside each service.
The capture is asynchronous by design. Logs are written in background goroutines using pooled allocations, so recording observability data does not add latency to the request path. Bifrost adds roughly 11 microseconds of overhead per request at 5,000 requests per second in sustained benchmarks, and the logging plugin operates outside that hot path.
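The general pattern described here, a background goroutine fed by pooled entries, can be sketched as follows. This is an illustration of the technique and not Bifrost's actual implementation: `AsyncLogger`, `LogEntry`, and the choice to drop entries when the queue is full are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// LogEntry is a hypothetical, trimmed-down log record.
type LogEntry struct {
	Model  string
	Tokens int
}

// AsyncLogger hands entries to a background goroutine via a buffered channel.
type AsyncLogger struct {
	pool    sync.Pool
	queue   chan *LogEntry
	wg      sync.WaitGroup
	written []LogEntry // stand-in for a real log store
}

func NewAsyncLogger(buffer int) *AsyncLogger {
	l := &AsyncLogger{queue: make(chan *LogEntry, buffer)}
	l.pool.New = func() any { return new(LogEntry) }
	l.wg.Add(1)
	go l.run()
	return l
}

// run drains the queue off the request path and recycles each entry.
func (l *AsyncLogger) run() {
	defer l.wg.Done()
	for e := range l.queue {
		l.written = append(l.written, *e) // copy before reuse
		*e = LogEntry{}
		l.pool.Put(e)
	}
}

// Record never blocks the caller. Assumption: a full queue drops the entry.
func (l *AsyncLogger) Record(model string, tokens int) bool {
	e := l.pool.Get().(*LogEntry)
	e.Model, e.Tokens = model, tokens
	select {
	case l.queue <- e:
		return true
	default:
		*e = LogEntry{}
		l.pool.Put(e)
		return false
	}
}

// Close stops the worker and returns everything that was written.
func (l *AsyncLogger) Close() []LogEntry {
	close(l.queue)
	l.wg.Wait()
	return l.written
}

func main() {
	l := NewAsyncLogger(8)
	l.Record("example-model", 120)
	l.Record("example-model", 45)
	fmt.Println(len(l.Close()), "entries written")
}
```

The request handler only pays for a pool lookup and a non-blocking channel send; the write itself happens in the worker. Whether a full queue should drop, block, or spill to disk is a design decision the sketch does not try to settle.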
What Bifrost records for every request includes:
- Request data: complete input messages, temperature, max tokens, tools, and the provider and model that handled the call.
- Response data: output messages, tool call results, latency, token usage, and status.
- Retry and key selection: which API key served the request, the number of retries, and an ordered trail of every attempt with its failure reason.
- Custom metadata: configured request headers (for example a tenant ID) captured into log metadata for per-tenant filtering.
- Multimodal and tool support: audio transcription, vision inputs, and function calling arguments and results.
Because this data is structured and searchable, teams can debug a specific failed request, trace a multi-turn session, or analyze which model and provider combinations drive cost. For deeper visibility into where requests are routed and why, Bifrost also records the routing engine and rule that matched each call, which connects observability directly to governance and access control.
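The custom metadata item in the list above amounts to copying an allowlist of request headers into each log record. A minimal sketch of that idea, assuming a hypothetical `captureMetadata` helper and a hypothetical `X-Tenant-Id` header (neither is taken from Bifrost's configuration):

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// captureMetadata copies only the allowlisted headers into log metadata,
// so credentials and other headers never reach the log store.
func captureMetadata(h http.Header, allowed []string) map[string]string {
	meta := make(map[string]string)
	for _, name := range allowed {
		if v := h.Get(name); v != "" {
			meta[strings.ToLower(name)] = v
		}
	}
	return meta
}

func main() {
	h := http.Header{}
	h.Set("X-Tenant-Id", "acme")          // hypothetical tenant header
	h.Set("Authorization", "Bearer xxxx") // not allowlisted, so not captured
	fmt.Println(captureMetadata(h, []string{"X-Tenant-Id"}))
}
```

An explicit allowlist keeps the captured metadata predictable and makes per-tenant filtering a simple key lookup.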
Exporting Observability Data to Your Stack
Bifrost exports observability data through open standards, so the gateway plugs into existing monitoring infrastructure rather than replacing it. Three export paths cover most production setups: OpenTelemetry for distributed tracing, Prometheus for metrics, and object storage for long-term log retention.
OpenTelemetry and OTLP tracing
Bifrost's OpenTelemetry integration sends LLM traces to any OTLP collector, including Grafana Cloud, New Relic, Honeycomb, and self-hosted collectors. Traces follow the OpenTelemetry GenAI semantic conventions, so LLM operations correlate with the rest of your application telemetry in the same backend. Requests can be tagged with a session ID and grouped into a single trace, which is useful for viewing a multi-turn conversation or an agent run as one unit.
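For a sense of what a span in this format carries, the sketch below builds a plain attribute map for one call. The attribute keys are taken from the OpenTelemetry GenAI semantic conventions as published; those conventions are still evolving, so check the version your backend expects. The `Call` type and `genAIAttributes` function are hypothetical, and the exact attribute set Bifrost emits may differ.

```go
package main

import "fmt"

// Call is a hypothetical summary of one completed model request.
type Call struct {
	Model        string
	InputTokens  int
	OutputTokens int
	FinishReason string
}

// genAIAttributes maps a call onto GenAI semantic-convention attribute keys.
func genAIAttributes(c Call) map[string]any {
	return map[string]any{
		"gen_ai.operation.name":          "chat",
		"gen_ai.request.model":           c.Model,
		"gen_ai.usage.input_tokens":      c.InputTokens,
		"gen_ai.usage.output_tokens":     c.OutputTokens,
		"gen_ai.response.finish_reasons": []string{c.FinishReason},
	}
}

func main() {
	attrs := genAIAttributes(Call{Model: "example-model", InputTokens: 120, OutputTokens: 45, FinishReason: "stop"})
	fmt.Println(attrs["gen_ai.request.model"], attrs["gen_ai.usage.input_tokens"])
}
```

Because the keys are standardized, a backend can aggregate token usage across gateways and frameworks without per-vendor mapping.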
The value of standards alignment here is practical. As OpenTelemetry's own analysis of GenAI observability describes, a shared convention for spans and attributes lets teams correlate model calls, agent steps, and tool execution without vendor-specific glue. Emitting traces in this format at the gateway means every request is observable in the tools a platform team already runs.
Prometheus metrics
Bifrost exposes native Prometheus metrics on a /metrics endpoint when the telemetry plugin is enabled, which is the default. Metrics cover request totals, success and error counts, upstream latency histograms, input and output token counters, cache hit rates, and total cost in USD, each labeled by provider, model, virtual key, and team. For multi-node deployments, Bifrost can push metrics to a Prometheus Push Gateway so aggregation stays accurate behind a load balancer. These metrics work directly with Grafana dashboards and alerting.
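A scrape of a /metrics endpoint returns lines in the Prometheus text exposition format. The sketch below renders one labeled sample in that format using only the standard library. The metric name and label values are hypothetical placeholders, not Bifrost's actual metric names, and a real service would normally use a Prometheus client library rather than formatting lines by hand.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// labelEscaper applies the escaping the text format requires in label values.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// formatSample renders one sample line: name{label="value",...} value
func formatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable label order for readable, diffable output

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, k+`="`+labelEscaper.Replace(labels[k])+`"`)
	}
	return name + "{" + strings.Join(pairs, ",") + "} " + strconv.FormatFloat(value, 'g', -1, 64)
}

func main() {
	// Hypothetical metric name and labels, for illustration only.
	fmt.Println(formatSample("example_llm_requests_total",
		map[string]string{"provider": "example-provider", "model": "example-model"}, 42))
}
```

Each distinct label combination becomes its own time series, which is what makes per-provider and per-model breakdowns possible in Grafana.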
The telemetry layer also captures streaming-specific latency, including time to first token and inter-token latency, which are the metrics that matter most for user-facing response speed. As with logging, metrics collection is asynchronous and adds no latency to request processing.
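Both streaming metrics can be derived from chunk arrival timestamps. The sketch below shows one way to compute them; `streamLatency` and `sampleStream` are hypothetical helpers, and the example treats each streamed chunk as one token, which is a simplification.

```go
package main

import (
	"fmt"
	"time"
)

// streamLatency returns time to first token and the mean gap between chunks.
func streamLatency(start time.Time, chunks []time.Time) (ttft, meanGap time.Duration) {
	if len(chunks) == 0 {
		return 0, 0
	}
	ttft = chunks[0].Sub(start)
	if len(chunks) > 1 {
		span := chunks[len(chunks)-1].Sub(chunks[0])
		meanGap = span / time.Duration(len(chunks)-1)
	}
	return ttft, meanGap
}

// sampleStream builds a synthetic stream: chunks at 200, 250, 300, and 350 ms.
func sampleStream() (time.Time, []time.Time) {
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	var chunks []time.Time
	for _, ms := range []int{200, 250, 300, 350} {
		chunks = append(chunks, start.Add(time.Duration(ms)*time.Millisecond))
	}
	return start, chunks
}

func main() {
	start, chunks := sampleStream()
	ttft, gap := streamLatency(start, chunks)
	fmt.Println("time to first token:", ttft, "mean inter-token latency:", gap)
}
```

For the synthetic stream above, time to first token is 200 ms and the mean inter-token gap is 50 ms. Tracking the two separately matters because a slow first token and a slow stream usually have different causes.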
Datadog and log exports
For teams standardized on Datadog, the Datadog connector provides native integration across three pillars: APM traces via dd-trace-go with W3C Trace Context, native LLM Observability, and operational metrics through DogStatsD or the Metrics API. It runs in agent mode through a local Datadog Agent or in agentless mode that sends data directly to Datadog's APIs.
For long-term retention, Bifrost supports log exports to object storage. Large request and response payloads stream to Amazon S3 or Google Cloud Storage while the logs database keeps searchable metadata, indexes, and pointers. This keeps the database small and fast and makes archived traffic cheap to retain and query from your own data lake.
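The split between a small searchable row and an offloaded payload can be sketched like this. Everything here is hypothetical: the `BlobStore` interface, the in-memory `memStore` that stands in for S3 or Google Cloud Storage, the `LogRow` fields, and the key layout are illustrative and do not describe Bifrost's storage schema.

```go
package main

import "fmt"

// BlobStore is a hypothetical abstraction over an object storage bucket.
type BlobStore interface {
	Put(key string, data []byte) error
}

// memStore is an in-memory stand-in for a real bucket.
type memStore map[string][]byte

func (m memStore) Put(key string, data []byte) error {
	m[key] = append([]byte(nil), data...) // copy so callers can reuse the buffer
	return nil
}

// LogRow is what stays in the logs database: metadata plus a pointer.
type LogRow struct {
	RequestID    string
	Model        string
	PayloadBytes int
	PayloadKey   string
}

// archive uploads the payload and returns the compact row that references it.
func archive(store BlobStore, requestID, model string, payload []byte) (LogRow, error) {
	key := "logs/" + requestID + ".json" // hypothetical key layout
	if err := store.Put(key, payload); err != nil {
		return LogRow{}, err
	}
	return LogRow{RequestID: requestID, Model: model, PayloadBytes: len(payload), PayloadKey: key}, nil
}

func main() {
	store := memStore{}
	row, err := archive(store, "req-1", "example-model", []byte(`{"prompt":"hi"}`))
	fmt.Println(row.PayloadKey, row.PayloadBytes, err)
}
```

The database row stays a fixed, small size no matter how large the prompt or response was, and the full payload can still be fetched by key when a specific request needs inspection.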
Observability and Governance at Enterprise Scale
Built-in observability becomes more valuable when it is paired with the governance and deployment controls that enterprises require. Because Bifrost records the virtual key, team, and customer associated with every request, the same data that powers cost and latency dashboards also drives budget enforcement and per-tenant attribution. Observability and governance are two views of the same request stream.
For regulated environments, Bifrost supports in-VPC deployments so that observability data never leaves private infrastructure, along with audit logs that maintain immutable trails for SOC 2, GDPR, HIPAA, and ISO 27001 compliance. Teams that need to keep raw prompts out of long-term storage can exclude specific fields from log export while still retaining the metadata required for monitoring and cost analysis.
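Field exclusion of this kind can be pictured as a filter applied to each record before export. The sketch below is a generic illustration with a hypothetical `redact` function and hypothetical field names; it is not Bifrost's configuration surface, and it only filters top-level fields.

```go
package main

import "fmt"

// redact returns a copy of the entry without the excluded top-level fields.
// The original entry is left untouched.
func redact(entry map[string]any, excluded ...string) map[string]any {
	drop := make(map[string]struct{}, len(excluded))
	for _, f := range excluded {
		drop[f] = struct{}{}
	}
	out := make(map[string]any, len(entry))
	for k, v := range entry {
		if _, skip := drop[k]; !skip {
			out[k] = v
		}
	}
	return out
}

func main() {
	entry := map[string]any{
		"model":         "example-model",
		"input_tokens":  120,
		"cost_usd":      0.004,
		"input_message": "raw prompt text", // hypothetical sensitive field
	}
	exported := redact(entry, "input_message")
	fmt.Println(len(exported), exported["model"], exported["input_message"] == nil)
}
```

The exported record keeps the model, token, and cost fields needed for monitoring while the raw prompt never reaches long-term storage.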
Positioning the LLM gateway as the single point for routing, governance, and observability is what makes Bifrost a fit for large teams and regulated industries. Maxim AI, the team behind Bifrost, also builds an evaluation and observability platform for AI agents, but at the infrastructure layer Bifrost is where every LLM request is routed, governed, and recorded.
How does gateway observability differ from SDK instrumentation?
Gateway observability captures data at the routing layer, so every request through the gateway is recorded regardless of which service or SDK sent it. SDK instrumentation requires adding tracing code to each application and tends to leave coverage gaps as new services ship.
Does built-in observability add latency to requests?
No. Bifrost's logging and telemetry plugins run asynchronously in background goroutines, so observability capture happens outside the request path and does not add latency. The gateway itself adds roughly 11 microseconds of overhead per request at 5,000 requests per second.
Can Bifrost send LLM traces to my existing observability tools?
Yes. Bifrost exports OpenTelemetry traces in the GenAI semantic convention format to any OTLP collector, exposes Prometheus metrics on a /metrics endpoint, and provides a native Datadog connector for APM traces, LLM Observability, and metrics.
Start Building with Bifrost
LLM observability does not have to be a separate instrumentation project layered on top of every service. With the LLM gateway approach, Bifrost captures tokens, cost, latency, and request traces for every model call by default, then exports that data through OpenTelemetry, Prometheus, Datadog, and object storage into the tools your team already runs. For platform teams evaluating gateways, the LLM Gateway Buyer's Guide and the Bifrost benchmarks detail the performance and capability profile in depth.
To see how built-in observability for every LLM request works in your environment, book a demo with the Bifrost team.