Managing LLM Traffic: Understanding and Applying Rate Limits

LLM rate limits cap the AI traffic you can send before a 429 error stops it. Bifrost enforces provider limits and your own governance limits from one gateway.

Every LLM provider caps how many requests and tokens an account can send per minute, and crossing that cap returns a 429 Too Many Requests error that halts the request. As teams move AI features from prototype to production, these LLM rate limits become one of the most frequent sources of failed requests, latency spikes, and stalled agent runs. Bifrost, the open-source AI gateway built in Go by Maxim AI, is designed for enterprise teams that need to manage this traffic centrally: it absorbs provider rate limits through retries and failover, and it lets you apply your own request and token limits across every model, key, and team from a single control point. This post explains how rate limits work, why they break production traffic, and how to apply them cleanly at the gateway layer.

What Are LLM Rate Limits?

LLM rate limits are provider-enforced ceilings on how much traffic an account can send within a given time window, measured mainly in requests per minute (RPM) and tokens per minute (TPM). When a request would exceed either ceiling, the provider rejects it with a 429 status instead of processing it.

Most providers enforce several dimensions at once, and any one of them can trip first:

  • Requests per minute (RPM): the raw count of API calls in a rolling 60-second window, regardless of size.
  • Tokens per minute (TPM): the combined input and output tokens processed per minute. A single long-context call can consume most of a TPM budget on its own.
  • Requests per day (RPD) and tokens per day (TPD): daily ceilings, most common on free and lower tiers.
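
The interaction between these dimensions can be sketched as a small rolling-window limiter. This is a minimal illustration, not any provider's actual accounting: the class name `SlidingWindowLimiter`, the limits, and the token counts are all hypothetical, and real providers may estimate tokens differently.

```python
from collections import deque


class SlidingWindowLimiter:
    """Tracks admitted requests and tokens over a rolling window (hypothetical sketch)."""

    def __init__(self, rpm_limit, tpm_limit, window_seconds=60.0):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.window = window_seconds
        self.events = deque()  # (timestamp, tokens) for each admitted request
        self.tokens_in_window = 0

    def _evict(self, now):
        # Drop events that have aged out of the rolling window.
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.tokens_in_window -= tokens

    def check(self, now, tokens):
        """Record the request and return "ok", or name the limit that tripped."""
        self._evict(now)
        if len(self.events) + 1 > self.rpm_limit:
            return "rpm_exceeded"
        if self.tokens_in_window + tokens > self.tpm_limit:
            return "tpm_exceeded"
        self.events.append((now, tokens))
        self.tokens_in_window += tokens
        return "ok"


limiter = SlidingWindowLimiter(rpm_limit=3, tpm_limit=10_000)
results = [
    limiter.check(now=0.0, tokens=1_000),
    limiter.check(now=1.0, tokens=8_500),   # one long-context call
    limiter.check(now=2.0, tokens=1_000),   # TPM trips while RPM still has room
    limiter.check(now=61.5, tokens=1_000),  # earlier calls have aged out
]
print(results)  # ['ok', 'ok', 'tpm_exceeded', 'ok']
```

The third call is rejected on tokens even though it is only the third request in the window, which is the "any one of them can trip first" behavior described above.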

These limits are typically set at the organization level and scale with account tier. OpenAI documents RPM, TPM, RPD, and TPD dimensions that increase as cumulative spend moves an account up its usage tiers. Anthropic measures the Claude API in RPM, input tokens per minute, and output tokens per minute, returning a 429 with a retry-after header when a rate limit is exceeded. Azure OpenAI defines TPM and RPM quotas per region, per subscription, and per model. The exact numbers differ, but the failure mode is the same across providers.

Managing these ceilings well is a core part of AI traffic governance, and it involves two separate jobs: staying under the limits providers set, and enforcing the limits you set on your own consumers.

Why Rate Limits Break Production AI Traffic

Rate limits rarely cause problems in development, where traffic is light and predictable. They surface in production, where concurrency is high and traffic is bursty. Three properties make them especially disruptive for AI teams.

First, limits are usually shared across the whole organization. Every API key under one account draws from the same RPM and TPM pool, so adding more keys does not add capacity, and a background batch job can starve interactive user traffic without either team noticing until requests start failing.

Second, token limits are hard to predict. RPM depends only on call count, but TPM depends on prompt length plus generated output. A conversation that grows longer, a larger uploaded document, or a higher max_tokens setting can push a request over the TPM ceiling even when the request count is well within RPM.
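
A quick worked calculation makes the point concrete. The limits and workload numbers below are hypothetical, and the sketch assumes the worst case in which every call is charged its prompt tokens plus its full max_tokens allowance.

```python
# Hypothetical account limits, chosen only for illustration.
RPM_LIMIT = 500
TPM_LIMIT = 200_000


def utilization(requests_per_minute, prompt_tokens, max_tokens):
    """Fraction of each limit consumed, assuming every call uses its full max_tokens."""
    tokens_per_minute = requests_per_minute * (prompt_tokens + max_tokens)
    return {
        "rpm_used": requests_per_minute / RPM_LIMIT,
        "tpm_used": tokens_per_minute / TPM_LIMIT,
    }


# Same request rate, but the conversation history has grown from 1,000 to 3,000 tokens.
short_chat = utilization(100, prompt_tokens=1_000, max_tokens=500)
long_chat = utilization(100, prompt_tokens=3_000, max_tokens=500)

print(short_chat)  # {'rpm_used': 0.2, 'tpm_used': 0.75}
print(long_chat)   # {'rpm_used': 0.2, 'tpm_used': 1.75}
```

In both cases the workload uses only 20% of the request budget, yet the longer prompts push token demand to 175% of the TPM ceiling.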

Third, naive retries make the problem worse. Failed requests often still count against the quota, so a client that immediately resends after a 429 burns more budget while the window is still saturated. Without coordinated backoff and failover, a brief spike turns into a sustained outage.
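
A coordinated backoff schedule can be sketched as follows. This is a generic "exponential backoff with full jitter" illustration, not Bifrost's internal code; the function name `backoff_delay`, the base and cap values, and the choice to let a retry-after hint take priority are assumptions.

```python
import random


def backoff_delay(attempt, base=0.5, cap=30.0, retry_after=None, rng=random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # Assumption: a server-provided retry-after hint overrides the computed delay.
        return float(retry_after)
    # The upper bound doubles with each attempt, up to `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling] so clients do not retry in lockstep.
    return rng.uniform(0.0, ceiling)


rng = random.Random(0)  # seeded only to make the demo repeatable
delays = [backoff_delay(attempt, rng=rng) for attempt in range(8)]
hinted = backoff_delay(3, retry_after=7)  # returns 7.0
```

Spreading retries randomly inside a growing window keeps many clients from hitting the provider again at the same instant, which is what turns a short spike into a sustained one.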

Two Sides of Rate Limiting: Provider Limits and Your Own Governance

Managing LLM traffic means dealing with rate limits in two directions at once, and treating them as separate problems is the key to handling both well.

  • Provider limits you must respect. These are the RPM and TPM ceilings imposed by OpenAI, Anthropic, Azure, and others. You cannot change them directly (you can only request higher tiers), so the job is to stay under them and recover gracefully when you hit them.
  • Governance limits you impose. These are the caps you set on your own consumers: a per-team request ceiling, a per-project token budget, a per-model spending limit. The job here is enforcement, so that no single application, user, or experiment can exhaust shared capacity or run up unexpected cost.

An AI gateway sits between your applications and the providers, which is the natural place to solve both. Bifrost handles the provider side by pooling keys and failing over, and it handles the governance side by enforcing your own limits before a request ever leaves the gateway. The rest of this post covers each in turn.

How Bifrost Handles Upstream Provider Rate Limits

Bifrost absorbs provider rate limits through three coordinated mechanisms, so a 429 from one provider or key does not become a failed request for your application.

When a provider returns a 429, Bifrost classifies it as a per-key failure and rotates to a different API key from the pool, applying backoff because account-level quotas are often shared across keys. Its retries and fallbacks use exponential backoff with jitter for transient failures, following the pattern that OpenAI, Anthropic, and Azure all recommend for 429 handling. If every retry against the primary provider is exhausted, Bifrost moves to the next provider in your fallback chain, and each fallback provider gets its own full retry budget.
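
The retry-then-fallback ordering described above can be sketched in a few lines. This is a simplified illustration rather than Bifrost's implementation: `RateLimited`, `call_with_fallback`, and the provider callables are hypothetical, and key rotation and jitter are left out to keep the control flow visible.

```python
class RateLimited(Exception):
    """Stands in for an HTTP 429 response from a provider."""


def call_with_fallback(providers, request, max_retries=2, sleep=lambda seconds: None):
    """Try providers in order; each one gets its own full retry budget.

    `sleep` defaults to a no-op so the demo runs instantly; a real caller
    would pass time.sleep.
    """
    attempts = []
    for name, send in providers:
        for attempt in range(max_retries + 1):
            attempts.append(name)
            try:
                return send(request), attempts
            except RateLimited:
                if attempt < max_retries:
                    sleep(0.5 * (2 ** attempt))  # exponential backoff, jitter omitted
    raise RateLimited("all providers in the fallback chain are rate limited")


def always_429(request):
    raise RateLimited()


def echo(request):
    return f"ok:{request}"


response, attempts = call_with_fallback(
    [("primary", always_429), ("secondary", echo)], "hello"
)
print(response)  # ok:hello
print(attempts)  # ['primary', 'primary', 'primary', 'secondary']
```

The primary provider is tried three times (one initial call plus two retries) before the request moves on, and the secondary would have received the same budget had it also failed.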

Underneath this, Bifrost distributes traffic across multiple keys with weighted load balancing, so you can assign higher weights to keys with better rate limits and spread requests before any single key approaches its ceiling. Because Bifrost unifies access to 1000+ models behind one OpenAI-compatible API, provider routing can send overflow traffic to an equivalent model on a different provider without any change to application code. Aggregating independent provider pools in this way raises total throughput beyond what any one account's limit allows.
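
Weighted selection itself is simple to illustrate. The sketch below is a generic weighted random pick, not Bifrost's load balancer; the key names and weights are hypothetical.

```python
import random
from collections import Counter


def pick_key(keys, rng=random):
    """Choose one key name with probability proportional to its weight."""
    names = [key["name"] for key in keys]
    weights = [key["weight"] for key in keys]
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical pool: the key with the higher rate limit gets the larger weight.
keys = [
    {"name": "key-high-limit", "weight": 0.7},
    {"name": "key-low-limit", "weight": 0.3},
]

rng = random.Random(1)  # seeded only to make the demo repeatable
counts = Counter(pick_key(keys, rng=rng) for _ in range(10_000))
```

Over many requests the share of traffic each key receives approaches its weight, so roughly 70% of calls land on the higher-limit key in this example.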

Semantic caching reduces how often you hit provider limits at all: it replays answers for identical or semantically similar requests, cutting the request and token volume that reaches the provider. The gateway itself adds little cost, since Bifrost adds only 11 microseconds of overhead per request at 5,000 requests per second, so this resilience layer does not become a bottleneck of its own.
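
The caching idea can be illustrated with a toy example. This is not how Bifrost computes similarity: a production semantic cache would compare embedding vectors, whereas the hypothetical `similarity` function here uses word overlap purely as a stand-in, and the threshold value is an assumption.

```python
def similarity(a, b):
    """Toy stand-in for embedding similarity: Jaccard overlap of lowercase words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


class SemanticCache:
    """Returns a stored response when a new prompt is close enough to an old one."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # (prompt, response) pairs

    def get(self, prompt):
        best = max(self.entries, key=lambda e: similarity(prompt, e[0]), default=None)
        if best is not None and similarity(prompt, best[0]) >= self.threshold:
            return best[1]  # cache hit: no provider call, no tokens spent
        return None

    def put(self, prompt, response):
        self.entries.append((prompt, response))


cache = SemanticCache()
cache.put("What is the refund policy for annual plans", "cached answer")

near_duplicate = cache.get("what is the refund policy for annual plans today")
unrelated = cache.get("How do I rotate an API key")
```

Here `near_duplicate` is served from the cache and `unrelated` is a miss that would go on to the provider; every hit is one request and its tokens that never count against RPM or TPM.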

Applying Rate Limits with Bifrost Governance

Beyond absorbing provider limits, Bifrost lets you apply your own rate limits to control internal traffic. Rate limiting is part of Bifrost's governance system, managed through virtual keys, the primary access-control entity that each consumer uses in place of a raw provider key.

Bifrost supports two limit types that run in parallel, mirroring the RPM and TPM model that providers use:

  • Request limits: the maximum number of calls allowed within a reset window (for example, 1,000 requests per hour).
  • Token limits: the maximum prompt-plus-completion tokens allowed within a reset window (for example, one million tokens per hour).

Both types can be set at two levels through budget and rate limits: the virtual key itself, and each provider config attached to that key. Limits are checked hierarchically, provider config first and then virtual key, and a request must pass both to proceed. If one provider config exceeds its rate limit, that provider is excluded from routing while other providers on the same key stay available, so a limit on one upstream does not take down the request. Reset windows are flexible, from 1m and 1h for throttling up to 1d, 1w, 1M, and 1Y for longer cycles.
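
The two-level check can be sketched as follows. This is an illustration of the rule as described above, not Bifrost's code: the limit field names are borrowed from the API example in this post, while the `usage` counters, the dictionary layout, and the function names are hypothetical.

```python
def within(usage, limit, tokens_needed):
    """True if one more request of `tokens_needed` tokens fits under the limit."""
    return (
        usage["requests"] + 1 <= limit["request_max_limit"]
        and usage["tokens"] + tokens_needed <= limit["token_max_limit"]
    )


def usable_providers(virtual_key, tokens_needed):
    """Providers that pass the provider-config check, if the virtual key also passes."""
    allowed = []
    for config in virtual_key["provider_configs"]:
        # Level 1: a provider over its own limit is excluded from routing,
        # while the other providers on the same key stay available.
        if within(config["usage"], config["limit"], tokens_needed):
            allowed.append(config["provider"])
    # Level 2: the virtual key's own limit must also have room for the request.
    if not within(virtual_key["usage"], virtual_key["limit"], tokens_needed):
        return []
    return allowed


vk = {
    "limit": {"request_max_limit": 1000, "token_max_limit": 1_000_000},
    "usage": {"requests": 10, "tokens": 50_000},
    "provider_configs": [
        {"provider": "openai",
         "limit": {"request_max_limit": 100, "token_max_limit": 100_000},
         "usage": {"requests": 100, "tokens": 40_000}},  # request limit already reached
        {"provider": "anthropic",
         "limit": {"request_max_limit": 100, "token_max_limit": 100_000},
         "usage": {"requests": 5, "tokens": 2_000}},
    ],
}

routable = usable_providers(vk, tokens_needed=500)
print(routable)  # ['anthropic']
```

The exhausted provider drops out of routing, but the request can still proceed through the remaining provider because the virtual key itself has headroom.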

Configuring a per-provider rate limit through the API looks like this:

# Create a virtual key with an hourly token and request limit on its OpenAI provider config.
curl -X POST "https://your-bifrost-instance.com/api/governance/virtual-keys" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "marketing-team-vk",
    "provider_configs": [
      {
        "provider": "openai",
        "weight": 0.7,
        "rate_limit": {
          "token_max_limit": 1000000,
          "token_reset_duration": "1h",
          "request_max_limit": 1000,
          "request_reset_duration": "1h"
        }
      }
    ]
  }'

For finer control, model limits provide a single interface to set rate limits keyed on a specific model, an optional provider, and a scope. A global scope caps a model across all traffic, while a virtual_key scope caps it for one consumer, which lets platform teams protect a scarce or expensive model without touching everything else. Together, virtual keys, provider configs, and model limits give you the same request-and-token vocabulary the providers use, applied on your terms. The full pattern is covered on the Bifrost governance resource page.

Best Practices for Managing LLM Traffic at Scale

The following practices apply whether you run Bifrost as open source or on Bifrost Enterprise for high-availability, multi-node deployments.

Should I retry on a 429 or fail over to another provider?

Do both, in order. Retry the same provider with exponential backoff and jitter first, since many 429 conditions clear within seconds as the window slides. If retries are exhausted, fail over to an equivalent model on another provider so the request still completes. A gateway lets you configure this once rather than in every service.

How do I stop one team from consuming all the shared quota?

Give each team or project its own virtual key with request and token limits sized to its role. Because Bifrost checks limits before a request reaches the provider, an over-active consumer is throttled at the gateway and never eats into the shared organization quota that interactive traffic depends on.

How do I reduce rate-limit errors without upgrading my tier?

Lower the volume that reaches the provider. Semantic caching removes redundant calls, weighted load balancing spreads traffic across keys, and pre-sizing max_tokens to expected output length keeps token counts from inflating your TPM usage. Pair these with observability so you can see which limit (requests or tokens) is actually binding before you change anything.

Getting Started with Bifrost

LLM rate limits are unavoidable, but failed requests are not. By absorbing provider limits through retries, key rotation, and failover, and by letting you apply request and token limits across every consumer, Bifrost turns rate limiting from a recurring production incident into a configured, observable part of your infrastructure. Managing LLM traffic this way scales from a single application to an enterprise fleet without rewriting application code. You can explore more patterns across the Bifrost resources hub.

To see how the Bifrost AI gateway can handle rate limits and governance for your AI traffic, book a demo with the Bifrost team.