Rate limits and headers
How rate limiting works, what headers mean, and how to handle every response scenario.
The Luma Agents API enforces rate limits to ensure reliable service for all users. This guide explains how the limits work, what every response header means, and how to build integrations that handle rate limits gracefully.
Rate limit types
The API enforces two independent rate limits on POST /v1/generations:
| Limit | Description |
|---|---|
| Requests per minute (RPM) | Maximum number of generation requests per 60-second sliding window |
| Concurrent jobs | Maximum number of active (non-terminal) generations at any time |
Both limits are evaluated per API client. A request must pass both checks to succeed — exceeding either one returns HTTP 429.
How the RPM limit works
The RPM limit uses a sliding window algorithm. Each request is timestamped, and the API counts how many requests occurred in the last 60 seconds. There is no fixed reset boundary — the window slides continuously.
Example timeline (assuming a 30 RPM allowance for illustration):
```text
Time   Requests in window   Outcome
─────  ──────────────────   ──────────────────────────
12:00  1                    Allowed (29 remaining)
12:00  2                    Allowed (28 remaining)
12:00  3                    Allowed (27 remaining)
...
12:00  30                   Allowed (0 remaining)
12:00  31                   429 — Rate limit exceeded
12:01  30                   429 — first request at 12:00 hasn't aged out yet
12:01  29                   Allowed — oldest request aged out of 60s window
```

Because this is a sliding window, you don’t get a full refill at a fixed boundary. Requests age out individually as they pass the 60-second mark.
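As an illustration only (not the server's actual implementation), the sliding-window check can be sketched with a deque of timestamps. The class name and the `allow` method are hypothetical; the limit of 30 is just the example figure used above:

```python
import time
from collections import deque


class SlidingWindowCounter:
    """Client-side sketch of a 60-second sliding-window rate check."""

    def __init__(self, limit=30, window=60.0):
        self.limit = limit
        self.window = window
        self.timestamps = deque()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Age out requests older than the window
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False  # this request would receive a 429
        self.timestamps.append(now)
        return True
```

Note how a rejection at one moment can become an acceptance seconds later, once the oldest timestamp ages past the 60-second mark — exactly the behavior in the timeline above.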
Response headers
On every response
These headers are included on all responses — success or error, on both endpoints:
| Header | Example | Description |
|---|---|---|
| `X-Request-Id` | `550e8400-e29b-41d4-a716-446655440000` | Unique request identifier. Echoes your `X-Request-Id` header if provided; otherwise a server-generated UUID |
| `X-API-Version` | `2026-04-01` | The API version that processed this request |
On successful POST /v1/generations (HTTP 201)
In addition to the standard headers, successful generation submissions include rate limit headers:
| Header | Example | Description |
|---|---|---|
| `X-RateLimit-Limit` | `30` | Your maximum RPM allowance |
| `X-RateLimit-Remaining` | `17` | Requests remaining in the current sliding window |
| `X-RateLimit-Reset` | `1712592060` | Unix timestamp when the current window ends (now + 60 seconds) |
Full response headers example:
```http
HTTP/1.1 201 Created
Content-Type: application/json
X-Request-Id: 550e8400-e29b-41d4-a716-446655440000
X-API-Version: 2026-04-01
X-RateLimit-Limit: 30
X-RateLimit-Remaining: 17
X-RateLimit-Reset: 1712592060
```

On rate limit exceeded (HTTP 429 — RPM)
When the RPM limit is exceeded, the response includes all rate limit headers plus Retry-After:
| Header | Example | Description |
|---|---|---|
| `Retry-After` | `12` | Seconds until the oldest request in the window expires, freeing a slot. Minimum value: 1 |
| `X-RateLimit-Limit` | `30` | Your maximum RPM allowance |
| `X-RateLimit-Remaining` | `0` | Always `0` when rate limited |
| `X-RateLimit-Reset` | `1712592012` | Unix timestamp when the oldest request ages out |
Full response example:
```http
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 12
X-RateLimit-Limit: 30
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1712592012
X-Request-Id: 661f9500-f30c-52e5-b827-557766551111
X-API-Version: 2026-04-01

{
  "detail": "Rate limit exceeded"
}
```

**How `Retry-After` is calculated:** The server finds the oldest request in your 60-second window and computes how many seconds remain until it ages out. This is the minimum wait time before a slot opens. The value is always at least 1 second.
On concurrent job limit exceeded (HTTP 429)
When you have too many active generations:
```http
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 60
X-Request-Id: 772a0611-a41d-63f6-c938-668877662222
X-API-Version: 2026-04-01

{
  "detail": "Too many concurrent jobs"
}
```

Note the differences from RPM rate limiting:

- `detail` says `"Too many concurrent jobs"` instead of `"Rate limit exceeded"`
- `Retry-After` is fixed at `60` seconds (not dynamically computed)
- No `X-RateLimit-*` headers (those only apply to RPM)
**How to tell them apart:** Check the `detail` field in the response body.
| `detail` value | Limit type | What to do |
|---|---|---|
| `"Rate limit exceeded"` | RPM | Wait `Retry-After` seconds, then retry |
| `"Too many concurrent jobs"` | Concurrent | Wait for an active generation to complete, then retry |
On GET /v1/generations/{id} (HTTP 200)
The GET endpoint returns standard headers only — no rate limit headers:
```http
HTTP/1.1 200 OK
Content-Type: application/json
X-Request-Id: 883b1722-b52e-74a7-d049-779988773333
X-API-Version: 2026-04-01
```

Understanding X-RateLimit-Remaining
The X-RateLimit-Remaining value tells you how many more requests you can make before hitting the RPM limit. Because the window is sliding, this number can both decrease (as you make requests) and increase (as old requests age out).
Example over time:
```text
12:00:00  POST → 201  X-RateLimit-Remaining: 29  (made 1st request)
12:00:01  POST → 201  X-RateLimit-Remaining: 28  (made 2nd request)
12:00:02  POST → 201  X-RateLimit-Remaining: 27  (made 3rd request)
...no requests for 58 seconds...
12:00:59  POST → 201  X-RateLimit-Remaining: 28  (1st request aged out, made new one)
12:01:00  POST → 201  X-RateLimit-Remaining: 28  (2nd request aged out, made new one)
12:01:01  POST → 201  X-RateLimit-Remaining: 28  (3rd request aged out, made new one)
```

Understanding X-RateLimit-Reset
X-RateLimit-Reset is the Unix timestamp when the current window ends. Since this is a sliding window, this value equals “now + 60 seconds” — it represents when your current requests will start aging out, not a fixed global reset time.
Using it in code:
```python
import time

def seconds_until_reset(headers):
    reset_ts = int(headers.get("X-RateLimit-Reset", 0))
    return max(0, reset_ts - int(time.time()))
```

Understanding Retry-After
Retry-After tells you the minimum number of seconds to wait before a slot opens. It’s calculated as:
```text
Retry-After = (oldest_request_timestamp + 60) - now
```

This is always at least 1 second. After waiting this long, exactly one slot opens. If you need multiple slots, you may need to wait longer.
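The formula translates directly into code. This is a sketch for reasoning about the value, assuming you track request times as Unix timestamps; the function name is illustrative:

```python
def retry_after_seconds(oldest_request_timestamp, now, window=60):
    """Seconds until the oldest request ages out of the window (minimum 1)."""
    return max(1, int((oldest_request_timestamp + window) - now))
```

For example, if the oldest request in the window was made 48 seconds ago, a slot opens in 12 seconds — matching the `Retry-After: 12` example above.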
Concurrent jobs
A generation is considered an “active job” from the moment it’s submitted until it reaches a terminal state (completed or failed). Active jobs are tracked for up to 1 hour, after which they are automatically pruned.
Lifecycle of a concurrent job slot:
```text
POST /generations        → 201           ← Job slot consumed
GET  /generations/{id}   → "queued"      ← Slot still held
GET  /generations/{id}   → "processing"  ← Slot still held
GET  /generations/{id}   → "completed"   ← Slot released (or "failed" → slot released)
```

If all your concurrent slots are occupied, you must wait for at least one generation to reach a terminal state before submitting a new one. The API does not queue excess requests — they are rejected immediately with HTTP 429.
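Waiting for a slot means polling your active jobs until one reaches a terminal state. A minimal sketch, with the status lookup injected as a callable so the example stays self-contained (in practice it would wrap `GET /v1/generations/{id}`; the function names here are hypothetical):

```python
import time

TERMINAL_STATES = {"completed", "failed"}

def wait_for_free_slot(fetch_status, job_ids, poll_interval=5, timeout=300):
    """Poll active jobs until at least one reaches a terminal state.

    `fetch_status` maps a generation ID to its status string.
    Returns the ID of the first job found in a terminal state.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        for job_id in job_ids:
            if fetch_status(job_id) in TERMINAL_STATES:
                return job_id  # this job's slot has been released
        time.sleep(poll_interval)
    raise TimeoutError("no concurrent job slot freed within timeout")
```

Once this returns, one slot is free and a single new submission should succeed; submitting several at once can still trip the limit.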
Fail-open behavior
If the rate limiting infrastructure (Redis) is temporarily unavailable, the API allows requests through rather than blocking them. This is a deliberate design choice — rate limiting is a fairness mechanism, not a security boundary.
You may occasionally observe requests succeeding without rate limit headers when this happens. Do not rely on this behavior.
Retry strategies
Section titled “Retry strategies”Basic: respect Retry-After
The simplest approach — wait exactly as long as the server says:
```python
import time

from luma_agents import RateLimitError

def create_generation(client, **kwargs):
    try:
        return client.generations.create(**kwargs)
    except RateLimitError as e:
        retry_after = int(e.response.headers.get("Retry-After", 5))
        time.sleep(retry_after)
        return client.generations.create(**kwargs)
```

```typescript
import Luma, { RateLimitError } from "luma-agents";

async function createGeneration(client: Luma, params: Luma.GenerationCreateParams) {
  try {
    return await client.generations.create(params);
  } catch (e) {
    if (!(e instanceof RateLimitError)) throw e;
    const retryAfter = parseInt(e.headers?.get("retry-after") ?? "5", 10);
    await new Promise((r) => setTimeout(r, retryAfter * 1000));
    return await client.generations.create(params);
  }
}
```

```go
generation, err := client.Generations.New(ctx, params)
if err != nil {
	var apiErr *lumaagents.Error
	if errors.As(err, &apiErr) && apiErr.StatusCode == 429 {
		retryAfter, _ := strconv.Atoi(apiErr.Response.Header.Get("Retry-After"))
		if retryAfter == 0 {
			retryAfter = 5
		}
		time.Sleep(time.Duration(retryAfter) * time.Second)
		generation, err = client.Generations.New(ctx, params)
	}
}
```

Recommended: exponential backoff with jitter
For production systems handling multiple 429 scenarios:
```python
import random
import time

from luma_agents import RateLimitError

def create_with_backoff(client, max_retries=5, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.generations.create(**kwargs)
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise

            # Prefer Retry-After; fall back to exponential backoff. Always add jitter.
            retry_after = e.response.headers.get("Retry-After")
            base = int(retry_after) if retry_after else 2 ** attempt
            time.sleep(base + random.uniform(0, 1))
```

```typescript
import Luma, { RateLimitError, Generation } from "luma-agents";

async function createWithBackoff(
  client: Luma,
  params: Luma.GenerationCreateParams,
  maxRetries = 5,
): Promise<Generation> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.generations.create(params);
    } catch (e) {
      if (!(e instanceof RateLimitError) || attempt === maxRetries - 1) throw e;

      const retryAfter = e.headers?.get("retry-after");
      const base = retryAfter ? parseInt(retryAfter, 10) * 1000 : 2 ** attempt * 1000;
      await new Promise((r) => setTimeout(r, base + Math.random() * 1000));
    }
  }
  throw new Error("unreachable");
}
```

```go
func createWithBackoff(ctx context.Context, client *lumaagents.Client, params lumaagents.GenerationNewParams, maxRetries int) (*lumaagents.Generation, error) {
	for attempt := 0; attempt < maxRetries; attempt++ {
		generation, err := client.Generations.New(ctx, params)
		if err == nil {
			return generation, nil
		}

		var apiErr *lumaagents.Error
		if !errors.As(err, &apiErr) || apiErr.StatusCode != 429 || attempt == maxRetries-1 {
			return nil, err
		}

		retryAfter := apiErr.Response.Header.Get("Retry-After")
		var wait time.Duration
		if retryAfter != "" {
			secs, _ := strconv.Atoi(retryAfter)
			wait = time.Duration(secs)*time.Second + time.Duration(rand.Intn(1000))*time.Millisecond
		} else {
			wait = time.Duration(1<<attempt)*time.Second + time.Duration(rand.Intn(1000))*time.Millisecond
		}
		time.Sleep(wait)
	}
	return nil, fmt.Errorf("max retries exceeded")
}
```

Advanced: proactive throttling
Don’t wait for 429 — slow down proactively as you approach the limit:
```python
import time

class RateLimitedClient:
    def __init__(self, client):
        self.client = client
        self._remaining = None
        self._reset_at = None

    def create(self, **kwargs):
        # If we know we're close to the limit, wait proactively
        if self._remaining is not None and self._remaining <= 2:
            if self._reset_at is not None:
                wait = max(0, self._reset_at - time.time())
                if wait > 0:
                    time.sleep(wait)

        generation = self.client.generations.create(**kwargs)

        # Update tracking from response headers (SDK-specific)
        # self._remaining = int(response.headers["X-RateLimit-Remaining"])
        # self._reset_at = int(response.headers["X-RateLimit-Reset"])

        return generation
```

Handling concurrent job limits
For the concurrent job limit, the strategy is different — you need to wait for existing jobs to complete, not just wait a fixed duration:
```python
import time

from luma_agents import RateLimitError

def create_respecting_concurrency(client, max_wait=300, **kwargs):
    deadline = time.time() + max_wait

    while time.time() < deadline:
        try:
            return client.generations.create(**kwargs)
        except RateLimitError as e:
            detail = (e.body or {}).get("detail", "") if isinstance(e.body, dict) else ""

            if "concurrent" in detail.lower():
                # Wait for an in-flight job to finish before retrying
                time.sleep(5)
            else:
                # RPM limit — use Retry-After
                retry_after = int(e.response.headers.get("Retry-After", 5))
                time.sleep(retry_after)

    raise TimeoutError("Could not submit generation within timeout")
```

Using X-Request-Id for tracing
Sending your own request ID
Include X-Request-Id in your request to trace it through your system:
```shell
curl -X POST https://agents.lumalabs.ai/v1/generations \
  -H "Authorization: Bearer $LUMA_AGENTS_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Request-Id: myapp-user42-batch7-req003" \
  -d '{"prompt": "A sunset over the ocean"}'
```

The server echoes your value back in the response:

```http
X-Request-Id: myapp-user42-batch7-req003
```
If you don’t send X-Request-Id, the server generates a UUID:
```http
X-Request-Id: 550e8400-e29b-41d4-a716-446655440000
```

Best practices for request tracing
- **Generate unique IDs per attempt** — If you retry a failed request, use a new request ID so you can distinguish attempts in logs
- **Include context in the ID** — Encode user ID, batch ID, or job type for easier debugging: `myapp-user42-batch7-req003`
- **Log the response ID** — Always capture the `X-Request-Id` from responses for support escalation
- **Quote it in support requests** — When contacting support about a failed request, include the `X-Request-Id` to enable fast lookup
X-API-Version header
The X-API-Version header is returned on all responses to /v1/* endpoints. It indicates which API version processed your request.
| Current value | Meaning |
|---|---|
| `2026-04-01` | Initial API version |
This header is informational. You do not need to send a version header in your requests — the API currently has a single version.
Header summary by scenario
Section titled “Header summary by scenario”POST /v1/generations — Success (201)
| Header | Present | Example |
|---|---|---|
| `X-Request-Id` | Always | `550e8400-...` |
| `X-API-Version` | Always | `2026-04-01` |
| `X-RateLimit-Limit` | Always | `30` |
| `X-RateLimit-Remaining` | Always | `17` |
| `X-RateLimit-Reset` | Always | `1712592060` |
POST /v1/generations — RPM Rate Limited (429)
| Header | Present | Example |
|---|---|---|
| `X-Request-Id` | Always | `550e8400-...` |
| `X-API-Version` | Always | `2026-04-01` |
| `Retry-After` | Always | `12` |
| `X-RateLimit-Limit` | Always | `30` |
| `X-RateLimit-Remaining` | Always | `0` |
| `X-RateLimit-Reset` | Always | `1712592012` |
POST /v1/generations — Concurrent Job Limited (429)
| Header | Present | Example |
|---|---|---|
| `X-Request-Id` | Always | `550e8400-...` |
| `X-API-Version` | Always | `2026-04-01` |
| `Retry-After` | Always | `60` |
| `X-RateLimit-Limit` | No | — |
| `X-RateLimit-Remaining` | No | — |
| `X-RateLimit-Reset` | No | — |
POST /v1/generations — Other errors (400, 401, 402, 403, 413, 422, 502, 503)
| Header | Present | Example |
|---|---|---|
| `X-Request-Id` | Always | `550e8400-...` |
| `X-API-Version` | Always | `2026-04-01` |
| `X-RateLimit-*` | No | — |
| `Retry-After` | No | — |
GET /v1/generations/{id} — All responses (200, 401, 404)
| Header | Present | Example |
|---|---|---|
| `X-Request-Id` | Always | `550e8400-...` |
| `X-API-Version` | Always | `2026-04-01` |
| `X-RateLimit-*` | No | — |
| `Retry-After` | No | — |