When agents hit your 429 without reset_at, things get bad fast
Without a machine-readable reset timestamp, exponential backoff with jitter across many agents clusters into retry storms. The fix: reset_at in JSON, Retry-After in headers, and contracts agents can depend on.
2026-04-28 · by Guilherme Secca

Stable error contracts, reset_at, and Retry-After

A 429 without a machine-readable reset timestamp doesn't just slow agents down. Under the right conditions it turns them into a coordinated attack on your own API.

Think through what happens: 50 agents running generated code against the same SDK all hit a rate limit at roughly the same time. None of them gets a reset_at. So each one does the sensible thing and starts exponential backoff with jitter. Except "jitter" across agents running identical generated code tends to cluster. They back off, then they all retry at roughly the same time, then they all hit the limit again. The thundering herd is your own clients. You don't need pathological load for this; you just need enough agents running the same generated retry logic.

This is what we designed around when building Truval: API infrastructure built to be called by agents. Email verification is the first surface; more APIs will follow.

The fix was straightforward:

HTTP 429 (rate limit)

    {
      "error": "rate_limit_exceeded",
      "message": "Rate limit of 10 req/min exceeded.",
      "action": "Wait until reset_at (plus a small cushion) before retrying.",
      "limit": 10,
      "window": "1m",
      "reset_at": "2026-04-24T12:00:00.000Z",
      "docs": "https://docs.truval.dev/api/email-verify#rate-limits"
    }

We also send Retry-After as an HTTP header, expressing the same idea as delta-seconds until reset_at, for clients that read it there.

One edge case worth knowing: reset_at is a raw server-side timestamp with no built-in buffer. If a client's clock runs slightly ahead of ours, an early retry might land before the window resets and get another 429. The SDK adds 50ms; for generic clients without tight loops, 1-2s is a safe conservative default. Either way, sleep past reset_at rather than exactly on it, which is why the action field says "plus a small cushion."
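As a rough illustration of the client side of this contract, here is a minimal Python sketch that turns a 429 response into a sleep duration. The names `parse_reset_at`, `seconds_until_retry`, and `CUSHION_SECONDS` are hypothetical and not part of the Truval SDK; the 1-second cushion is an assumption taken from the 1-2s range suggested for generic clients, and the Retry-After fallback handles only the delta-seconds form described here.

```python
from datetime import datetime, timezone
from typing import Mapping, Optional

# Assumption: 1s cushion, within the 1-2s range suggested for generic clients.
CUSHION_SECONDS = 1.0


def parse_reset_at(value: str) -> datetime:
    """Parse a timestamp such as 2026-04-24T12:00:00.000Z into an aware datetime."""
    # Older Python versions reject a trailing "Z", so normalise it to an offset.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    # Assumption: a timestamp without an offset is treated as UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def seconds_until_retry(
    body: Mapping[str, object],
    headers: Mapping[str, str],
    now: Optional[datetime] = None,
    cushion: float = CUSHION_SECONDS,
) -> float:
    """Return how long to sleep before retrying once after a 429."""
    now = now or datetime.now(timezone.utc)

    reset_at = body.get("reset_at")
    if isinstance(reset_at, str):
        # Preferred path: the machine-readable timestamp in the JSON body.
        wait = (parse_reset_at(reset_at) - now).total_seconds()
    else:
        # Fallback: Retry-After as delta-seconds (header names are case-insensitive).
        retry_after = next(
            (v for k, v in headers.items() if k.lower() == "retry-after"), None
        )
        if retry_after is None:
            raise ValueError("429 response carries neither reset_at nor Retry-After")
        wait = float(retry_after)

    # Never sleep a negative amount, and always land past reset_at, not on it.
    return max(0.0, wait) + cushion
```

A caller would pass the result to `time.sleep` and then retry once. Because every client derives its delay from the same server-side timestamp rather than from its own backoff schedule, no retry can land before the window has actually reset.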
Now retry logic is just: sleep until reset_at (plus a small cushion), then retry once. No backoff math. No jitter. No storm.

That one change is the most important thing I’d tell someone designing an API that agents will call. Everything else matters, but it’s downstream of “don’t make agents guess.”

Agents depend on contracts, not docs

When a developer integrates an API they read the docs, try things, adjust. The feedback loop is interactive.

Agents don’t work that way. They ingest a machine-readable surface once, pick a path, and repeat it across every generated call forever. If anything in that path is ambiguous, they’ll invent a plausible answer. Wrong base URL, wrong auth header, a missing retry on a transient 503 that gets treated as permanent.

The contract an agent actually depends on is: base URL, auth shape, OpenAPI spec, and a small stable set of error codes with typed fields. Not the prose. Not the getting started guide.

For Truval this meant publishing three stable URLs and treating them like a public interface:

https://api.truval.dev (the base, not moving)
GET https://api.truval.dev/openapi.json (source of truth for codegen)
POST https://mcp.truval.dev/mcp (tool surface for MCP-compatible hosts)

The OpenAPI spec is a compatibility boundary, same as a library interface. Renaming an operation casually is a breaking change. Adding a field is fine. Removing or renaming one is not.

The other errors worth getting right

Monthly quota hits need the same treatment as rate limits: reset_at, plus used and limit so…
This excerpt is published under fair use for community discussion. Read the full article at truval.dev.