Rate Limiting

Rate limiting restricts how many requests a client can make to an API within a given time period, protecting the service from overload and abuse.

Also known as: throttling, request quotas

What is rate limiting?

Rate limiting caps the number of requests an API will accept from a given client — identified by API key, IP address, or account — within a time window: per second, per minute, per day. Requests beyond the cap are rejected, typically with HTTP 429 Too Many Requests and a Retry-After header indicating when to try again.
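On the client side, handling a 429 mostly comes down to reading Retry-After and waiting. The sketch below is one minimal way to do that in Python. The helper names (retry_after_seconds, call_with_retry) and the send callable, which is assumed to return a (status, headers) pair, are hypothetical and not part of any particular HTTP library.

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After value (delay in seconds or HTTP-date) to seconds.

    Returns None when the value cannot be parsed.
    """
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC (an assumption).
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def call_with_retry(send, max_attempts=3, sleep=time.sleep, default_wait=1.0):
    """Call send() until it stops returning 429 or attempts run out.

    `send` is a hypothetical callable returning (status, headers).
    """
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status
        if attempt + 1 < max_attempts:
            wait = retry_after_seconds(headers.get("Retry-After", ""))
            sleep(default_wait if wait is None else wait)
    return 429
```

Passing sleep as a parameter keeps the retry loop testable without real waiting. A production client would usually also cap the total wait and add jitter.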

Rate limiting serves three purposes: protecting backends from overload, containing the blast radius of buggy or malicious clients, and enforcing the boundaries between pricing tiers.

How rate limits are implemented

The most common algorithm is the token bucket: each client has a bucket that holds up to a maximum number of tokens and refills at a steady rate, and each request spends one token. Short bursts are allowed while tokens remain, but the long-run average cannot exceed the refill rate. Alternatives include fixed windows (simple, but a client can squeeze close to twice the cap into a short span that straddles a window boundary) and sliding windows (smoother, slightly costlier to track).
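A token bucket fits in a few lines. The following is a minimal single-process sketch, not a production limiter: the class name, the one-token-per-request cost, and the injectable clock are illustrative assumptions, and a real service would keep one bucket per client in shared storage.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if enough are available."""
        now = self.clock()
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Usage with a fake clock: capacity 3, refilling 1 token per second.
fake_time = [0.0]
bucket = TokenBucket(rate=1, capacity=3, clock=lambda: fake_time[0])
burst = [bucket.allow() for _ in range(4)]   # [True, True, True, False]
fake_time[0] = 2.0                           # two seconds later, two tokens are back
later = [bucket.allow() for _ in range(3)]   # [True, True, False]
```

The capacity sets how large a burst can be and the rate sets the sustained average, which is why the two are usually tuned separately.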

Well-behaved APIs advertise their limits in response headers — remaining quota, reset time — so clients can pace themselves instead of discovering the wall by hitting it.
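A client can turn those headers into a pacing delay. Header names are not standardized, so the sketch below assumes the common X-RateLimit-Remaining and X-RateLimit-Reset convention, with the reset given as a Unix timestamp; some APIs send a delta in seconds or use different names, so check the API's documentation. The function name is hypothetical.

```python
def pacing_delay(headers, now):
    """Seconds to wait before the next request so the remaining quota
    is spread evenly over the time left in the current window.

    Assumes X-RateLimit-Remaining is an integer count and
    X-RateLimit-Reset is a Unix timestamp (both are assumptions).
    """
    remaining = int(headers.get("X-RateLimit-Remaining", "0"))
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    time_left = max(0.0, reset_at - now)
    if remaining <= 0:
        # Quota exhausted: wait for the window to reset.
        return time_left
    return time_left / remaining


# 10 requests left and 60 seconds until reset: one request every 6 seconds.
delay = pacing_delay(
    {"X-RateLimit-Remaining": "10", "X-RateLimit-Reset": "1060"}, now=1000.0
)
```

Even spacing is the most conservative choice. A client that needs bursts can spend the remaining quota faster and accept waiting until the reset.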

Rate limiting and AI agents

Agent traffic is bursty by nature: an agent decomposing a research task may fire twenty tool calls in ten seconds, then go quiet for an hour. Free tiers tuned for human-paced usage feel suffocating to that pattern — the agent burns through a daily quota in one task and then stalls, or worse, fails mid-task with a 429 it cannot negotiate around.
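To make the mismatch concrete, the sketch below replays the burst described above (twenty calls in ten seconds) against a simple fixed-window limiter. The limit of 10 requests per minute is a hypothetical figure chosen for illustration, not a number from any real service.

```python
class FixedWindowLimiter:
    """Fixed-window counter: at most `limit` requests per window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_index = None
        self.count = 0

    def allow(self, timestamp):
        index = int(timestamp // self.window_seconds)
        if index != self.window_index:
            # A new window has started: reset the counter.
            self.window_index = index
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


def replay(limiter, timestamps):
    """Return (accepted, rejected) counts for a sequence of request times."""
    accepted = sum(1 for t in timestamps if limiter.allow(t))
    return accepted, len(timestamps) - accepted


# Twenty calls spread over ten seconds, one every half second.
agent_burst = [i * 0.5 for i in range(20)]
result = replay(FixedWindowLimiter(limit=10, window_seconds=60), agent_burst)
# result == (10, 10): half the burst is rejected
```

Under this assumed limit the agent loses half of its calls even though its average rate over the hour is tiny.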

For MCP servers this is a recurring complaint: the quota that protects the operator from abuse also blocks the legitimate heavy use that would make the server genuinely valuable.

Quotas vs prices as the throttle

Rate limits and per-call prices are both ways of rationing access; they differ in who decides the cutoff. A quota is an arbitrary line set by the operator — generous for some workloads, crippling for others. A price lets the caller decide: an agent paying $0.01 per call via x402 can make exactly as many calls as the work justifies, throttled only by its own wallet budget. The operator's revenue rises with load rather than being capped by their own free tier.
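The caller-side throttle in a priced model is just budget accounting. The sketch below models only that accounting, with no payment protocol involved; the CallBudget class and the $0.25 budget are hypothetical, and Decimal is used so that repeated one-cent charges do not accumulate floating-point error.

```python
from decimal import Decimal


class CallBudget:
    """Tracks how much an agent may still spend on paid calls."""

    def __init__(self, budget):
        self.remaining = Decimal(budget)

    def try_spend(self, price):
        """Reserve `price` from the budget; return False if it does not fit."""
        price = Decimal(price)
        if price > self.remaining:
            return False
        self.remaining -= price
        return True


# A $0.25 budget at $0.01 per call allows exactly 25 calls.
wallet = CallBudget("0.25")
calls_made = 0
while wallet.try_spend("0.01"):
    calls_made += 1
```

The cutoff here is chosen by the caller: raising the budget raises the ceiling, with no change needed on the operator's side.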

The two are complements, not substitutes. A monetized server still needs operational limits against runaway clients and denial-of-service traffic — but the limit can sit far above normal usage, because payment, not the quota, is doing the economic rationing. The x402 design helps here too: the payment is verified before the handler runs, so unpaid floods never reach expensive code paths.
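One plausible way to layer the two checks is a wrapper that runs a cheap operational limit first, then payment verification, and only then the handler. This is a sketch under stated assumptions: guarded, verify_payment, and allow_request are hypothetical names, the ordering of the first two checks is a design choice, and real x402 verification is not shown here.

```python
def guarded(handler, verify_payment, allow_request):
    """Wrap `handler` so it runs only for requests that pass both checks.

    `verify_payment` and `allow_request` are hypothetical callables that
    take the request and return True or False.
    """

    def wrapped(request):
        if not allow_request(request):
            return 429, "rate limit exceeded"
        if not verify_payment(request):
            return 402, "payment required"
        # Only paid, in-limit requests reach the expensive code path.
        return 200, handler(request)

    return wrapped


handled = []


def expensive_tool(request):
    handled.append(request)
    return "result"


endpoint = guarded(
    expensive_tool,
    verify_payment=lambda request: request.get("paid", False),
    allow_request=lambda request: True,  # generous operational limit
)
unpaid = endpoint({"paid": False})   # (402, "payment required")
paid = endpoint({"paid": True})      # (200, "result")
```

Because both checks return before the handler is called, an unpaid flood costs the server only the price of the checks themselves.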

What to check on a listing

When evaluating an MCP server or API for agent workloads, look at the rate limits alongside the pricing model: a low hard cap matters more than a headline price for bursty workloads. Listings in the Loomal Index surface live-probed tool lists and pricing, and a per-call x402 listing generally signals that the budget — not a quota — is the practical ceiling. Check the server's own documentation for residual operational limits.
