How to Handle Rate Limits in an MCP Server, or Price Them Away

Agents retry harder than humans ever will. Classic rate limiting keeps your MCP server alive — and per-call pricing often does the same job with less code and better incentives.

An MCP server exposed to agents faces a traffic pattern humans never produce: a single client can fire dozens of tool calls per second, retry on every transient error, and loop on the same query when a prompt goes sideways. Without limits, one misbehaving agent can take your server down or run up your upstream API bill.

You have two levers. The traditional one is rejecting requests past a threshold. The newer one is making each request cost money, which turns abuse from your problem into the caller's. Most production servers end up using both.

Know which transport you're limiting

A stdio MCP server runs as a child process of one client, so 'rate limiting' there is really resource limiting: debounce expensive tools, cache repeated calls, and cap concurrency inside the process. No capacity is shared between clients, so the only things left to protect are the host machine and any upstream APIs the tools call.
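A minimal sketch of two of those in-process controls, a short-lived result cache and a concurrency cap. The class names `TtlCache` and `ConcurrencyLimiter` are hypothetical helpers written for this example, not part of any MCP SDK, and the limits are placeholders to tune per tool.

```typescript
// Hypothetical helper: remembers recent results so repeated identical calls
// do not hit the expensive path again until the entry expires.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  get(key: string, now: number = Date.now()): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= now) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}

// Hypothetical helper: at most `maxConcurrent` tasks run at once; the rest
// wait in FIFO order for a free slot.
class ConcurrencyLimiter {
  private active = 0;
  private readonly waiting: Array<() => void> = [];
  private readonly maxConcurrent: number;

  constructor(maxConcurrent: number) {
    this.maxConcurrent = maxConcurrent;
  }

  get activeCount(): number {
    return this.active;
  }

  get queuedCount(): number {
    return this.waiting.length;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      // No free slot: park until a finishing task hands its slot over.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.active += 1;
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) next(); // pass the slot on; the active count is unchanged
      else this.active -= 1;
    }
  }
}
```

A tool handler would check the cache first and wrap the expensive work in `limiter.run(...)`, so a looping agent gets cached answers and a fan-out burst queues instead of exhausting the host.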

Remote servers over Streamable HTTP are where real rate limiting lives. Many clients share one deployment, requests arrive as ordinary HTTP POSTs, and you can apply the same middleware you would use on any API — keyed by session ID, API key, or paying wallet rather than by IP, which NAT and cloud egress make unreliable.
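One way to express that keying preference is a small function that picks the strongest identity available. This is a sketch under assumptions: the `CallerHints` shape and the `callerKey` name are hypothetical, and it presumes the wallet address, API key, and session ID were already extracted and verified elsewhere.

```typescript
// Hypothetical input: whatever identities earlier middleware could establish.
interface CallerHints {
  walletAddress?: string; // payer address from a verified payment, if any
  apiKey?: string;
  sessionId?: string; // value of the Mcp-Session-Id header
  ip?: string;
}

// Prefer identities the caller cannot cheaply rotate; fall back to IP last.
function callerKey(hints: CallerHints): string {
  if (hints.walletAddress) return `wallet:${hints.walletAddress.toLowerCase()}`;
  if (hints.apiKey) return `key:${hints.apiKey}`;
  if (hints.sessionId) return `session:${hints.sessionId}`;
  return `ip:${hints.ip ?? "unknown"}`;
}
```

Prefixing the key with its kind keeps a session ID from ever colliding with an API key or address that happens to share the same string.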

A token bucket at the MCP endpoint

The workhorse is a token bucket per caller: each identity gets a refill rate and a burst allowance, and requests past that get a clean error the client can back off from. Keep the rejection inside the JSON-RPC error channel where you can — agents handle a structured tool error far more gracefully than a dropped connection.

rate-limit.ts

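import express from "express";

const app = express();

// One bucket per caller identity; tokens refill continuously up to `burst`.
// Note: entries are never evicted here, so prune idle keys in production.
const buckets = new Map<string, { tokens: number; last: number }>();

function allow(key: string, ratePerSec = 5, burst = 10): boolean {
  const now = Date.now();
  const b = buckets.get(key) ?? { tokens: burst, last: now };
  // Credit the tokens earned since the last request, capped at the burst size.
  b.tokens = Math.min(burst, b.tokens + ((now - b.last) / 1000) * ratePerSec);
  b.last = now;
  buckets.set(key, b); // store before any early return so every key is tracked
  if (b.tokens < 1) return false;
  b.tokens -= 1;
  return true;
}

app.post("/mcp", (req, res, next) => {
  // Prefer the MCP session ID; the IP address is only a fallback.
  const key = req.header("mcp-session-id") ?? req.ip ?? "unknown";
  if (!allow(key)) {
    // Send the standard Retry-After header as well as the hint in the body.
    res.status(429).set("Retry-After", "1").json({ error: "rate_limited", retryAfter: 1 });
    return;
  }
  next();
});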
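The middleware above answers with an HTTP 429 and a plain JSON body. To keep the rejection inside the JSON-RPC channel instead, the response can take one of the two shapes sketched below. The function names are hypothetical, and the error code -32000 is an assumed choice from the range JSON-RPC 2.0 reserves for implementation-defined server errors, not a code MCP assigns to rate limiting.

```typescript
type JsonRpcId = string | number | null;

// Protocol-level rejection: a JSON-RPC error object with a retry hint in `data`.
function rateLimitError(id: JsonRpcId, retryAfterSec: number) {
  return {
    jsonrpc: "2.0" as const,
    id,
    error: {
      code: -32000, // assumed, implementation-defined server error code
      message: "Rate limit exceeded",
      data: { retryAfterSec },
    },
  };
}

// Tool-level rejection: a successful JSON-RPC response whose tool result is
// flagged with isError, so the model sees the failure as tool output.
function rateLimitToolResult(id: JsonRpcId, retryAfterSec: number) {
  return {
    jsonrpc: "2.0" as const,
    id,
    result: {
      isError: true,
      content: [
        { type: "text" as const, text: `Rate limit exceeded. Retry in ${retryAfterSec}s.` },
      ],
    },
  };
}
```

The tool-level form is usually the gentler one for agents, since the model reads the message and can decide to wait, while the protocol-level form suits rejections that happen before a tool call is even dispatched.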

Why hard limits fit agents badly

Rate limits encode a guess: 'no legitimate caller needs more than N per minute.' Agents break the guess constantly. A research agent fanning out fifty parallel lookups is legitimate; a stuck loop making one call per second is not. Thresholds cannot tell them apart, so you end up throttling your best users to defend against your worst.

Limits also create support burden — tiered allowances, quota-increase requests, keys passed around to dodge caps. All of that machinery exists to ration something you would happily sell.

Pricing as a rate limit

Put an x402 price on the tool call and the economics do the throttling. Even at $0.01 per call, the minimum on Loomal, a runaway loop making one call per second costs its operator $0.60 a minute and $36 an hour, while the fan-out agent doing fifty useful calls just pays fifty cents and keeps going. The agent pays before your handler runs, settlement is USDC on Base in about two seconds, and there are no chargebacks to dispute later.
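The arithmetic behind that comparison, as a small sketch. It assumes the $0.01 price from the text and a stuck loop running at one call per second; the function names are illustrative, and amounts are kept in whole cents to avoid floating-point drift.

```typescript
const PRICE_CENTS_PER_CALL = 1; // $0.01 per call

// Cost of a fixed number of calls, in cents.
function costCents(calls: number, priceCents: number = PRICE_CENTS_PER_CALL): number {
  return calls * priceCents;
}

// Cost of a loop that keeps calling at a steady rate for a given duration.
function loopCostCents(
  callsPerSec: number,
  seconds: number,
  priceCents: number = PRICE_CENTS_PER_CALL,
): number {
  return Math.round(callsPerSec * seconds) * priceCents;
}

function formatUsd(cents: number): string {
  return `$${(cents / 100).toFixed(2)}`;
}

console.log(formatUsd(costCents(50))); // fan-out of fifty calls: $0.50
console.log(formatUsd(loopCostCents(1, 60))); // stuck loop, one minute: $0.60
console.log(formatUsd(loopCostCents(1, 3600))); // stuck loop, one hour: $36.00
```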

This inverts the incentive: callers self-limit because waste hits their wallet, and heavy use stops being abuse and starts being revenue.

Use both, in layers

Pricing does not protect you from a flood of unpaid 402 handshakes or from genuinely malicious traffic, so keep a generous safety ceiling — a token bucket set well above any plausible paid usage — in front of the payment layer. The bucket protects the process; the price allocates the capacity.
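The ordering of the two layers can be sketched as a single admission decision. Everything here is an assumption made for illustration: the ceiling values, the `admit` and `underCeiling` names, and the `paymentVerified` flag, which stands in for whatever x402 verification the server actually performs.

```typescript
type Decision = "reject_rate_limited" | "payment_required" | "run_handler";

interface Bucket {
  tokens: number;
  last: number;
}

// Hypothetical ceiling, set far above plausible paid usage.
const CEILING_RATE_PER_SEC = 50;
const CEILING_BURST = 200;
const ceiling = new Map<string, Bucket>();

// Same token-bucket idea as the per-caller limiter, with generous numbers.
function underCeiling(key: string, now: number): boolean {
  const b = ceiling.get(key) ?? { tokens: CEILING_BURST, last: now };
  b.tokens = Math.min(CEILING_BURST, b.tokens + ((now - b.last) / 1000) * CEILING_RATE_PER_SEC);
  b.last = now;
  ceiling.set(key, b);
  if (b.tokens < 1) return false;
  b.tokens -= 1;
  return true;
}

// Layer 1: the ceiling runs first, so unpaid 402 handshakes also consume
// tokens and a flood of them is capped. Layer 2: only paid calls run.
function admit(key: string, paymentVerified: boolean, now: number = Date.now()): Decision {
  if (!underCeiling(key, now)) return "reject_rate_limited";
  if (!paymentVerified) return "payment_required";
  return "run_handler";
}
```

Because the ceiling is checked before payment, a caller who never pays still cannot exceed it, and a paying caller only meets it at volumes well beyond normal use.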

If your server is listed on Loomal, document both in the listing: agents and their operators plan around a per-call price plus a stated burst ceiling far better than around an unstated limit they discover via 429s.

FAQ

Should I rate limit by IP address?

Only as a last resort. Agent traffic frequently arrives from shared cloud egress IPs, so IP-based limits punish unrelated callers and barely slow a distributed one. Prefer the MCP session ID, an auth identity, or the paying wallet address from the x402 payment.

Can per-call pricing really replace rate limits?

It replaces the rationing function — deciding who gets capacity — because cost makes callers self-limit. It does not replace infrastructure protection, so keep a high safety ceiling against unpaid or malicious request floods. Price for allocation, limit for survival.

How do agents react when they hit a 429?

Well-built clients back off and retry; naive ones retry immediately and make things worse. Return a structured error with a Retry-After hint, and prefer failing inside the JSON-RPC response over dropping connections, since agents recover from tool errors more cleanly than from transport failures.
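On the client side, a well-behaved retry policy might look like the sketch below: exponential backoff with full jitter that never retries sooner than the server's hint. The function names and the base and cap values are assumptions for illustration, and only the delay-in-seconds form of Retry-After is parsed (the HTTP-date form is ignored).

```typescript
// Parse a Retry-After header given as a number of seconds; returns milliseconds.
function parseRetryAfterMs(header: string | null | undefined): number | undefined {
  if (!header) return undefined;
  const trimmed = header.trim();
  if (trimmed === "") return undefined;
  const secs = Number(trimmed);
  return Number.isFinite(secs) && secs >= 0 ? secs * 1000 : undefined;
}

// Delay before retry number `attempt` (0-based). `random` is injectable so
// the function can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  retryAfterMs?: number,
  random: () => number = Math.random,
): number {
  const BASE_MS = 500; // assumed starting window
  const CAP_MS = 30_000; // assumed upper bound on the window
  const windowMs = Math.min(CAP_MS, BASE_MS * 2 ** attempt);
  const jittered = Math.floor(random() * windowMs); // full jitter in [0, window)
  return Math.max(jittered, retryAfterMs ?? 0); // never undercut the server's hint
}
```

Jitter matters most when many agents hit the limit together: without it they all retry at the same instant and recreate the spike that triggered the 429.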

Where does Loomal fit into this?

Loomal gives your MCP server the payment layer that makes price-based limiting possible: a listing with a per-call price from $0.01, x402 settlement in USDC on Base, and signed receipts. The fee is 5% on settled transactions, currently waived.

