---
name: api-rate-limiting-throttling
description: Design and implement API rate limiting, throttling, and request governance strategies to protect backend systems from abuse, ensure fair usage, manage capacity, and maintain service-level quality. Use when configuring rate limiting policies, implementing API quotas, setting up throttling algorithms, managing API abuse, designing tiered API access plans, or protecting APIs from DDoS and scraping. Triggers on phrases like "rate limiting", "API throttling", "request quota", "API abuse", "rate limit policy", "throttle algorithm", "token bucket", "sliding window", "API governance", "API quota management", "API tier plan", "429 Too Many Requests".
---

# API Rate Limiting & Throttling

Protect APIs from abuse, ensure fair usage across consumers, manage backend capacity, and maintain service quality through comprehensive rate limiting, throttling, and request governance strategies.

## Workflow

1. Define rate limiting requirements per API: assess endpoint criticality, expected traffic volume, backend capacity, fairness requirements, and business tier models.
2. Select rate limiting algorithms: fixed window, sliding window, token bucket, leaky bucket, or adaptive; choose based on use case (burst tolerance, precision, distributed environment).
3. Implement rate limiting infrastructure: API gateway layer (Kong, Apigee, AWS API Gateway), service mesh layer (Istio, Linkerd), application layer (middleware), or distributed store (Redis-based).
4. Design tiered rate limit plans: free tier, basic, professional, enterprise; align with pricing and SLA commitments; document quotas clearly.
5. Configure rate limit headers and responses: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset; return 429 Too Many Requests with a Retry-After header when the limit is exceeded.
6. Implement throttling strategies: gradual degradation (not hard cutoff); priority queuing for premium tiers; adaptive throttling based on backend health.
7. Set up monitoring and alerting: track rate limit hit rates, throttled request volumes, abuse patterns, tier plan utilization; alert on anomalies.
8. Handle rate limit exceptions: whitelist trusted IPs/partners; burst allowance for legitimate spikes; dynamic limit adjustment for planned campaigns.
9. Communicate rate limits to API consumers: developer documentation, SDK integration (retry logic with exponential backoff), dashboard showing remaining quota.
10. Review and adjust quarterly: analyze rate limit effectiveness, adjust limits based on traffic patterns, update tier plans, address consumer feedback.

## Rate Limiting Algorithms

```
RATE LIMITING ALGORITHMS COMPARISON
=====================================

FIXED WINDOW COUNTER:

  Mechanism:
    → Divide time into fixed windows (e.g., 1-minute, 1-hour, 1-day)
    → Count requests within each window
    → Reject when count exceeds limit
    → Counter resets at window boundary

  Example: 100 requests per minute
    → Window: 10:00:00 to 10:00:59
    → Requests 1-100: allowed
    → Request 101+: rejected (429)
    → At 10:01:00: counter resets to 0

  Advantages:
    → Simplest algorithm to implement
    → Low memory usage (one counter per client per window)
    → Easy to understand and explain to API consumers

  Disadvantages:
    → Boundary burst problem: client can send 2x limit at window boundary
       (100 at 9:59:59 + 100 at 10:00:01 = 200 in 2 seconds)
    → Not suitable for APIs requiring smooth traffic distribution
    → Precision limited to window granularity

  Best for: Daily/monthly quotas; simple usage tracking; non-critical APIs
  Implementation: Redis INCR + EXPIRE; in-memory counter with TTL

SLIDING LOG WINDOW:

  Mechanism:
    → Record timestamp of every request
    → Count requests within the sliding time window
    → Remove timestamps older than window duration
    → Reject when count exceeds limit

  Example: 100 requests per minute (sliding)
    → At 10:00:30: check requests since 9:59:30
    → Exactly 60-second window sliding forward continuously
    → No boundary burst problem

  Advantages:
    → Precise enforcement (exact time window)
    → No boundary burst vulnerability
    → Fair enforcement across time boundaries

  Disadvantages:
    → Higher memory usage (store timestamp per request)
    → More complex implementation
    → Performance overhead for high-QPS APIs (timestamp storage and cleanup)

  Best for: APIs where precision matters; per-second rate limits; abuse prevention
  Implementation: Redis sorted set (ZADD with timestamp score; ZREMRANGEBYSCORE for old entries)

TOKEN BUCKET:

  Mechanism:
    → Bucket holds tokens (up to max capacity)
    → Tokens added at fixed rate (e.g., 10 tokens/second)
    → Each request consumes one or more tokens
    → Request rejected if insufficient tokens
    → Bucket never exceeds max capacity

  Example: 10 tokens/sec, bucket size 100
    → Bucket starts full (100 tokens)
    → Burst of 100 requests: all allowed (bucket drains to 0)
    → Next request: must wait for token refill (100ms for 1 token)
    → Sustained rate: 10 requests/second after burst consumed

  Advantages:
    → Allows controlled bursting (bucket full = burst capacity)
    → Smooth sustained rate enforcement
    → Flexible: different request types can consume different token counts
    → Widely adopted (e.g., AWS API Gateway throttling is based on a token bucket)

  Disadvantages:
    → Burst allowance may overwhelm backend if not sized correctly
    → Requires continuous token refill mechanism (timer or calculated)
    → More complex than fixed window

  Best for: General API rate limiting; APIs needing burst tolerance; tiered rate limiting
  Implementation: Redis (store token count + last refill timestamp); calculate tokens on each request

LEAKY BUCKET:

  Mechanism:
    → Requests enter a queue (bucket) at variable rate
    → Requests processed (leak) at fixed rate
    → Queue overflow = request rejected
    → Output rate is constant regardless of input rate

  Example: Process 10 requests/sec, queue size 50
    → 100 requests arrive in 1 second
    → Bucket admits 50: 10 processed in the first second; 40 remain queued; 50 rejected (overflow)
    → Queue drains at 10/sec over next 4 seconds

  Advantages:
    → Smooth, constant output rate (protects backend from traffic spikes)
    → Simple queue-based model
    → Good for downstream rate protection

  Disadvantages:
    → No burst allowance (even if backend has spare capacity)
    → Increased latency (requests wait in queue)
    → Queue management overhead in distributed systems

  Best for: Backend protection; database query rate limiting; smoothing traffic to downstream services
  Implementation: Queue-based processing; fixed-rate worker; reject on queue full

ADAPTIVE / DYNAMIC RATE LIMITING:

  Mechanism:
    → Rate limits adjusted based on real-time backend health metrics
    → Reduce limits when backend under stress (high CPU, latency, errors)
    → Increase limits when backend has spare capacity
    → Optionally, machine learning models predict suitable rate limits from historical traffic

  Example: Backend CPU at 80% → rate limits reduced by 30%
          Backend CPU at 40% → rate limits restored to normal

  Advantages:
    → Protects backend from overload dynamically
    → Maximizes throughput during low-utilization periods
    → Responds to real-time conditions

  Disadvantages:
    → Complex to implement and tune
    → Unpredictable limits for API consumers (limits change dynamically)
    → Requires comprehensive backend monitoring
    → Risk of limit flapping (oscillation)

  Best for: APIs with variable backend capacity; auto-scaling environments; cost-optimized cloud deployments
  Implementation: Real-time metrics → rate limit calculator → dynamic config update (consul, etcd, API gateway)

ALGORITHM SELECTION GUIDE:

  Use Case                          | Recommended Algorithm
  ──────────────────────────────────|─────────────────────
  General API rate limiting         | Token Bucket
  Simple daily/monthly quotas       | Fixed Window
  Abuse prevention (precision)      | Sliding Log Window
  Backend protection (smoothing)    | Leaky Bucket
  Variable capacity environments    | Adaptive
  Distributed rate limiting         | Token Bucket + Redis
  Per-user fair queuing             | Sliding Log + Priority Queue
```
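
The token bucket described above can be sketched in a few lines. This is a minimal single-process illustration, not a production limiter: the class name `TokenBucket`, the `allow` method, and the injectable `clock` parameter are assumptions made for the example. Refill is calculated lazily on each request instead of by a timer, which matches the "calculate tokens on each request" approach mentioned in the comparison.

```python
import time


class TokenBucket:
    """In-memory token bucket with lazily calculated refill (no background timer)."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # bucket starts full
        self.clock = clock          # injectable clock (assumption, eases testing)
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; False means the caller should send a 429."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Add the tokens earned since the last call, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `TokenBucket(rate=10, capacity=100)`, a burst of 100 calls to `allow()` succeeds, after which requests are admitted at roughly 10 per second. A distributed deployment would keep `tokens` and `last_refill` in a shared store such as Redis instead of instance attributes.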

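For comparison, here is a hedged sketch of the sliding log approach for a single client. The class name `SlidingWindowLog` and the `clock` parameter are illustrative assumptions; a real deployment would keep one log per client and would more likely use a Redis sorted set as noted in the comparison.

```python
import time
from collections import deque


class SlidingWindowLog:
    """Sliding log limiter for one client: stores one timestamp per accepted request."""

    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock              # injectable clock (assumption, eases testing)
        self.timestamps = deque()       # oldest timestamp on the left

    def allow(self) -> bool:
        now = self.clock()
        # Drop timestamps that have fallen out of the sliding window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True
```

Because the window is measured back from each request, 100 requests sent at second 59 still count against a request at second 61, which is the boundary burst that a fixed window would let through. Memory grows with the limit, since every accepted request keeps a timestamp until it ages out.
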
## Tiered Rate Limit Plans

```
TIERED API RATE LIMIT STRUCTURE
==================================

FREE TIER:

  Purpose: Developer testing, evaluation, low-volume personal projects
  Limits:
    → Requests per second: 10
    → Requests per minute: 500
    → Requests per day: 50,000
    → Requests per month: 1,000,000
    → Burst allowance: 20 requests (token bucket size)
    → Concurrent connections: 5
    → Payload size: 1 MB max per request
  Features:
    → Read-only endpoints (no write/mutation access)
    → Cached responses (not real-time data)
    → Community support only
    → No SLA commitment
    → Rate limit headers included in responses
  Monetization: Lead generation for paid tiers; viral marketing through developer adoption

BASIC TIER ($49/month):

  Purpose: Small applications, startups, internal tools
  Limits:
    → Requests per second: 50
    → Requests per minute: 3,000
    → Requests per day: 500,000
    → Requests per month: 10,000,000
    → Burst allowance: 100 requests
    → Concurrent connections: 20
    → Payload size: 5 MB max per request
  Features:
    → Read and write access to standard endpoints
    → Real-time data (not cached)
    → Email support (48-hour response SLA)
    → 99.5% uptime SLA
    → Webhook notifications for quota warnings (80%, 90%, 100%)
  Throttling Behavior:
    → Soft throttle at 80%: warning headers; requests still processed
    → Hard throttle at 100%: 429 responses; Retry-After header included

PROFESSIONAL TIER ($299/month):

  Purpose: Production applications, growing businesses
  Limits:
    → Requests per second: 200
    → Requests per minute: 12,000
    → Requests per day: 2,000,000
    → Requests per month: 50,000,000
    → Burst allowance: 500 requests
    → Concurrent connections: 50
    → Payload size: 10 MB max per request
  Features:
    → Full API access including premium endpoints
    → Real-time data with priority processing
    → Priority support (4-hour response SLA)
    → 99.9% uptime SLA
    → Dedicated rate limit dashboard
    → Custom webhook integrations
    → Bulk operations (batch endpoints with higher limits)
  Throttling Behavior:
    → Priority queuing: Professional requests processed before Basic/Free
    → Graceful degradation: At capacity, free/basic throttled first

ENTERPRISE TIER (Custom Pricing):

  Purpose: Large-scale applications, enterprise integrations, high-volume needs
  Limits:
    → Requests per second: 1,000+ (negotiated)
    → Requests per minute: 60,000+ (negotiated)
    → Requests per day: Unlimited (negotiated)
    → Burst allowance: 5,000+ (negotiated)
    → Concurrent connections: 200+ (negotiated)
    → Payload size: 50 MB+ (negotiated)
  Features:
    → Dedicated API infrastructure (isolated backend)
    → Custom rate limit profiles per endpoint
    → Dedicated account manager and support (1-hour response SLA)
    → 99.99% uptime SLA with financial penalties
    → Custom burst profiles for campaigns/events
    → Priority queuing (processed before all other tiers)
    → API usage analytics and forecasting
    → Custom SLA and penalty clauses
  Throttling Behavior:
    → Never throttled under normal conditions
    → Only throttled during extreme system events (security incidents, catastrophic failures)
    → Advance notification before any throttling
```
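
One way to make the plan table machine-readable is a small configuration structure plus a helper that applies the soft and hard throttle thresholds. This is a sketch under assumptions: the names `TierPlan`, `PLANS`, and `quota_state` are hypothetical, the Enterprise tier is left out because its limits are negotiated, and applying the 80% soft threshold to every tier is an assumption (the table states it only for Basic).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPlan:
    name: str
    per_second: int
    per_minute: int
    per_day: int
    per_month: int
    burst: int


# Values copied from the plan table above; Enterprise is negotiated and omitted.
PLANS = {
    "free": TierPlan("free", 10, 500, 50_000, 1_000_000, 20),
    "basic": TierPlan("basic", 50, 3_000, 500_000, 10_000_000, 100),
    "professional": TierPlan("professional", 200, 12_000, 2_000_000, 50_000_000, 500),
}


def quota_state(used: int, limit: int, soft_percent: int = 80) -> str:
    """Classify usage already counted in the current window.

    Returns 'reject' (hard throttle, send 429), 'warn' (soft throttle,
    process the request but add warning headers), or 'ok'.
    """
    if used >= limit:
        return "reject"
    # Integer arithmetic avoids floating-point surprises at the threshold.
    if used * 100 >= limit * soft_percent:
        return "warn"
    return "ok"
```

For example, `quota_state(400_000, PLANS["basic"].per_day)` returns `"warn"`, while `quota_state(500_000, PLANS["basic"].per_day)` returns `"reject"`.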

## Rate Limit Response Standards

```
RATE LIMIT RESPONSE SPECIFICATION
===================================

HTTP HEADERS (Included in EVERY API Response):

  X-RateLimit-Limit: 1000
    → Maximum requests allowed in the current window

  X-RateLimit-Remaining: 847
    → Requests remaining in the current window

  X-RateLimit-Reset: 1705312200
    → Unix timestamp when the current window resets

  Retry-After: 60 (Only on 429 and 503 responses)
    → Seconds to wait before retrying

HTTP STATUS CODES:

  200 OK (with rate limit headers):
    → Request processed successfully
    → Headers show remaining quota
    → Client should track remaining quota and adjust request rate

  429 Too Many Requests:
    → Rate limit exceeded
    → Response body includes:
      {
        "error": "rate_limit_exceeded",
        "message": "You have exceeded your rate limit of 1000 requests per hour.",
        "retry_after": 60,
        "documentation": "https://api.example.com/docs/rate-limits"
      }
    → Retry-After header included (seconds until next request allowed)
    → Client should implement exponential backoff with jitter

  503 Service Unavailable (during adaptive throttling):
    → System under maintenance or overload
    → Retry-After header included
    → Different from 429 (system issue, not client rate limit)

CLIENT-SIDE RETRY STRATEGY (Recommended):

  Exponential Backoff with Jitter:
    → 1st retry: wait 1s + random(0-1s)
    → 2nd retry: wait 2s + random(0-2s)
    → 3rd retry: wait 4s + random(0-4s)
    → 4th retry: wait 8s + random(0-8s)
    → 5th retry: wait 16s + random(0-16s)
    → Max retries: 5 (then fail with informative error)
    → Formula: wait = min(max_wait, base_delay * 2^attempt + random(0, base_delay * 2^attempt)), with base_delay = 1s and attempt = 0 for the 1st retry

  SDK Integration:
    → Built-in retry logic with exponential backoff
    → Automatic 429 handling
    → Rate limit header parsing (warn developer when approaching limits)
    → Configurable retry policy
```
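
On the server side, the header and 429 body conventions in the specification above could be assembled roughly as follows. The function names `rate_limit_headers` and `too_many_requests`, the `window_label` parameter, and the `(status, headers, body)` return shape are assumptions for illustration; they are not tied to any particular web framework.

```python
import json
import math


def rate_limit_headers(limit: int, remaining: int, reset_epoch: int) -> dict:
    """Headers attached to every response, per the specification above."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }


def too_many_requests(limit: int, reset_epoch: int, now: float, window_label: str = "hour"):
    """Build a 429 response as a (status, headers, body) tuple."""
    # Seconds until the window resets, rounded up and never below 1.
    retry_after = max(1, math.ceil(reset_epoch - now))
    headers = rate_limit_headers(limit, 0, reset_epoch)
    headers["Retry-After"] = str(retry_after)
    headers["Content-Type"] = "application/json"
    body = {
        "error": "rate_limit_exceeded",
        "message": f"You have exceeded your rate limit of {limit} requests per {window_label}.",
        "retry_after": retry_after,
        "documentation": "https://api.example.com/docs/rate-limits",
    }
    return 429, headers, json.dumps(body)
```

Deriving `Retry-After` from the reset timestamp keeps the header, the JSON body, and `X-RateLimit-Reset` consistent with one another.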

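The client-side retry strategy in the specification above can be sketched like this. The names `backoff_delay` and `call_with_retries`, the hypothetical `send` callable returning `(status, headers, body)`, and the 60-second `max_wait` default are assumptions; the source gives the formula but no value for `max_wait`. Only the delta-seconds form of `Retry-After` is handled here.

```python
import random
import time


def backoff_delay(attempt: int, base_delay: float = 1.0, max_wait: float = 60.0,
                  rng=random.random) -> float:
    """Delay before retry number `attempt` (0-based): exponential step plus jitter."""
    step = base_delay * (2 ** attempt)
    return min(max_wait, step + rng() * step)


def call_with_retries(send, max_retries: int = 5, sleep=time.sleep):
    """Call `send()` and retry on 429 with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        if attempt == max_retries:
            break
        delay = backoff_delay(attempt)
        # Honor the server's Retry-After when it asks for a longer wait.
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            delay = max(delay, float(retry_after))
        sleep(delay)
    raise RuntimeError(f"still rate limited after {max_retries} retries")
```

The jitter term spreads retries from many clients over time, so a shared outage does not produce a synchronized wave of retries when the limit resets.
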
## Integration Points

- **Kong Gateway**: Rate Limiting plugin (fixed-window counters) and Rate Limiting Advanced plugin (sliding window); configurable per-route, per-service, per-consumer; Redis-backed for distributed limiting; $500/month (enterprise)
- **AWS API Gateway**: Built-in usage plans and API keys; request-based throttling (per-second, burst); stage-level and method-level limits; CloudWatch metrics; included in API Gateway pricing
- **Apigee**: Quota and SpikeArrest policies; developer app quotas; tier-based plans; analytics; policy-based enforcement; $25,000/year minimum
- **NGINX Plus**: limit_req module (leaky bucket); limit_conn module (connection limiting); per-IP, per-IP+API key; $1,500/year/node
- **Redis**: Distributed rate limiting backend; INCR+EXPIRE (fixed window); sorted sets (sliding window); Lua scripts for atomic operations; $0.07/hr (Managed) or self-hosted
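- **Envoy Proxy**: Local rate limit filter (per-instance token bucket); global rate limiting via an external rate limit service (commonly Redis-backed); per-virtual-host, per-route; serves as the data plane for Istio (Linkerd uses its own proxy, not Envoy)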
- **Cloudflare**: Rate limiting rules; per-IP, per-API key; WAF integration; bot management; $5/month (Workers) to $200/month (Business)
- **Azure API Management**: Rate limiting by subscription; quota enforcement; IP filtering; policy-based configuration; $410/month (Developer tier)

## Edge Cases

- **Distributed rate limiting consistency**: Multiple API gateway instances must share rate limit state; solution: Redis cluster as a centralized counter; accept minor over-limit (a few requests) caused by races or replication lag; avoid purely per-instance limits, which multiply the effective limit by the number of instances
- **Burst traffic from legitimate sources**: Marketing campaigns, product launches cause legitimate spikes; solution: pre-arranged temporary limit increases; burst token allowance in token bucket; dedicated enterprise tier for planned events
- **API key sharing (single key, multiple users)**: Shared key reaches limit faster; solution: unique API keys per application/developer; monitor for anomalous usage patterns suggesting key sharing; enforce key scoping
- **Global API with regional rate limits**: Users in different regions should have separate limits; solution: rate limit per-region (geo-based routing + regional counters); document regional limits clearly; consider global + regional combined limits
- **Webhook rate limiting**: Webhook delivery failures cause retries; could overwhelm recipient; solution: separate rate limit for outgoing webhooks; exponential backoff for retries; dead letter queue for persistent failures; recipient health checking
- **Real-time APIs (WebSocket, Server-Sent Events)**: Rate limiting per connection not per request; solution: limit message rate per connection; limit connections per client; timeout idle connections; message size limits
- **Legacy API consumers unable to handle 429**: Older clients may crash on 429 responses; solution: gradual enforcement (warn before enforce); SDK migration path; backward-compatible throttling (delayed response instead of 429 during transition period)
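
For the distributed consistency case listed first above, a shared fixed-window counter using the INCR + EXPIRE pattern might look like the sketch below. It assumes `client` is a redis-py `redis.Redis` instance (or any object with compatible `incr` and `expire` methods); the function name `allow_request` and the `ratelimit:` key layout are hypothetical.

```python
import time


def allow_request(client, client_id: str, limit: int, window_seconds: int, now=None) -> bool:
    """Fixed-window counter shared by all gateway instances through Redis."""
    now = time.time() if now is None else now
    window_id = int(now // window_seconds)
    # Hypothetical key layout: one counter per client per window.
    key = f"ratelimit:{client_id}:{window_id}"
    count = client.incr(key)  # INCR is atomic, so concurrent instances never lose updates
    if count == 1:
        # First hit in this window: let Redis delete the key once the window has passed.
        client.expire(key, window_seconds)
    return count <= limit
```

INCR and EXPIRE are separate commands here, so a crash between them can leave a key without a TTL. Because the window number is part of the key, such a key cannot block later windows, though it does linger in memory. Wrapping both commands in a Lua script, as mentioned under Integration Points, would make the pair atomic.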
