API Rate Limiting & Throttling
Protect APIs from abuse, ensure fair usage across consumers, manage backend capacity, and maintain service quality through comprehensive rate limiting, throttling, and request governance strategies.
Workflow
- Define rate limiting requirements per API: assess endpoint criticality, expected traffic volume, backend capacity, fairness requirements, and business tier models.
- Select rate limiting algorithms: fixed window, sliding window, token bucket, leaky bucket, or adaptive; choose based on use case (burst tolerance, precision, distributed environment).
- Implement rate limiting infrastructure: API gateway layer (Kong, Apigee, AWS API Gateway), service mesh layer (Istio, Linkerd), application layer (middleware), or distributed store (Redis-based).
- Design tiered rate limit plans: free tier, basic, professional, enterprise; align with pricing and SLA commitments; document quotas clearly.
- Configure rate limit headers and responses: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset; return 429 Too Many Requests with a Retry-After header when a limit is exceeded.
- Implement throttling strategies: gradual degradation (not hard cutoff); priority queuing for premium tiers; adaptive throttling based on backend health.
- Set up monitoring and alerting: track rate limit hit rates, throttled request volumes, abuse patterns, tier plan utilization; alert on anomalies.
- Handle rate limit exceptions: whitelist trusted IPs/partners; burst allowance for legitimate spikes; dynamic limit adjustment for planned campaigns.
- Communicate rate limits to API consumers: developer documentation, SDK integration (retry logic with exponential backoff), dashboard showing remaining quota.
- Review and adjust quarterly: analyze rate limit effectiveness, adjust limits based on traffic patterns, update tier plans, address consumer feedback.
Rate Limiting Algorithms
RATE LIMITING ALGORITHMS COMPARISON
=====================================
FIXED WINDOW COUNTER:
Mechanism:
→ Divide time into fixed windows (e.g., 1-minute, 1-hour, 1-day)
→ Count requests within each window
→ Reject when count exceeds limit
→ Counter resets at window boundary
Example: 100 requests per minute
→ Window: 10:00:00 to 10:00:59
→ Requests 1-100: allowed
→ Request 101+: rejected (429)
→ At 10:01:00: counter resets to 0
Advantages:
→ Simplest algorithm to implement
→ Low memory usage (one counter per client per window)
→ Easy to understand and explain to API consumers
Disadvantages:
→ Boundary burst problem: client can send 2x limit at window boundary
(100 at 9:59:59 + 100 at 10:00:01 = 200 in 2 seconds)
→ Not suitable for APIs requiring smooth traffic distribution
→ Precision limited to window granularity
Best for: Daily/monthly quotas; simple usage tracking; non-critical APIs
Implementation: Redis INCR + EXPIRE; in-memory counter with TTL
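The fixed window mechanism above can be sketched in a few lines. This is a minimal in-memory illustration, not a production implementation: the class name `FixedWindowLimiter`, the `allow` method, and the injectable `clock` parameter are all hypothetical names chosen for the example.

```python
import time


class FixedWindowLimiter:
    """Fixed window counter: at most `limit` requests per `window_seconds`."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self.state = {}     # client_id -> (window_index, count)

    def allow(self, client_id):
        # All timestamps in the same window share one integer index.
        window = int(self.clock() // self.window_seconds)
        stored_window, count = self.state.get(client_id, (window, 0))
        if stored_window != window:
            count = 0  # a new window has started, so the counter resets
        if count >= self.limit:
            return False  # caller would answer with 429
        self.state[client_id] = (window, count + 1)
        return True


# Boundary burst: 100 requests at the end of one window and 100 at the
# start of the next are all accepted, 200 within about one second.
now = [59.0]
limiter = FixedWindowLimiter(limit=100, window_seconds=60, clock=lambda: now[0])
first_burst = sum(limiter.allow("client-a") for _ in range(100))
now[0] = 60.0
second_burst = sum(limiter.allow("client-a") for _ in range(100))
print(first_burst + second_burst)
```

The usage lines deliberately reproduce the boundary burst problem listed under Disadvantages, which is the main reason to prefer another algorithm when smooth traffic matters.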
SLIDING LOG WINDOW:
Mechanism:
→ Record timestamp of every request
→ Count requests within the sliding time window
→ Remove timestamps older than window duration
→ Reject when count exceeds limit
Example: 100 requests per minute (sliding)
→ At 10:00:30: check requests since 9:59:30
→ Exactly 60-second window sliding forward continuously
→ No boundary burst problem
Advantages:
→ Precise enforcement (exact time window)
→ No boundary burst vulnerability
→ Fair enforcement across time boundaries
Disadvantages:
→ Higher memory usage (store timestamp per request)
→ More complex implementation
→ Performance overhead for high-QPS APIs (timestamp storage and cleanup)
Best for: APIs where precision matters; per-second rate limits; abuse prevention
Implementation: Redis sorted set (ZADD with timestamp score; ZREMRANGEBYSCORE for old entries)
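As a rough in-memory counterpart to the sorted-set approach, the sketch below keeps one timestamp log per client. The names `SlidingLogLimiter`, `allow`, and `clock` are assumptions made for illustration; a Redis-backed version would replace the deque with ZADD and ZREMRANGEBYSCORE calls.

```python
import time
from collections import defaultdict, deque


class SlidingLogLimiter:
    """Sliding log: at most `limit` requests in any `window_seconds` span."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.logs = defaultdict(deque)  # client_id -> timestamps, oldest first

    def allow(self, client_id):
        now = self.clock()
        log = self.logs[client_id]
        cutoff = now - self.window_seconds
        # Drop timestamps that have slid out of the window.
        while log and log[0] <= cutoff:
            log.popleft()
        if len(log) >= self.limit:
            return False  # rejected requests are not logged
        log.append(now)
        return True
```

Memory grows with the limit, since one timestamp is stored per accepted request, which matches the trade-off listed under Disadvantages.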
TOKEN BUCKET:
Mechanism:
→ Bucket holds tokens (up to max capacity)
→ Tokens added at fixed rate (e.g., 10 tokens/second)
→ Each request consumes one or more tokens
→ Request rejected if insufficient tokens
→ Bucket never exceeds max capacity
Example: 10 tokens/sec, bucket size 100
→ Bucket starts full (100 tokens)
→ Burst of 100 requests: all allowed (bucket drains to 0)
→ Next request: must wait for token refill (100ms for 1 token)
→ Sustained rate: 10 requests/second after burst consumed
Advantages:
→ Allows controlled bursting (bucket full = burst capacity)
→ Smooth sustained rate enforcement
→ Flexible: different request types can consume different token counts
→ Widely adopted (for example, AWS API Gateway throttling is based on it)
Disadvantages:
→ Burst allowance may overwhelm backend if not sized correctly
→ Requires continuous token refill mechanism (timer or calculated)
→ More complex than fixed window
Best for: General API rate limiting; APIs needing burst tolerance; tiered rate limiting
Implementation: Redis (store token count + last refill timestamp); calculate tokens on each request
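The "calculate tokens on each request" approach can be sketched as follows. This is a single-process illustration under stated assumptions: `TokenBucket`, `allow`, `cost`, and `clock` are hypothetical names, and a distributed version would keep `tokens` and `last_refill` in a shared store instead of instance attributes.

```python
import time


class TokenBucket:
    """Token bucket with lazy refill: no timer, tokens are computed on demand."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate                # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)   # bucket starts full
        self.last_refill = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        elapsed = now - self.last_refill
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The `cost` argument is one way to let expensive request types consume more than one token, as mentioned under Advantages.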
LEAKY BUCKET:
Mechanism:
→ Requests enter a queue (bucket) at variable rate
→ Requests processed (leak) at fixed rate
→ Queue overflow = request rejected
→ Output rate is constant regardless of input rate
Example: Process 10 requests/sec, queue size 50
→ 100 requests arrive in a single burst
→ 50 accepted into the queue; 50 rejected (overflow)
→ 10 processed in the first second; the remaining 40 drain at 10/sec over the next 4 seconds
Advantages:
→ Smooth, constant output rate (protects backend from traffic spikes)
→ Simple queue-based model
→ Good for downstream rate protection
Disadvantages:
→ No burst allowance (even if backend has spare capacity)
→ Increased latency (requests wait in queue)
→ Queue management overhead in distributed systems
Best for: Backend protection; database query rate limiting; smoothing traffic to downstream services
Implementation: Queue-based processing; fixed-rate worker; reject on queue full
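The queue-plus-fixed-rate-worker model can be simulated with a bounded queue and a `tick` call standing in for one second of work. The names `LeakyBucket`, `offer`, and `tick` are assumptions for this sketch; a real deployment would drive the leak from a timer or worker loop.

```python
from collections import deque


class LeakyBucket:
    """Bounded queue drained at a fixed rate; one tick() represents one second."""

    def __init__(self, leak_rate, capacity):
        self.leak_rate = leak_rate  # requests processed per tick
        self.capacity = capacity    # maximum queued requests
        self.queue = deque()

    def offer(self, request):
        """Enqueue a request, or return False if the queue is full (overflow)."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def tick(self):
        """Process up to leak_rate queued requests and return them in order."""
        count = min(self.leak_rate, len(self.queue))
        return [self.queue.popleft() for _ in range(count)]


# A burst of 100 against a queue of 50 drained at 10 requests per tick.
bucket = LeakyBucket(leak_rate=10, capacity=50)
accepted = sum(bucket.offer(i) for i in range(100))
bucket.tick()
print(accepted, len(bucket.queue))
```

With these numbers, 50 requests are accepted and 40 remain queued after the first tick, so the queue needs four more ticks to empty.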
ADAPTIVE / DYNAMIC RATE LIMITING:
Mechanism:
→ Rate limits adjusted based on real-time backend health metrics
→ Reduce limits when backend under stress (high CPU, latency, errors)
→ Increase limits when backend has spare capacity
→ Optionally, predictive models forecast load and adjust limits ahead of demand
Example: Backend CPU at 80% → rate limits reduced by 30%
Backend CPU at 40% → rate limits restored to normal
Advantages:
→ Protects backend from overload dynamically
→ Maximizes throughput during low-utilization periods
→ Responds to real-time conditions
Disadvantages:
→ Complex to implement and tune
→ Unpredictable limits for API consumers (limits change dynamically)
→ Requires comprehensive backend monitoring
→ Risk of limit flapping (oscillation)
Best for: APIs with variable backend capacity; auto-scaling environments; cost-optimized cloud deployments
Implementation: Real-time metrics → rate limit calculator → dynamic config update (Consul, etcd, API gateway)
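The "rate limit calculator" step might look like the sketch below, which uses the CPU thresholds from the example (reduce by 30% at 80% CPU, restore at 40%). Holding the current limit between the two thresholds is an assumption added here to dampen the flapping risk noted above; `adjust_limit` and its parameter names are hypothetical.

```python
def adjust_limit(base_limit, current_limit, cpu_percent,
                 high_water=80, low_water=40, reduction_percent=30):
    """Return the limit to apply for the next interval.

    Between low_water and high_water the current limit is kept unchanged,
    so a CPU reading hovering around one threshold does not toggle the limit.
    """
    if cpu_percent >= high_water:
        # Integer arithmetic avoids float rounding surprises.
        return base_limit * (100 - reduction_percent) // 100
    if cpu_percent <= low_water:
        return base_limit
    return current_limit


limit = 1000
for cpu in (35, 85, 60, 45, 38):
    limit = adjust_limit(1000, limit, cpu)
    print(cpu, limit)
```

In this run the limit drops to 700 at 85% CPU, stays at 700 while CPU is 60% and 45%, and returns to 1000 only once CPU falls to 38%.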
ALGORITHM SELECTION GUIDE:
Use Case                          | Recommended Algorithm
──────────────────────────────────|─────────────────────────────
General API rate limiting         | Token Bucket
Simple daily/monthly quotas       | Fixed Window
Abuse prevention (precision)      | Sliding Log Window
Backend protection (smoothing)    | Leaky Bucket
Variable capacity environments    | Adaptive
Distributed rate limiting         | Token Bucket + Redis
Per-user fair queuing             | Sliding Log + Priority Queue
Tiered Rate Limit Plans
TIERED API RATE LIMIT STRUCTURE
==================================
FREE TIER:
Purpose: Developer testing, evaluation, low-volume personal projects
Limits:
→ Requests per second: 10
→ Requests per minute: 500
→ Requests per day: 50,000
→ Requests per month: 1,000,000
→ Burst allowance: 20 requests (token bucket size)
→ Concurrent connections: 5
→ Payload size: 1 MB max per request
Features:
→ Read-only endpoints (no write/mutation access)
→ Cached responses (not real-time data)
→ Community support only
→ No SLA commitment
→ Rate limit headers included in responses
Monetization: Lead generation for paid tiers; viral marketing through developer adoption
BASIC TIER ($49/month):
Purpose: Small applications, startups, internal tools
Limits:
→ Requests per second: 50
→ Requests per minute: 3,000
→ Requests per day: 500,000
→ Requests per month: 10,000,000
→ Burst allowance: 100 requests
→ Concurrent connections: 20
→ Payload size: 5 MB max per request
Features:
→ Read and write access to standard endpoints
→ Real-time data (not cached)
→ Email support (48-hour response SLA)
→ 99.5% uptime SLA
→ Webhook notifications for quota warnings (80%, 90%, 100%)
Throttling Behavior:
→ Soft throttle at 80%: warning headers; requests still processed
→ Hard throttle at 100%: 429 responses; Retry-After header included
PROFESSIONAL TIER ($299/month):
Purpose: Production applications, growing businesses
Limits:
→ Requests per second: 200
→ Requests per minute: 12,000
→ Requests per day: 2,000,000
→ Requests per month: 50,000,000
→ Burst allowance: 500 requests
→ Concurrent connections: 50
→ Payload size: 10 MB max per request
Features:
→ Full API access including premium endpoints
→ Real-time data with priority processing
→ Priority support (4-hour response SLA)
→ 99.9% uptime SLA
→ Dedicated rate limit dashboard
→ Custom webhook integrations
→ Bulk operations (batch endpoints with higher limits)
Throttling Behavior:
→ Priority queuing: Professional requests processed before Basic/Free
→ Graceful degradation: At capacity, free/basic throttled first
ENTERPRISE TIER (Custom Pricing):
Purpose: Large-scale applications, enterprise integrations, high-volume needs
Limits:
→ Requests per second: 1,000+ (negotiated)
→ Requests per minute: 60,000+ (negotiated)
→ Requests per day: No fixed cap (negotiated)
→ Burst allowance: 5,000+ (negotiated)
→ Concurrent connections: 200+ (negotiated)
→ Payload size: 50 MB+ (negotiated)
Features:
→ Dedicated API infrastructure (isolated backend)
→ Custom rate limit profiles per endpoint
→ Dedicated account manager and support (1-hour response SLA)
→ 99.99% uptime SLA with financial penalties
→ Custom burst profiles for campaigns/events
→ Priority queuing (processed before all other tiers)
→ API usage analytics and forecasting
→ Custom SLA and penalty clauses
Throttling Behavior:
→ Never throttled under normal conditions
→ Only throttled during extreme system events (security incidents, catastrophic failures)
→ Advance notification before any throttling
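One way to carry these plans into code is a small lookup table plus a helper for the soft and hard throttle thresholds. The sketch below copies the Free, Basic, and Professional numbers from the plans above and omits Enterprise because its limits are negotiated; `TierPlan`, `PLANS`, and `throttle_state` are hypothetical names, and applying the 80% soft threshold to every tier is an assumption (the text states it only for Basic).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPlan:
    name: str
    per_second: int
    per_minute: int
    per_day: int
    per_month: int
    burst: int


PLANS = {
    "free": TierPlan("free", 10, 500, 50_000, 1_000_000, 20),
    "basic": TierPlan("basic", 50, 3_000, 500_000, 10_000_000, 100),
    "professional": TierPlan("professional", 200, 12_000, 2_000_000, 50_000_000, 500),
}


def throttle_state(used, quota, soft_percent=80):
    """Classify quota usage as 'ok', 'warn' (soft throttle), or 'reject' (429)."""
    if used >= quota:
        return "reject"
    if used * 100 >= quota * soft_percent:
        return "warn"  # still processed, but with warning headers
    return "ok"


plan = PLANS["basic"]
print(throttle_state(400_000, plan.per_day))
```

Here 400,000 of 500,000 daily requests is exactly 80%, so the example lands in the soft-throttle state.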
Rate Limit Response Standards
RATE LIMIT RESPONSE SPECIFICATION
===================================
HTTP HEADERS (Included in EVERY API Response):
X-RateLimit-Limit: 1000
→ Maximum requests allowed in the current window
X-RateLimit-Remaining: 847
→ Requests remaining in the current window
X-RateLimit-Reset: 1705312200
→ Unix timestamp when the current window resets
Retry-After: 60 (sent only with 429 and 503 responses)
→ Seconds to wait before retrying
HTTP STATUS CODES:
200 OK (with rate limit headers):
→ Request processed successfully
→ Headers show remaining quota
→ Client should track remaining quota and adjust request rate
429 Too Many Requests:
→ Rate limit exceeded
→ Response body includes:
{
  "error": "rate_limit_exceeded",
  "message": "You have exceeded your rate limit of 1000 requests per hour.",
  "retry_after": 60,
  "documentation": "https://api.example.com/docs/rate-limits"
}
→ Retry-After header included (seconds until next request allowed)
→ Client should implement exponential backoff with jitter
503 Service Unavailable (during adaptive throttling):
→ System under maintenance or overload
→ Retry-After header included
→ Different from 429 (system issue, not client rate limit)
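A framework-neutral sketch of building these headers and the 429 body is shown below. The function names `rate_limit_headers` and `too_many_requests`, the `window_label` parameter, and the returned `(status, headers, body)` tuple are assumptions for illustration; a real service would adapt them to its web framework's response type.

```python
import json
import math


def rate_limit_headers(limit, remaining, reset_epoch):
    """Headers attached to every response."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }


def too_many_requests(limit, window_label, reset_epoch, now_epoch,
                      docs_url="https://api.example.com/docs/rate-limits"):
    """Build a 429 response as (status, headers, body)."""
    # Wait until the window resets, and never advertise less than 1 second.
    retry_after = max(1, math.ceil(reset_epoch - now_epoch))
    headers = rate_limit_headers(limit, 0, reset_epoch)
    headers["Retry-After"] = str(retry_after)
    headers["Content-Type"] = "application/json"
    body = json.dumps({
        "error": "rate_limit_exceeded",
        "message": f"You have exceeded your rate limit of {limit} requests per {window_label}.",
        "retry_after": retry_after,
        "documentation": docs_url,
    })
    return 429, headers, body


status, headers, body = too_many_requests(1000, "hour", 1705312200, 1705312140)
print(status, headers["Retry-After"])
```

Deriving Retry-After from the reset timestamp keeps the header, the body field, and X-RateLimit-Reset consistent with one another.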
CLIENT-SIDE RETRY STRATEGY (Recommended):
Exponential Backoff with Jitter:
→ 1st retry: wait 1s + random(0-1s)
→ 2nd retry: wait 2s + random(0-2s)
→ 3rd retry: wait 4s + random(0-4s)
→ 4th retry: wait 8s + random(0-8s)
→ 5th retry: wait 16s + random(0-16s)
→ Max retries: 5 (then fail with informative error)
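→ Formula: wait = min(max_wait, base_delay * 2^attempt + random(0, base_delay * 2^attempt)), with base_delay = 1s and attempt = 0 for the 1st retry
→ If the response carries Retry-After, wait at least that long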
SDK Integration:
→ Built-in retry logic with exponential backoff
→ Automatic 429 handling
→ Rate limit header parsing (warn developer when approaching limits)
→ Configurable retry policy
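The retry strategy above might be implemented in an SDK roughly as follows. This is a sketch under assumptions: `backoff_delay` and `call_with_retries` are hypothetical names, `send` is any callable returning `(status, headers, body)`, the 60-second `max_wait` default is illustrative, and Retry-After is assumed to be given in seconds rather than as an HTTP date.

```python
import random
import time


def backoff_delay(attempt, base_delay=1.0, max_wait=60.0, rng=random.random):
    """Exponential backoff with jitter; attempt is 0 for the first retry."""
    exponential = base_delay * (2 ** attempt)
    return min(max_wait, exponential + rng() * exponential)


def call_with_retries(send, max_retries=5, sleep=time.sleep):
    """Call send() and retry on 429, up to max_retries additional attempts."""
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        if attempt == max_retries:
            break
        delay = backoff_delay(attempt)
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            # Never retry sooner than the server asked for.
            delay = max(delay, float(retry_after))
        sleep(delay)
    raise RuntimeError(f"still rate limited after {max_retries} retries")
```

Passing `sleep` and `rng` as parameters keeps the policy configurable and lets tests run without real waiting.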
Integration Points
- Kong Gateway: Rate Limiting plugin (fixed-window counters) and Rate Limiting Advanced plugin (sliding window); configurable per-route, per-service, per-consumer; Redis-backed for distributed limiting; $500/month (enterprise)
- AWS API Gateway: Built-in usage plans and API keys; request-based throttling (per-second, burst); stage-level and method-level limits; CloudWatch metrics; included in API Gateway pricing
- Apigee: Rate quota policy; developer app quotas; tier-based plans; analytics; policy-based enforcement; $25,000/year minimum
- NGINX Plus: limit_req module (leaky bucket); limit_conn module (connection limiting); per-IP, per-IP+API key; $1,500/year/node
- Redis: Distributed rate limiting backend; INCR+EXPIRE (fixed window); sorted sets (sliding window); Lua scripts for atomic operations; $0.07/hr (Managed) or self-hosted
- Envoy Proxy: Local rate limit filter (token bucket); global rate limiting via an external rate limit service (commonly Redis-backed); per-virtual-host, per-route; serves as the data plane for Istio
- Cloudflare: Rate limiting rules; per-IP, per-API key; WAF integration; bot management; $5/month (Workers) to $200/month (Business)
- Azure API Management: Rate limiting by subscription; quota enforcement; IP filtering; policy-based configuration; $410/month (Developer tier)
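The INCR+EXPIRE pattern listed for Redis can be sketched against any store that exposes `incr` and `expire`, which redis-py's `Redis` client does. To keep the example runnable without a server, it uses a hypothetical in-memory stand-in (`FakeStore`); `allow_request` and the `ratelimit:` key format are also assumptions.

```python
class FakeStore:
    """In-memory stand-in exposing the two Redis commands the sketch needs."""

    def __init__(self):
        self.values = {}
        self.ttls = {}

    def incr(self, key):
        self.values[key] = self.values.get(key, 0) + 1
        return self.values[key]

    def expire(self, key, seconds):
        self.ttls[key] = seconds  # recorded only; nothing is evicted here
        return True


def allow_request(store, client_id, limit, window_seconds, now):
    """Shared fixed-window check: every gateway instance hits the same key."""
    window = int(now // window_seconds)
    key = f"ratelimit:{client_id}:{window}"
    count = store.incr(key)
    if count == 1:
        # First hit in this window sets the TTL. INCR followed by EXPIRE is
        # not atomic; a Lua script would close that gap.
        store.expire(key, window_seconds)
    return count <= limit


store = FakeStore()
results = [allow_request(store, "client-a", 3, 60, now=120.0) for _ in range(5)]
print(results)
```

Because the window index is part of the key, a key that misses its TTL only wastes memory and does not block the client in later windows.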
Edge Cases
- Distributed rate limiting consistency: With multiple API gateway instances, rate limit state must be shared; solution: Redis cluster as a centralized counter; accept minor over-limit (a few requests) caused by replication lag or races; avoid independent per-instance limits, which let a client exceed the intended limit by spreading requests across instances
- Burst traffic from legitimate sources: Marketing campaigns, product launches cause legitimate spikes; solution: pre-arranged temporary limit increases; burst token allowance in token bucket; dedicated enterprise tier for planned events
- API key sharing (single key, multiple users): Shared key reaches limit faster; solution: unique API keys per application/developer; monitor for anomalous usage patterns suggesting key sharing; enforce key scoping
- Global API with regional rate limits: Users in different regions should have separate limits; solution: rate limit per-region (geo-based routing + regional counters); document regional limits clearly; consider global + regional combined limits
- Webhook rate limiting: Webhook delivery failures cause retries; could overwhelm recipient; solution: separate rate limit for outgoing webhooks; exponential backoff for retries; dead letter queue for persistent failures; recipient health checking
- Real-time APIs (WebSocket, Server-Sent Events): Rate limiting per connection not per request; solution: limit message rate per connection; limit connections per client; timeout idle connections; message size limits
- Legacy API consumers unable to handle 429: Older clients may crash on 429 responses; solution: gradual enforcement (warn before enforce); SDK migration path; backward-compatible throttling (delayed response instead of 429 during transition period)