# Designing a Rate Limiter From Scratch — Token Bucket vs Sliding Window vs Fixed Window vs Leaky Bucket

## Why Most Rate Limiter Articles Miss the Point

Search "rate limiter system design" and you'll find two kinds of articles.

The first kind gives you a surface-level overview of algorithms with no real implementation details. The second gives you a Redis snippet with no explanation of why that algorithm was chosen, what its failure modes are, or how it behaves under traffic spikes.

Neither prepares you for the real question — which algorithm do you actually use, and when?

This post covers all four major rate-limiting algorithms, compares them honestly, and shows you how to implement per-user and per-IP limiting in a distributed system. By the end, you'll have a decision framework you can apply to any project — not just a definition to recite in an interview.

* * *

## What Is Rate Limiting and Why Does It Matter?

Rate limiting controls how many requests a client can make to your system in a given time window.

Without it:

*   A single misbehaving client can exhaust your server resources
    
*   A credential stuffing attack tries thousands of passwords per second
    
*   A DDoS amplification attack overwhelms your upstream services
    
*   One heavy API consumer degrades the experience for everyone else
    

Rate limiting protects your system at the edge — before expensive business logic runs.

* * *

## The Four Algorithms

### 1\. Fixed Window Counter

**How it works:**

Divide time into fixed windows (e.g., every 60 seconds). Count requests per client per window. Reject once the count exceeds the limit.

```plaintext
Window: 12:00:00 → 12:01:00  |  Limit: 100 requests
 
Client A: ████████████████ 87 requests  ✅
Client B: ████████████████████████ 100 requests  ✅ (101st rejected)
```

**Implementation (Python + Redis):**

```python
import redis
import time
 
r = redis.Redis()
 
def is_allowed_fixed_window(client_id: str, limit: int, window_seconds: int) -> bool:
    window_key = int(time.time() // window_seconds)
    key = f"rate:{client_id}:{window_key}"
    
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_seconds)
    
    return count <= limit
```

**Pros:**

*   Dead simple to implement
    
*   Very low memory — one counter per client per window
    
*   Easy to reason about and debug
    

**Cons — the boundary problem:**

This is the critical flaw. A client can make 100 requests at 12:00:59 and another 100 at 12:01:01 — 200 requests in 2 seconds, both within their respective windows but double the intended limit.

```plaintext
Window 1 ends ──────────────────┐
                                 │
                  100 requests ──┘└── 100 requests
                                 │
Window 2 starts ─────────────────┘
 
200 requests in ~2 seconds. Window boundary exploited.
```

**When to use it:** Internal admin tools, low-stakes APIs, anywhere simplicity matters more than precision.

* * *

### 2\. Sliding Window Counter

**How it works:**

A refinement of Fixed Window that weights the previous window's count based on how far through the current window you are.

```plaintext
Current count = 
  (previous_window_count × overlap_ratio) + current_window_count
```

**Example:**

You're 25% through the current window. Previous window had 80 requests. Current window has 30 so far.

```plaintext
Weighted count = (80 × 0.75) + 30 = 60 + 30 = 90
```

**Implementation:**

```python
def is_allowed_sliding_window(client_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    current_window = int(now // window_seconds)
    previous_window = current_window - 1
    
    # How far through the current window are we?
    elapsed = now % window_seconds
    overlap_ratio = 1 - (elapsed / window_seconds)
    
    current_key = f"rate:{client_id}:{current_window}"
    previous_key = f"rate:{client_id}:{previous_window}"
    
    current_count = int(r.get(current_key) or 0)
    previous_count = int(r.get(previous_key) or 0)
    
    weighted_count = (previous_count * overlap_ratio) + current_count
    
    if weighted_count >= limit:
        return False
    
    count = r.incr(current_key)
    if count == 1:
        r.expire(current_key, window_seconds * 2)
    
    return True
```

**Pros:**

*   Solves the boundary burst problem
    
*   Still memory-efficient — just two counters per client
    
*   Good approximation of true sliding window behaviour
    

**Cons:**

*   The weighting is an approximation — not perfectly accurate
    
*   Slightly harder to implement correctly
    
*   Edge cases around window boundaries need care
    

**When to use it:** Public APIs where you want fairness without the complexity of a full sliding window log.

* * *

### 3\. Token Bucket

**How it works:**

Each client has a bucket that holds tokens. Tokens refill at a constant rate up to a maximum capacity. Each request consumes one token. No token = request rejected.

```plaintext
Bucket capacity: 10 tokens
Refill rate: 2 tokens/second
 
t=0:  [██████████] 10 tokens
t=1:  Request → [█████████ ] 9 tokens
t=1:  Request → [████████  ] 8 tokens
t=2:  Refill  → [██████████] 10 tokens (capped at max)
```

**Implementation:**

```python
import time
 
def is_allowed_token_bucket(
    client_id: str,
    capacity: int,
    refill_rate: float  # tokens per second
) -> bool:
    key = f"token_bucket:{client_id}"
    now = time.time()
    
    # Lua script for atomic read-modify-write
    lua_script = """
    local key = KEYS[1]
    local capacity = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])
    
    local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(bucket[1]) or capacity
    local last_refill = tonumber(bucket[2]) or now
    
    -- Add tokens based on time elapsed
    local elapsed = now - last_refill
    tokens = math.min(capacity, tokens + elapsed * refill_rate)
    
    if tokens < 1 then
        return 0  -- Rejected
    end
    
    tokens = tokens - 1
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, 3600)
    return 1  -- Allowed
    """
    
    result = r.eval(lua_script, 1, key, capacity, refill_rate, now)
    return result == 1
```

**Why Lua?** Redis executes Lua scripts atomically — essential here to prevent race conditions where two concurrent requests both read the same token count and both think they can proceed.

**Pros:**

*   Handles bursts naturally — a client can use saved-up tokens for a short burst
    
*   Smooth traffic shaping
    
*   Widely used in production (AWS API Gateway uses token bucket internally)
    

**Cons:**

*   Two parameters to tune (capacity + refill rate) — easy to misconfigure
    
*   Slightly more complex implementation
    
*   Burst behaviour can be surprising if you're not expecting it
    

**When to use it:** APIs where some bursting is acceptable — e.g. a user uploading a batch of files. The token bucket lets them burst briefly without being punished for it.

* * *

### 4\. Leaky Bucket

**How it works:**

Requests enter a queue (the bucket). They're processed at a constant rate regardless of how fast they arrive. If the queue is full, new requests are rejected.

```plaintext
Incoming requests (variable rate)
         │││││││││
         ▼▼▼▼▼▼▼▼▼
    ┌─────────────┐
    │  Queue      │ ← Max capacity: 10
    │  ▓▓▓▓▓▓▓   │
    └──────┬──────┘
           │ constant outflow
           ▼▼▼▼▼▼▼▼  (e.g. 5 req/sec)
```

**Implementation:**

```python
def is_allowed_leaky_bucket(
    client_id: str,
    capacity: int,
    leak_rate: float  # requests per second
) -> bool:
    key = f"leaky:{client_id}"
    now = time.time()
    
    lua_script = """
    local key = KEYS[1]
    local capacity = tonumber(ARGV[1])
    local leak_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])
    
    local bucket = redis.call('HMGET', key, 'queue_size', 'last_leak')
    local queue_size = tonumber(bucket[1]) or 0
    local last_leak = tonumber(bucket[2]) or now
    
    -- Drain the bucket based on elapsed time
    local elapsed = now - last_leak
    local leaked = math.floor(elapsed * leak_rate)
    queue_size = math.max(0, queue_size - leaked)
    
    if queue_size >= capacity then
        return 0  -- Bucket full, reject
    end
    
    queue_size = queue_size + 1
    redis.call('HMSET', key, 'queue_size', queue_size, 'last_leak', now)
    redis.call('EXPIRE', key, 3600)
    return 1  -- Allowed
    """
    
    result = r.eval(lua_script, 1, key, capacity, leak_rate, now)
    return result == 1
```

**Pros:**

*   Guarantees a perfectly smooth, consistent outflow rate
    
*   Protects downstream services from any burst whatsoever
    
*   Predictable processing rate
    

**Cons:**

*   Bursts are always rejected — even legitimate ones
    
*   Queue management adds latency
    
*   Less intuitive than token bucket
    

**When to use it:** Protecting downstream services that can't handle any variation in request rate — payment processors, third-party APIs with strict rate limits, legacy systems with fixed throughput.

* * *

## Algorithm Comparison

| Algorithm | Burst Handling | Memory | Accuracy | Implementation | Best For |
| --- | --- | --- | --- | --- | --- |
| Fixed Window | ❌ Boundary exploitable | ✅ Very low | ⚠️ Approximate | ✅ Simple | Internal tools |
| Sliding Window | ✅ Good | ✅ Low | ✅ Good | ⚠️ Moderate | Public APIs |
| Token Bucket | ✅ Allows bursts | ⚠️ Medium | ✅ Accurate | ⚠️ Moderate | User-facing APIs |
| Leaky Bucket | ❌ Smooths all bursts | ⚠️ Medium | ✅ Accurate | ⚠️ Moderate | Downstream protection |

* * *

## Per-User vs Per-IP Limiting

This is where most tutorials stop. Real systems need both.

### The Problem With IP-Only Limiting

*   Users behind corporate NAT share one IP — limit one, limit all
    
*   IPv6 makes IP-based blocking trivial to bypass (rotate addresses)
    
*   Mobile users switch IPs constantly
    

### The Problem With User-Only Limiting

*   Unauthenticated endpoints can't use user IDs
    
*   Login endpoints need IP limiting before a user ID is even known
    
*   Credential stuffing attacks rotate user accounts
    

### The Solution: Layered Limiting

Apply both, with different limits and algorithms for each layer:

```python
from fastapi import FastAPI, Request, HTTPException
from typing import Optional
 
app = FastAPI()
 
# Layer 1: IP-based (protects unauthenticated endpoints)
IP_LIMIT = 200        # requests per minute per IP
IP_WINDOW = 60
 
# Layer 2: User-based (protects authenticated endpoints)
USER_LIMIT = 100      # requests per minute per user
USER_WINDOW = 60
 
# Layer 3: Endpoint-specific (protects sensitive endpoints)
LOGIN_LIMIT = 5       # login attempts per minute per IP
LOGIN_WINDOW = 60
 
def get_client_ip(request: Request) -> str:
    # Respect X-Forwarded-For if behind a load balancer
    forwarded_for = request.headers.get("X-Forwarded-For")
    if forwarded_for:
        return forwarded_for.split(",")[0].strip()
    return request.client.host
 
async def rate_limit_middleware(
    request: Request,
    user_id: Optional[str] = None
):
    client_ip = get_client_ip(request)
    endpoint = request.url.path
    
    # Layer 1: Always check IP
    if not is_allowed_sliding_window(f"ip:{client_ip}", IP_LIMIT, IP_WINDOW):
        raise HTTPException(
            status_code=429,
            detail="Too many requests from this IP",
            headers={"Retry-After": "60"}
        )
    
    # Layer 2: Check user if authenticated
    if user_id:
        if not is_allowed_token_bucket(f"user:{user_id}", USER_LIMIT, USER_LIMIT/USER_WINDOW):
            raise HTTPException(
                status_code=429,
                detail="Rate limit exceeded. Please slow down.",
                headers={"Retry-After": "60"}
            )
    
    # Layer 3: Tighter limits for sensitive endpoints
    sensitive_endpoints = ["/auth/login", "/auth/register", "/auth/reset-password"]
    if endpoint in sensitive_endpoints:
        if not is_allowed_fixed_window(f"login:{client_ip}", LOGIN_LIMIT, LOGIN_WINDOW):
            raise HTTPException(
                status_code=429,
                detail="Too many authentication attempts",
                headers={"Retry-After": "60", "X-RateLimit-Limit": str(LOGIN_LIMIT)}
            )
 
 
@app.middleware("http")
async def apply_rate_limiting(request: Request, call_next):
    user_id = request.headers.get("X-User-ID")  # Set by auth middleware upstream
    await rate_limit_middleware(request, user_id)
    response = await call_next(request)
    return response
```

* * *

## Distributed Rate Limiting — The Hard Part

A single Redis instance works fine until you have multiple API servers. The challenge: requests from the same client hit different servers, each with their own in-memory state.

### Architecture

```plaintext
Client
  │
  ▼
Load Balancer
  ├── API Server 1 ──┐
  ├── API Server 2 ──┼──► Redis Cluster (shared state)
  └── API Server 3 ──┘
```

All servers share a single Redis cluster for rate limit state. This works well up to very high scale.

### Redis Cluster Considerations

*   Use **Redis Cluster** (not Sentinel) for horizontal scaling
    
*   Hash slot distribution means related keys (same client) land on the same node — consistent
    
*   Use **connection pooling** — don't open a new Redis connection per request
    

```python
# Production Redis setup with connection pooling
import redis
 
pool = redis.ConnectionPool(
    host='your-redis-cluster-endpoint',
    port=6379,
    max_connections=50,
    socket_connect_timeout=1,
    socket_timeout=1
)
 
r = redis.Redis(connection_pool=pool)
```

### What Happens When Redis Goes Down?

Two options and you need to decide upfront:

**Fail open** — if Redis is unavailable, allow all requests. Traffic flows, no revenue impact, risk of abuse.

**Fail closed** — if Redis is unavailable, reject all requests. No abuse, but service is degraded for legitimate users.

For most APIs, **fail open** with alerting is the right call. Add a circuit breaker:

```python
import time
 
redis_healthy = True
last_redis_check = 0
 
def is_allowed_with_fallback(client_id: str, limit: int, window: int) -> bool:
    global redis_healthy, last_redis_check
    
    # Check Redis health every 5 seconds
    if not redis_healthy and time.time() - last_redis_check > 5:
        try:
            r.ping()
            redis_healthy = True
        except:
            last_redis_check = time.time()
            return True  # Fail open
    
    try:
        result = is_allowed_sliding_window(client_id, limit, window)
        redis_healthy = True
        return result
    except Exception:
        redis_healthy = False
        last_redis_check = time.time()
        return True  # Fail open — allow request, log the issue
```

* * *

## Response Headers — Don't Forget These

Always return rate limit headers so clients can self-throttle:

```python
from fastapi import Response
 
def add_rate_limit_headers(
    response: Response,
    limit: int,
    remaining: int,
    reset_at: int  # Unix timestamp
):
    response.headers["X-RateLimit-Limit"] = str(limit)
    response.headers["X-RateLimit-Remaining"] = str(max(0, remaining))
    response.headers["X-RateLimit-Reset"] = str(reset_at)
    response.headers["Retry-After"] = str(reset_at - int(time.time()))
```

Well-behaved API clients use these headers to back off automatically — which means fewer retries hammering your system when limits are hit.

* * *

## Where to Put the Rate Limiter

You have three options:

**Option 1: API Gateway (AWS API Gateway / Kong)**

*   Easiest to set up — configure, not code
    
*   Limited flexibility — hard to do per-user logic
    
*   Good for baseline IP protection
    

**Option 2: Middleware in your application**

*   Full control over logic
    
*   Access to user context, request data, business rules
    
*   The implementation shown above
    

**Option 3: Dedicated rate limit service**

*   Centralised policy management across multiple services
    
*   More infrastructure to maintain
    
*   Worth it at large scale (10+ services)
    

For most teams: **start with middleware, move to a dedicated service when you have 5+ services that need a consistent rate-limiting policy.**

* * *

## Key Takeaways

1.  **Fixed Window** is fine for low-stakes internal APIs — don't over-engineer
    
2.  **Sliding Window** is the pragmatic choice for most public APIs — good accuracy, low cost
    
3.  **Token Bucket** is best when some bursting is acceptable and expected
    
4.  **Leaky Bucket** is for protecting downstream services that need perfectly smooth traffic
    
5.  **Layer your limits** — IP-level and user-level serve different purposes, use both
    
6.  **Always use Lua scripts in Redis** for rate limit operations — atomicity is not optional
    
7.  **Decide your Redis failure mode upfront** — fail open or fail closed, document it
    
8.  **Return rate limit headers** — good clients will use them and reduce your load
    

* * *

## What Would You Do Differently?

Every system has different constraints. Some teams need per-organisation limits, per-endpoint limits, or dynamic limits that change based on subscription tier.

What's the most interesting rate-limiting challenge you've tackled? Drop it in the comments — I read every one.

* * *

*If you found this useful, follow me on* [*LinkedIn*](https://www.linkedin.com/in/manbhasin) *where I post about system design, AWS, and engineering career growth every week.*