
How I'd Design a Chat System Like WhatsApp — WebSockets, Message Delivery, and Scaling to Millions

A practical system design breakdown covering 1-on-1 and group messaging, delivery guarantees, and the architecture decisions that actually matter in production


What Most Chat System Design Articles Get Wrong

Search "design a chat system" and you'll find hundreds of articles. Most of them give you a box labeled "Chat Server" with some arrows pointing at a database and call it a day.

That's not a design. That's a diagram with no decisions in it.

The real challenge in designing a chat system isn't the happy path — it's everything that goes wrong. What happens when a user goes offline mid-message? How do you guarantee a message sent to a group of 500 people actually reaches all of them? What does "delivered" even mean at scale?

This post covers the full architecture for a WhatsApp-style chat system supporting 1-on-1 and group messaging — with real decisions, real tradeoffs, and the three technical problems that will define your design: WebSockets, message delivery guarantees, and scaling to millions of concurrent users.


Requirements

Before touching architecture, let's be precise about what we're building.

Functional requirements:

  • 1-on-1 messaging between users

  • Group messaging (up to 500 members per group)

  • Message delivery receipts (sent, delivered, read)

  • Online/offline presence indicators

  • Message history persistence

  • Media sharing (images, files) — basic support

Non-functional requirements:

  • Low latency — messages should feel real-time (under 100ms end-to-end)

  • High availability — 99.99% uptime, chat can't go down

  • At-least-once delivery — no message ever silently lost

  • Eventual consistency — slight delay in delivery receipts is acceptable

  • Scale — 50 million daily active users, 100 messages per user per day = 5 billion messages per day

Let's use these constraints to drive every architectural decision.


The Naive Approach (And Why It Fails)

The obvious first attempt:

User A → HTTP POST /messages → Server → Database → Poll for new messages → User B

User B polls every few seconds asking "any new messages?" This works for email. It's catastrophic for chat.

At 50M users polling every 3 seconds:

  • 16.6 million requests per second — just for polling

  • Most responses are empty — wasted compute

  • Minimum latency is your polling interval — 3 seconds feels broken for chat

You need persistent connections. Enter WebSockets.


Part 1: WebSockets — The Foundation of Real-Time Messaging

How WebSockets Work

HTTP is request-response — the client always initiates. WebSockets upgrade an HTTP connection to a persistent, bidirectional channel. Once established, either side can send data at any time.

Client                          Server
  │                               │
  │── HTTP GET /chat ──────────►  │
  │   Upgrade: websocket          │
  │◄─ 101 Switching Protocols ──  │
  │                               │
  │  ◄──── persistent connection ────►  │
  │                               │
  │◄── {"type":"message",...} ──  │  (server pushes)
  │── {"type":"ack",...} ────────►│  (client responds)

The connection lifecycle:

# FastAPI WebSocket endpoint
# Assumes an initialised async Redis client (e.g. redis.asyncio.Redis) is available
# as `redis`, and that handle_message() routes incoming payloads (defined elsewhere)
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from typing import Dict
import json

app = FastAPI()

class ConnectionManager:
    def __init__(self):
        # user_id → WebSocket connection
        self.active_connections: Dict[str, WebSocket] = {}

    async def connect(self, user_id: str, websocket: WebSocket):
        await websocket.accept()
        self.active_connections[user_id] = websocket
        await self.update_presence(user_id, online=True)

    async def disconnect(self, user_id: str):
        self.active_connections.pop(user_id, None)
        await self.update_presence(user_id, online=False)

    async def send_to_user(self, user_id: str, message: dict) -> bool:
        websocket = self.active_connections.get(user_id)
        if websocket:
            try:
                await websocket.send_json(message)
                return True
            except (WebSocketDisconnect, RuntimeError):
                # Socket closed mid-send — treat the user as offline
                await self.disconnect(user_id)
        return False  # User offline — message needs queuing

    async def update_presence(self, user_id: str, online: bool):
        # Publish to Redis pub/sub so other servers know
        await redis.publish(
            f"presence:{user_id}",
            json.dumps({"user_id": user_id, "online": online})
        )

manager = ConnectionManager()

@app.websocket("/ws/{user_id}")
async def websocket_endpoint(websocket: WebSocket, user_id: str):
    await manager.connect(user_id, websocket)
    try:
        while True:
            data = await websocket.receive_json()
            await handle_message(user_id, data)
    except WebSocketDisconnect:
        await manager.disconnect(user_id)

The Problem With WebSockets at Scale

A single server can realistically hold on the order of 65,000 concurrent WebSocket connections — the default OS file-descriptor limit is the first ceiling, and even after raising it, per-connection memory and kernel tuning keep practical numbers in that range. At 50M DAU with 20% concurrently online, that's 10M concurrent connections — requiring 153+ chat servers just to hold connections.

This creates the core architectural challenge: User A's connection is on Server 1. User B's connection is on Server 7. How does a message from A reach B?

The answer is a message broker acting as the backbone between servers.


The Architecture

Here's the full system:

                    ┌─────────────────────────────┐
                    │         Clients              │
                    │   (WebSocket connections)    │
                    └──────────┬──────────────────┘
                               │
                    ┌──────────▼──────────────────┐
                    │      Load Balancer           │
                    │  (sticky sessions by user)   │
                    └──────────┬──────────────────┘
                               │
          ┌────────────────────┼────────────────────┐
          │                    │                    │
   ┌──────▼──────┐    ┌────────▼──────┐    ┌───────▼──────┐
   │  Chat Server│    │  Chat Server  │    │  Chat Server  │
   │     #1      │    │     #2        │    │     #3        │
   └──────┬──────┘    └──────┬────────┘    └──────┬───────┘
          │                  │                    │
          └──────────────────┼────────────────────┘
                             │
                    ┌────────▼────────────┐
                    │   Message Broker    │
                    │  (Kafka / SQS)      │
                    └────────┬────────────┘
                             │
          ┌──────────────────┼────────────────────┐
          │                  │                    │
   ┌──────▼──────┐    ┌──────▼──────┐    ┌───────▼──────┐
   │  Message    │    │   Presence  │    │  Push Notif  │
   │  Storage    │    │   Service   │    │  Service     │
   │ (DynamoDB)  │    │  (Redis)    │    │ (APNs/FCM)   │
   └─────────────┘    └─────────────┘    └──────────────┘

Component Responsibilities

Load Balancer — Sticky Sessions

Route each user to the same chat server for the duration of their session. This keeps WebSocket connections stable and avoids re-routing overhead. Use consistent hashing on user_id.
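
Here's a minimal sketch of what consistent hashing on user_id looks like — in practice you'd lean on your load balancer's built-in hashing or an existing library rather than rolling your own, but the mechanics are simple enough to show:

import bisect
import hashlib

class ConsistentHashRing:
    """Maps user_ids to chat servers; adding or removing a server only remaps a small slice of users."""

    def __init__(self, servers: list, replicas: int = 100):
        self.replicas = replicas
        self.ring = {}          # hash position → server
        self.sorted_keys = []   # sorted hash positions
        for server in servers:
            self.add_server(server)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, server: str):
        # Virtual nodes smooth out the distribution
        for i in range(self.replicas):
            pos = self._hash(f"{server}#{i}")
            self.ring[pos] = server
            bisect.insort(self.sorted_keys, pos)

    def get_server(self, user_id: str) -> str:
        pos = self._hash(user_id)
        idx = bisect.bisect(self.sorted_keys, pos) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

# ring = ConsistentHashRing(["chat-1", "chat-2", "chat-3"])
# ring.get_server("user-42")  # → e.g. "chat-2", stable across sessions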

Chat Servers — Connection Holders

Each chat server does three things:

  1. Maintains WebSocket connections for its assigned users

  2. Receives incoming messages and publishes to the message broker

  3. Subscribes to the message broker and delivers messages to connected users
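
Responsibility 3 is the one that trips people up, so here's a rough sketch of the delivery loop using aiokafka — it assumes the per-user topic naming from the fan-out code later in this post, and the deliver_message helper from Part 2:

import json
from aiokafka import AIOKafkaConsumer

async def delivery_loop(server_id: str, user_topics: list):
    # Each chat server consumes only the topics of users currently connected to it
    consumer = AIOKafkaConsumer(
        *user_topics,
        bootstrap_servers="kafka:9092",          # placeholder address
        group_id=f"chat-server-{server_id}",
        value_deserializer=lambda v: json.loads(v.decode()),
    )
    await consumer.start()
    try:
        async for record in consumer:
            message = record.value
            recipient_id = record.topic.removeprefix("user.")
            # Falls back to the offline queue + push notification if the socket is gone
            await deliver_message(message, recipient_id)
    finally:
        await consumer.stop()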

Message Broker (Kafka)

The backbone. Every message flows through Kafka, which provides:

  • Guaranteed delivery between chat servers

  • Replay capability for debugging

  • Fan-out for group messages

  • Decoupling of send and receive paths
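
One caveat: "guaranteed delivery" only holds if the producer is configured for it. A sketch of the producer setup assumed by the kafka_producer.send(...) calls throughout this post (the broker address is a placeholder):

import json
from aiokafka import AIOKafkaProducer

async def create_producer() -> AIOKafkaProducer:
    producer = AIOKafkaProducer(
        bootstrap_servers="kafka:9092",   # placeholder
        acks="all",                       # wait for all in-sync replicas before acking
        value_serializer=lambda v: json.dumps(v, default=str).encode(),
    )
    await producer.start()
    return producer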

Message Storage (DynamoDB)

Persistent message history. Chat's access pattern fits DynamoDB well — you almost always query by conversation_id for the most recent messages.

Presence Service (Redis)

Tracks who is online. Redis pub/sub broadcasts presence changes to all interested servers in real time.

Push Notification Service

When a recipient is offline, route to APNs (iOS) or FCM (Android) instead of a WebSocket.
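
As a rough sketch of the FCM half of that path (the device-token lookup is hypothetical, and APNs would need an equivalent branch):

import asyncio
import firebase_admin
from firebase_admin import messaging

firebase_admin.initialize_app()  # picks up GOOGLE_APPLICATION_CREDENTIALS by default

async def notify_offline_user(user_id: str, title: str, body: str):
    token = await get_device_token(user_id)  # hypothetical device-registry lookup
    msg = messaging.Message(
        token=token,
        notification=messaging.Notification(title=title, body=body),
    )
    # messaging.send() is blocking, so run it on a worker thread inside an async server
    await asyncio.to_thread(messaging.send, msg)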


Part 2: Message Delivery Guarantees — The Hard Part

This is where most system design discussions stop too early. Let's go deeper.

The Three Guarantees

At-most-once: Message is sent once. If delivery fails, it's not retried. Messages can be lost. Never acceptable for chat.

At-least-once: Message is retried until acknowledged. Messages might be delivered multiple times (duplicates). Acceptable if you handle deduplication.

Exactly-once: Message is delivered precisely once, no duplicates, no losses. Theoretically ideal. In practice, extremely hard to implement correctly across distributed systems.

For chat systems: target at-least-once delivery with client-side deduplication.
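
Client-side deduplication is cheap: keep a bounded set of recently seen message IDs and drop anything you've already rendered. A minimal sketch (the window size is arbitrary):

from collections import OrderedDict

class MessageDeduplicator:
    """Remembers the last N message_ids; returns False for anything already seen."""

    def __init__(self, max_size: int = 10_000):
        self.max_size = max_size
        self.seen = OrderedDict()

    def accept(self, message_id: str) -> bool:
        if message_id in self.seen:
            self.seen.move_to_end(message_id)
            return False  # duplicate — drop it
        self.seen[message_id] = None
        if len(self.seen) > self.max_size:
            self.seen.popitem(last=False)  # evict the oldest entry
        return True

# dedup = MessageDeduplicator()
# if dedup.accept(incoming["message_id"]):
#     render_message(incoming)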

The Message Flow With Delivery Guarantees

1. Sender assigns a client-generated idempotency key (UUID)
2. Message sent to Chat Server
3. Chat Server persists to DynamoDB (idempotency key as sort key)
4. Chat Server publishes to Kafka
5. Chat Server returns ACK to sender → sender marks as "sent" ✓
6. Kafka consumer delivers to recipient's Chat Server
7. Recipient's Chat Server delivers via WebSocket
8. Recipient's client sends ACK
9. Chat Server publishes delivery receipt to Kafka
10. Sender's Chat Server receives receipt → sender marks "delivered" ✓
11. Recipient opens conversation → "read" receipt sent
12. Sender marks "read" ✓

The idempotency key is critical. If step 3 succeeds but the server crashes before step 5, the sender will retry. Without an idempotency key, you get duplicate messages. With it, the database insert is a no-op on retry.

import uuid
from datetime import datetime

class Message:
    def __init__(
        self,
        sender_id: str,
        conversation_id: str,
        content: str,
        idempotency_key: str = None
    ):
        self.message_id = str(uuid.uuid4())
        self.idempotency_key = idempotency_key or str(uuid.uuid4())
        self.sender_id = sender_id
        self.conversation_id = conversation_id
        self.content = content
        self.timestamp = datetime.utcnow().isoformat()
        self.status = "sent"

async def send_message(message: Message) -> dict:
    # Idempotent write — the conditional put only deduplicates when a retry reuses
    # the same primary key, so the client must resend the original message_id and
    # timestamp (or the table must key on the idempotency key) when it retries
    try:
        await dynamodb.put_item(
            TableName="messages",
            Item={
                "conversation_id": {"S": message.conversation_id},
                "message_id": {"S": message.message_id},
                "idempotency_key": {"S": message.idempotency_key},
                "sender_id": {"S": message.sender_id},
                "content": {"S": message.content},
                "timestamp": {"S": message.timestamp},
                "status": {"S": "sent"}
            },
            ConditionExpression="attribute_not_exists(idempotency_key)"
        )
    except dynamodb.exceptions.ConditionalCheckFailedException:
        # Already saved — idempotent, return success
        pass

    # Publish to Kafka regardless (Kafka consumer handles dedup)
    await kafka_producer.send(
        topic="messages",
        key=message.conversation_id.encode(),
        value=message.__dict__
    )

    return {"status": "sent", "message_id": message.message_id}

Handling Offline Users

When a recipient is offline, the message must not be lost. The flow changes:

async def deliver_message(message: dict, recipient_id: str):
    # Try WebSocket first
    delivered = await manager.send_to_user(recipient_id, message)

    if not delivered:
        # User offline — store in their message queue
        await dynamodb.put_item(
            TableName="offline_queue",
            Item={
                "user_id": {"S": recipient_id},
                "timestamp": {"S": message["timestamp"]},
                "message_id": {"S": message["message_id"]},
                "message": {"S": json.dumps(message)}
            }
        )
        # Send push notification
        await push_service.notify(
            user_id=recipient_id,
            title=f"New message from {message['sender_id']}",
            body=message["content"][:100]
        )

async def on_user_connect(user_id: str):
    # Drain offline queue on reconnect
    queued = await dynamodb.query(
        TableName="offline_queue",
        KeyConditionExpression="user_id = :uid",
        ExpressionAttributeValues={":uid": {"S": user_id}},
        ScanIndexForward=True  # oldest first
    )

    for item in queued["Items"]:
        message = json.loads(item["message"]["S"])
        await manager.send_to_user(user_id, message)

    # Clear the queue
    for item in queued["Items"]:
        await dynamodb.delete_item(
            TableName="offline_queue",
            Key={
                "user_id": {"S": user_id},
                "timestamp": {"S": item["timestamp"]["S"]}
            }
        )

Part 3: Group Messaging — The Fan-Out Problem

1-on-1 messaging is a solved problem at this point. Group messaging is where things get genuinely hard.

The Fan-Out Challenge

When User A sends a message to a group of 500 people:

  • 500 delivery operations need to happen

  • Members are spread across many chat servers

  • Some members are offline

  • Some members have the conversation muted

  • The operation needs to be fast from A's perspective

Two approaches:

Fan-out on write: When a message is sent, immediately write it to every member's inbox. Each member pulls from their own inbox.

Pros: Fast reads — inbox is pre-computed
Cons: Expensive writes — 500 writes per message
     Wasteful for large groups with many inactive members

Fan-out on read: Store the message once. When a member opens the conversation, compute their view.

Pros: Single write per message, storage efficient
Cons: Expensive reads — compute on every open
     Slower first load for large groups

Hybrid approach (what WhatsApp actually uses):

  • For small groups (< 100 members): fan-out on write — fast delivery to active members

  • For large groups (100-500 members): store once, use Kafka partitions per group, let each member's server pull

import asyncio

SMALL_GROUP_THRESHOLD = 100

async def handle_group_message(message: Message, group_id: str):
    # Get group members
    members = await get_group_members(group_id)

    if len(members) <= SMALL_GROUP_THRESHOLD:
        # Fan-out on write — direct delivery
        await fan_out_write(message, members)
    else:
        # Store once, publish to group topic
        await store_message(message)
        await kafka_producer.send(
            topic=f"group.{group_id}",
            value=message.__dict__
        )

async def fan_out_write(message: Message, members: list):
    # Publish one Kafka event per member
    tasks = [
        kafka_producer.send(
            topic=f"user.{member_id}",
            value=message.__dict__
        )
        for member_id in members
        if member_id != message.sender_id
    ]
    await asyncio.gather(*tasks)

Group Delivery Receipts — Don't Do What iMessage Does

Showing individual read receipts for every member in a 500-person group is a scaling nightmare — 500 receipt events per message read.

Practical approach:

  • Store receipts as a bitmap or counter, not individual records

  • Show "Delivered to N members" rather than individual names

  • Only compute individual receipts for groups under 20 members

async def update_group_receipt(
    message_id: str,
    group_id: str,
    user_id: str,
    receipt_type: str  # "delivered" or "read"
):
    # Guard the f-string below — only known receipt types may reach the update expression
    if receipt_type not in ("delivered", "read"):
        raise ValueError(f"Unknown receipt_type: {receipt_type}")

    # Atomic increment — no race conditions
    await dynamodb.update_item(
        TableName="group_receipts",
        Key={
            "message_id": {"S": message_id},
            "group_id": {"S": group_id}
        },
        UpdateExpression=f"ADD {receipt_type}_count :inc",
        ExpressionAttributeValues={":inc": {"N": "1"}}
    )

DynamoDB Schema

The access patterns for chat are simple but the schema design matters enormously for cost and performance.

Table: messages
├── PK: conversation_id (String)     ← partition by conversation
├── SK: timestamp#message_id (String) ← sort by time, unique
├── sender_id (String)
├── content (String)
├── message_type (String)            ← text, image, video
├── status (String)                  ← sent, delivered, read
└── idempotency_key (String)

GSI: sender_id-timestamp-index
├── PK: sender_id
└── SK: timestamp
(For "messages sent by user" queries)

Table: conversations
├── PK: user_id (String)
├── SK: last_message_timestamp (String)
├── conversation_id (String)
├── participant_ids (List)
└── unread_count (Number)

Table: offline_queue
├── PK: user_id (String)
└── SK: timestamp (String)

Key design decisions:

  • Partition by conversation_id not user_id — conversations are the natural unit of access

  • Use composite sort key timestamp#message_id — enables time-range queries and guarantees uniqueness even for messages sent at the same millisecond

  • TTL on offline_queue — auto-expire after 30 days so storage doesn't grow unbounded
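
With that schema, "load the latest messages in a conversation" is a single query against one partition. A sketch using the same low-level DynamoDB client assumed elsewhere in this post:

async def get_recent_messages(conversation_id: str, limit: int = 50) -> list:
    # Latest page of a conversation: one partition, sort key walked newest-first
    resp = await dynamodb.query(
        TableName="messages",
        KeyConditionExpression="conversation_id = :cid",
        ExpressionAttributeValues={":cid": {"S": conversation_id}},
        ScanIndexForward=False,   # newest first
        Limit=limit,
    )
    # To page further back, pass ExclusiveStartKey=resp.get("LastEvaluatedKey")
    return resp["Items"]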


Presence System

Real-time online/offline indicators are deceptively complex at scale.

The naive approach: Query the database for last-seen timestamp on every profile view. At scale, this creates a read hotspot.

Better approach: Redis with TTL + pub/sub

PRESENCE_TTL = 60  # seconds

async def heartbeat(user_id: str):
    # Client sends heartbeat every 30 seconds
    # Server refreshes TTL — if it expires, user is offline
    await redis.setex(
        f"presence:{user_id}",
        PRESENCE_TTL,
        "online"
    )

async def is_online(user_id: str) -> bool:
    return await redis.exists(f"presence:{user_id}") == 1

async def subscribe_to_presence(user_ids: list, callback):
    # Subscribe to presence changes for a list of users
    # Used to update UI in real time when contacts go online/offline
    pubsub = redis.pubsub()
    channels = [f"presence:{uid}" for uid in user_ids]
    await pubsub.subscribe(*channels)

    async for message in pubsub.listen():
        if message["type"] == "message":
            await callback(json.loads(message["data"]))

Presence at scale consideration: Don't broadcast presence to all followers. WhatsApp only shows presence to mutual contacts — and even then, only when the user opens a conversation. This limits the fan-out to manageable levels.


Scaling to Millions of Users

The Numbers

Metric                               Value
Daily Active Users                   50 million
Concurrent connections (20%)         10 million
Messages per day                     5 billion
Messages per second (peak 3x avg)    ~174,000
Average message size                 1 KB
Storage per day                      ~5 TB
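
The last two rows follow directly from the first three — a quick back-of-envelope check:

dau = 50_000_000
messages_per_day = dau * 100                           # 5,000,000,000
avg_mps = messages_per_day / 86_400                    # ~57,870 messages/second
peak_mps = avg_mps * 3                                 # ~173,600 → call it ~174,000
storage_per_day_tb = messages_per_day * 1_024 / 1e12   # ~5.1 TB at ~1 KB per message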

Horizontal Scaling Plan

Chat servers: Stateless except for active WebSocket connections. Scale horizontally — add servers as concurrent connections grow. Target: 50,000 connections per server = 200 servers at peak.

Kafka: Partition by conversation_id. This ensures all messages in a conversation are ordered and processed by the same consumer. Use 1,000 partitions — allows scaling to 1,000 parallel consumers.

DynamoDB: Serverless — scales automatically. At 174,000 writes/second, provision ~200,000 WCU with auto-scaling. Cost at this scale: ~$35,000/month. Optimise with DynamoDB Accelerator (DAX) for read-heavy workloads like message history.

Redis (Presence): Use Redis Cluster. Shard by user_id. At 10M concurrent users, each key is ~100 bytes = ~1GB total — fits comfortably in a 3-node Redis cluster.

Geographic Distribution

For a global user base, co-locate users with the closest region:

User in Singapore → AP-Southeast Chat Servers → AP Kafka Cluster
User in London    → EU-West Chat Servers    → EU Kafka Cluster

Cross-region messages:
AP Kafka → Cross-region replication → EU Kafka → EU Chat Server → User

Use AWS Global Accelerator to route users to the nearest chat server cluster. Accept the extra cross-region hop for international messages — typically 100-200ms between distant regions — it's a fair trade and much cheaper than running a single global cluster.


Media Sharing — The Bandwidth Problem

Never send media through the chat server. It'll saturate your WebSocket connections.

The right approach:

1. Client requests a pre-signed S3 URL from the media service
2. Client uploads directly to S3 (bypasses chat server entirely)
3. Client sends a message containing the S3 object key
4. Recipient's client downloads directly from CloudFront (CDN)

import uuid

import boto3
from botocore.config import Config

s3 = boto3.client(
    's3',
    config=Config(signature_version='s3v4')
)

async def get_upload_url(
    user_id: str,
    file_type: str,
    file_size_bytes: int
) -> dict:
    # Validate file size (50MB limit)
    if file_size_bytes > 50 * 1024 * 1024:
        raise ValueError("File too large")

    object_key = f"media/{user_id}/{uuid.uuid4()}"

    presigned_url = s3.generate_presigned_url(
        'put_object',
        Params={
            'Bucket': 'chat-media-bucket',
            'Key': object_key,
            'ContentType': file_type,
            'ContentLength': file_size_bytes
        },
        ExpiresIn=300  # 5 minutes to complete upload
    )

    return {
        "upload_url": presigned_url,
        "object_key": object_key,
        "cdn_url": f"https://cdn.yourdomain.com/{object_key}"
    }

This keeps your chat servers lean — they only handle small JSON payloads, never binary data.
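
On the client side, using the pre-signed URL is just an HTTP PUT. A sketch with the requests library (the endpoint path and auth scheme are illustrative):

import requests

def upload_media(api_base: str, auth_token: str, file_path: str, file_type: str) -> str:
    with open(file_path, "rb") as f:
        data = f.read()

    # 1. Ask the media service for a pre-signed URL (endpoint name is illustrative)
    resp = requests.post(
        f"{api_base}/media/upload-url",
        json={"file_type": file_type, "file_size_bytes": len(data)},
        headers={"Authorization": f"Bearer {auth_token}"},
    )
    resp.raise_for_status()
    upload = resp.json()

    # 2. Upload the bytes straight to S3 — the chat server never sees them
    put = requests.put(
        upload["upload_url"],
        data=data,
        headers={"Content-Type": file_type},
    )
    put.raise_for_status()

    # 3. Return the object key to embed in the chat message
    return upload["object_key"]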


What I'd Skip in V1

Not everything needs to be built on day one:

  • End-to-end encryption — Signal Protocol is complex. Add it in V2 once the core is stable.

  • Voice and video calls — WebRTC is a separate system entirely. Separate service, separate team.

  • Message reactions — Nice to have. A simple DynamoDB list attribute works until you have emoji reaction analytics requirements.

  • Message editing and deletion — Soft delete first (mark as deleted), hard delete later when compliance requirements are clear.

  • Multi-device sync — Start with single-device. Multi-device sync (like WhatsApp Web) is a significant engineering effort involving device registration and message fan-out to device sets.


Key Takeaways

  1. WebSockets are necessary but not sufficient — you also need a message broker between servers to route messages across your fleet

  2. Target at-least-once delivery with client-side deduplication — exactly-once is theoretically appealing but practically expensive

  3. Idempotency keys are non-negotiable — retries without them create duplicate messages

  4. Fan-out strategy depends on group size — write fan-out for small groups, read fan-out for large ones

  5. Never route media through chat servers — S3 pre-signed URLs + CloudFront keep your servers fast

  6. Presence is a separate problem — Redis TTL + pub/sub, not a database column

  7. Partition Kafka by conversation_id — preserves message ordering within a conversation

  8. DynamoDB partition key is conversation, not user — conversations are the natural unit of access in chat


What Would You Design Differently?

Every chat system has its own constraints. Some teams prioritise encryption above all else. Others need to support 10,000-member broadcast channels. Some operate in regions where push notifications are unreliable.

What's the trickiest chat-related engineering problem you've encountered? Drop it in the comments — I read every one.


Follow me on LinkedIn for weekly posts on system design, AWS, and engineering career growth.