System Design: Chat Application

Design a real-time chat application covering WebSocket connections, message storage, group chats, presence systems, and end-to-end encryption

Problem Statement

Design a real-time chat application like WhatsApp, Slack, or Facebook Messenger. The system must support one-on-one messaging, group chats, online/offline status, and message delivery guarantees. This problem tests your understanding of real-time communication, persistent connections, and distributed messaging.

Step 1: Requirements

Functional Requirements

  • One-on-one messaging between users
  • Group chat (up to 500 members)
  • Online/offline/last seen status (presence)
  • Message delivery status: sent, delivered, read
  • Push notifications for offline users
  • Media sharing (images, files)
  • Message history and search

Non-Functional Requirements

  • Real-time delivery (<100ms for online users)
  • Message ordering guaranteed within a conversation
  • No message loss (at-least-once delivery)
  • Support 50 million daily active users
  • High availability
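A quick back-of-envelope estimate helps turn these requirements into numbers. The sketch below takes the 50 million DAU figure from the list above; the messages-per-user rate, the peak-to-average ratio, and the fraction of users connected at peak are illustrative assumptions, not figures from the requirements.

```typescript
// Back-of-envelope sizing. Only dailyActiveUsers (50M) comes from the
// requirements; every other default is an illustrative assumption.
interface CapacityInputs {
  dailyActiveUsers: number;
  messagesPerUserPerDay: number;  // assumption
  peakToAverageRatio: number;     // assumption
  peakConcurrentFraction: number; // assumption: share of DAU online at peak
}

interface CapacityEstimate {
  messagesPerDay: number;
  averageMessagesPerSecond: number;
  peakMessagesPerSecond: number;
  peakConcurrentConnections: number;
}

const SECONDS_PER_DAY = 86400;

const DEFAULT_INPUTS: CapacityInputs = {
  dailyActiveUsers: 50000000,
  messagesPerUserPerDay: 40,
  peakToAverageRatio: 3,
  peakConcurrentFraction: 0.2,
};

function estimateCapacity(inputs: CapacityInputs = DEFAULT_INPUTS): CapacityEstimate {
  const messagesPerDay = inputs.dailyActiveUsers * inputs.messagesPerUserPerDay;
  const averageMessagesPerSecond = Math.round(messagesPerDay / SECONDS_PER_DAY);
  return {
    messagesPerDay,
    averageMessagesPerSecond,
    peakMessagesPerSecond: averageMessagesPerSecond * inputs.peakToAverageRatio,
    peakConcurrentConnections: Math.round(
      inputs.dailyActiveUsers * inputs.peakConcurrentFraction
    ),
  };
}
```

Under these assumptions the system handles about 2 billion messages per day, roughly 23K messages per second on average and about 69K at peak, with around 10 million connections open at once.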

Step 2: Communication Protocol

The choice of communication protocol is the most fundamental decision for a chat system. Plain HTTP request-response is a poor fit for real-time bidirectional communication because the server cannot push a message until the client asks for it.

Protocol Comparison

Protocol | How It Works | Latency | Use Case
HTTP Polling | Client polls server periodically | High (polling interval) | Not suitable for chat
Long Polling | Server holds request until data available | Medium | Fallback option
WebSocket | Full-duplex persistent connection | Very low | Primary choice for chat
Server-Sent Events | Server pushes to client (one-way) | Low | Notifications only

WebSocket is the clear choice for chat applications. It provides a persistent, bidirectional connection between the client and server, enabling real-time message delivery in both directions with minimal overhead.

// WebSocket connection management
import { WebSocketServer, WebSocket } from "ws";

interface ConnectedUser {
  userId: string;
  socket: WebSocket;
  serverId: string; // Which chat server this user is connected to
}

class ChatWebSocketServer {
  private connections = new Map<string, WebSocket>();
  private wss: WebSocketServer;

  constructor(port: number) {
    this.wss = new WebSocketServer({ port });
    this.wss.on("connection", this.handleConnection.bind(this));
  }

  private handleConnection(socket: WebSocket, request: any) {
    const userId = this.authenticateUser(request);
    if (!userId) {
      socket.close(4001, "Unauthorized");
      return;
    }

    // Register connection
    this.connections.set(userId, socket);
    this.updatePresence(userId, "online");

    socket.on("message", (data) => this.handleMessage(userId, data));

    socket.on("close", () => {
      this.connections.delete(userId);
      this.updatePresence(userId, "offline");
    });

    // Heartbeat to detect stale connections
    socket.on("pong", () => { /* connection is alive */ });
  }

  async sendToUser(targetUserId: string, message: any): Promise<boolean> {
    const socket = this.connections.get(targetUserId);
    if (socket && socket.readyState === WebSocket.OPEN) {
      socket.send(JSON.stringify(message));
      return true; // Delivered
    }
    return false; // User not on this server
  }
}
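The server above registers a "pong" handler but never sends pings, so stale connections would go unnoticed. A minimal sketch of a ping/pong heartbeat follows. The `HeartbeatMonitor` class and `PingableSocket` interface are hypothetical names introduced here; the interface covers only the `ping()` and `terminate()` methods that the `ws` WebSocket class provides, which keeps the logic testable without real connections.

```typescript
// Minimal socket surface the heartbeat needs (hypothetical interface;
// a `ws` WebSocket satisfies it via its ping() and terminate() methods).
interface PingableSocket {
  ping(): void;
  terminate(): void;
}

class HeartbeatMonitor {
  private sockets = new Map<string, PingableSocket>();
  // true = a pong (or the initial connect) was seen since the last sweep
  private alive = new Map<string, boolean>();

  track(userId: string, socket: PingableSocket): void {
    this.sockets.set(userId, socket);
    this.alive.set(userId, true);
  }

  untrack(userId: string): void {
    this.sockets.delete(userId);
    this.alive.delete(userId);
  }

  // Call from the socket's "pong" handler
  markAlive(userId: string): void {
    if (this.sockets.has(userId)) this.alive.set(userId, true);
  }

  // Run on a timer. Terminates sockets that did not answer the previous
  // ping, pings the rest, and returns the user IDs that were dropped.
  sweep(): string[] {
    const dropped: string[] = [];
    this.sockets.forEach((socket, userId) => {
      if (!this.alive.get(userId)) {
        socket.terminate();
        dropped.push(userId);
        return;
      }
      this.alive.set(userId, false);
      socket.ping();
    });
    for (const userId of dropped) this.untrack(userId);
    return dropped;
  }
}
```

In the server above, this would be wired up by calling `track` on connect, `markAlive` inside the "pong" handler, `untrack` on close, and `sweep` from a `setInterval` (for example every 30 seconds, an assumed interval).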

Step 3: System Architecture

A chat system must handle millions of concurrent WebSocket connections distributed across multiple servers. The architecture needs a way to route messages between users who may be connected to different servers.

Key Components

  • WebSocket Servers (Chat Servers): Maintain persistent connections with clients. Each server handles 50K-100K concurrent connections.
  • Connection Registry (Redis): Maps userId to the chat server they are connected to.
  • Message Queue (Kafka): Decouples message processing from delivery. Provides durability and preserves order within each partition.
  • Message Storage (Cassandra): Stores message history with efficient time-range queries.
  • Presence Service: Tracks online/offline status of users.
  • Push Notification Service: Delivers notifications to offline users via APNs/FCM.

// Message flow: User A sends a message to User B

interface ChatMessage {
  id: string;            // Globally unique ID (Snowflake)
  conversationId: string;
  senderId: string;
  content: string;
  contentType: "text" | "image" | "file";
  timestamp: number;
  status: "sent" | "delivered" | "read";
}

class MessageRouter {
  private redis: RedisClient;
  private kafka: KafkaProducer;
  private pushService: PushNotificationService;

  async routeMessage(message: ChatMessage, recipientId: string): Promise<void> {
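    // Step 1: Publish to Kafka so the message is durable before delivery
    // (a downstream consumer, not shown here, would write it to storage)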
    await this.kafka.publish("messages", {
      key: message.conversationId, // Partition by conversation for ordering
      value: message,
    });

    // Step 2: Find which server the recipient is connected to
    const serverInfo = await this.redis.get(`conn:${recipientId}`);

    if (serverInfo) {
      // User is online - route to their chat server
      const { serverId } = JSON.parse(serverInfo);
      await this.forwardToServer(serverId, recipientId, message);
    } else {
      // User is offline - send push notification
      await this.pushService.send(recipientId, {
        title: `New message from ${message.senderId}`,
        body: message.content.substring(0, 100),
        data: { conversationId: message.conversationId },
      });
    }
  }

  private async forwardToServer(
    serverId: string,
    recipientId: string,
    message: ChatMessage
  ): Promise<void> {
    // Use internal RPC or pub/sub to reach the right server
    await this.redis.publish(`server:${serverId}`, JSON.stringify({
      type: "deliver",
      recipientId,
      message,
    }));
  }
}
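The router only publishes to the `server:${serverId}` channel; each chat server still has to consume its own channel and hand the message to the local socket. A minimal sketch of that receiving side follows. The function name, the result type, and the synchronous `deliverLocally` callback are assumptions made for brevity; in the lesson's code the local send is the async `sendToUser`.

```typescript
// Envelope shape published by forwardToServer
interface DeliverEnvelope {
  type: "deliver";
  recipientId: string;
  message: unknown;
}

// Hypothetical local send: returns true if the user has an open socket here
type LocalDelivery = (userId: string, message: unknown) => boolean;

type DeliveryResult = "delivered" | "recipient_gone" | "ignored";

function isDeliverEnvelope(value: unknown): value is DeliverEnvelope {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    candidate.type === "deliver" &&
    typeof candidate.recipientId === "string" &&
    "message" in candidate
  );
}

// Handles one raw payload received on this server's pub/sub channel
function handleServerChannelMessage(
  raw: string,
  deliverLocally: LocalDelivery
): DeliveryResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    return "ignored"; // Malformed payload: drop it rather than crash
  }
  if (!isDeliverEnvelope(parsed)) return "ignored";

  // The registry entry may be stale: the user could have disconnected
  // between the Redis lookup and this delivery attempt.
  return deliverLocally(parsed.recipientId, parsed.message)
    ? "delivered"
    : "recipient_gone";
}
```

A "recipient_gone" result is the case worth handling explicitly: the caller could fall back to a push notification so the message is not silently dropped.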

Step 4: Message Storage and Retrieval

Chat message storage needs to handle extremely high write throughput and support efficient retrieval of message history within a conversation.

// Message storage schema (Cassandra)
// Partition key: conversation_id
// Clustering key: message_id (Snowflake, so it's time-ordered)

// CREATE TABLE messages (
//   conversation_id TEXT,
//   message_id BIGINT,
//   sender_id TEXT,
//   content TEXT,
//   content_type TEXT,
//   created_at TIMESTAMP,
//   PRIMARY KEY (conversation_id, message_id)
// ) WITH CLUSTERING ORDER BY (message_id DESC);

class MessageStore {
  private cassandra: CassandraClient;

  // Get message history for a conversation (paginated)
  async getMessages(
    conversationId: string,
    beforeMessageId?: string,
    limit = 50
  ): Promise<ChatMessage[]> {
    let query = "SELECT * FROM messages WHERE conversation_id = ?";
    const params: any[] = [conversationId];

    if (beforeMessageId) {
      query += " AND message_id < ?";
      params.push(beforeMessageId);
    }

    query += " ORDER BY message_id DESC LIMIT ?";
    params.push(limit);

    return this.cassandra.execute(query, params);
  }

  // Store a new message
  async saveMessage(message: ChatMessage): Promise<void> {
    await this.cassandra.execute(
      "INSERT INTO messages (conversation_id, message_id, sender_id, content, content_type, created_at) VALUES (?, ?, ?, ?, ?, ?)",
      [message.conversationId, message.id, message.senderId, message.content, message.contentType, message.timestamp]
    );
  }
}
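The schema relies on Snowflake-style message IDs being time-ordered. A minimal sketch of such a generator is shown below, using the common 41-bit timestamp, 10-bit worker ID, and 12-bit sequence layout. The custom epoch, the class name, and the injectable clock are assumptions for illustration, and the sketch simply throws if the clock moves backwards.

```typescript
// 64-bit layout: 41-bit millisecond timestamp | 10-bit worker | 12-bit sequence
const WORKER_BITS = 10;
const SEQUENCE_BITS = 12;
const MAX_WORKER_ID = (1 << WORKER_BITS) - 1;  // 1023
const MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1; // 4095
const CUSTOM_EPOCH_MS = 1704067200000; // 2024-01-01T00:00:00Z (arbitrary assumption)

class SnowflakeGenerator {
  private workerId: number;
  private now: () => number;
  private lastTimestamp = -1;
  private sequence = 0;

  constructor(workerId: number, now: () => number = Date.now) {
    if (!Number.isInteger(workerId) || workerId < 0 || workerId > MAX_WORKER_ID) {
      throw new RangeError("workerId must be an integer in [0, 1023]");
    }
    this.workerId = workerId;
    this.now = now;
  }

  next(): bigint {
    let timestamp = this.now();
    if (timestamp < this.lastTimestamp) {
      throw new Error("Clock moved backwards; refusing to generate IDs");
    }
    if (timestamp === this.lastTimestamp) {
      this.sequence = (this.sequence + 1) & MAX_SEQUENCE;
      if (this.sequence === 0) {
        // 4096 IDs already issued this millisecond: wait for the next one
        while (timestamp <= this.lastTimestamp) timestamp = this.now();
      }
    } else {
      this.sequence = 0;
    }
    this.lastTimestamp = timestamp;

    const elapsed = BigInt(timestamp - CUSTOM_EPOCH_MS);
    return (
      (elapsed << BigInt(WORKER_BITS + SEQUENCE_BITS)) |
      (BigInt(this.workerId) << BigInt(SEQUENCE_BITS)) |
      BigInt(this.sequence)
    );
  }
}

// Recover the creation time (ms since Unix epoch) from an ID
function snowflakeTimestamp(id: bigint): number {
  return Number(id >> BigInt(WORKER_BITS + SEQUENCE_BITS)) + CUSTOM_EPOCH_MS;
}
```

IDs from a single generator are strictly increasing. IDs from different workers are only roughly time-ordered, since they depend on each machine's clock. These values exceed `Number.MAX_SAFE_INTEGER`, so they should travel as strings or `bigint` rather than plain numbers.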

Step 5: Group Chat Design

Group chats add complexity because a single message must be delivered to multiple recipients. The approach depends on group size.

interface GroupChat {
  id: string;
  name: string;
  memberIds: string[];
  adminIds: string[];
  createdAt: Date;
}

class GroupMessageHandler {
  private messageStore: MessageStore;
  private router: MessageRouter;

  async sendGroupMessage(
    groupId: string,
    message: ChatMessage
  ): Promise<void> {
    // Persist the message once (not per member)
    await this.messageStore.saveMessage(message);

    // Get group members
    const members = await this.getGroupMembers(groupId);

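    // Deliver to each member (except sender). Note: routeMessage as written
    // earlier also publishes to Kafka, so a production group path would call
    // a delivery-only variant to avoid re-publishing the message per member.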
    const deliveryPromises = members
      .filter((memberId) => memberId !== message.senderId)
      .map((memberId) => this.router.routeMessage(message, memberId));

    // Fan-out delivery in parallel
    await Promise.allSettled(deliveryPromises);
  }
}

// For large groups (>100 members), consider:
// 1. Batch delivery to reduce per-message overhead
// 2. Use a pub/sub channel per group instead of individual delivery
// 3. Rate limit messages to prevent spam
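One way to batch delivery for larger groups is to group recipients by the chat server they are connected to, so each server receives one forwarded payload instead of one per member. The sketch below illustrates the grouping step only. `planGroupFanOut` and the synchronous `lookup` callback are hypothetical; in practice the lookup would be a batched read against the Redis connection registry.

```typescript
// Stand-in for the connection registry: userId -> serverId, or undefined
// when the user has no active connection (assumed synchronous for brevity).
type ConnectionLookup = (userId: string) => string | undefined;

interface FanOutPlan {
  byServer: Map<string, string[]>; // serverId -> recipients connected there
  offline: string[];               // recipients who need a push notification
}

function planGroupFanOut(
  memberIds: string[],
  senderId: string,
  lookup: ConnectionLookup
): FanOutPlan {
  const plan: FanOutPlan = { byServer: new Map<string, string[]>(), offline: [] };

  for (const memberId of memberIds) {
    if (memberId === senderId) continue; // Never echo back to the sender

    const serverId = lookup(memberId);
    if (serverId === undefined) {
      plan.offline.push(memberId);
      continue;
    }

    const bucket = plan.byServer.get(serverId);
    if (bucket) {
      bucket.push(memberId);
    } else {
      plan.byServer.set(serverId, [memberId]);
    }
  }
  return plan;
}
```

For a 500-member group spread over a handful of servers, this reduces the number of cross-server publishes from one per member to one per server.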

Step 6: Online/Offline Status (Presence)

Presence tracking tells users which of their contacts are currently online. This seems simple but is challenging at scale because status changes are frequent and must be propagated to many interested users.

class PresenceService {
  private redis: RedisClient;

  // Called when user connects
  async setOnline(userId: string): Promise<void> {
    await this.redis.set(`presence:${userId}`, "online");
    // Notify friends/contacts about status change
    await this.broadcastStatusChange(userId, "online");
  }

  // Called when user disconnects
  async setOffline(userId: string): Promise<void> {
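    // Note: this marks the user offline immediately. To tolerate brief
    // disconnects (e.g. a network switch), delay this update and skip it
    // if the user reconnects within a short grace period.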
    await this.redis.set(`presence:${userId}`, "offline");
    await this.redis.set(`last_seen:${userId}`, Date.now().toString());
    await this.broadcastStatusChange(userId, "offline");
  }

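  // Heartbeat approach: clients send heartbeats every 30 seconds.
  // The presence key is written with a 60-second TTL, so if heartbeats stop
  // the key expires and isOnline() returns false. An expiry is silent: it
  // does not update last_seen or broadcast a status change.
  async heartbeat(userId: string): Promise<void> {
    await this.redis.setex(`heartbeat:${userId}`, 60, "alive");
    await this.redis.setex(`presence:${userId}`, 60, "online");
  }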

  async isOnline(userId: string): Promise<boolean> {
    return (await this.redis.get(`presence:${userId}`)) === "online";
  }

  // Only broadcast to users who have a conversation with this user
  // Don't broadcast to ALL users (too expensive)
  private async broadcastStatusChange(
    userId: string,
    status: string
  ): Promise<void> {
    const recentContacts = await this.getRecentContacts(userId);
    for (const contactId of recentContacts) {
      await this.router.sendToUser(contactId, {
        type: "presence_update",
        userId,
        status,
        lastSeen: status === "offline" ? Date.now() : undefined,
      });
    }
  }
}
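To avoid status flapping on brief disconnects, one option is to drive presence purely from heartbeats: a dropped socket changes nothing by itself, and a user only goes offline once heartbeats have stopped for the full timeout. The following in-memory sketch shows that logic. The class name and injectable clock are assumptions, and a real deployment would keep this state in Redis rather than in process memory.

```typescript
class HeartbeatPresence {
  private lastHeartbeat = new Map<string, number>();
  private online = new Set<string>();
  private timeoutMs: number;
  private now: () => number;

  constructor(timeoutMs = 60000, now: () => number = Date.now) {
    this.timeoutMs = timeoutMs;
    this.now = now;
  }

  // Records a heartbeat. Returns true only when the user just came online,
  // i.e. when an "online" status change should be broadcast.
  heartbeat(userId: string): boolean {
    this.lastHeartbeat.set(userId, this.now());
    if (this.online.has(userId)) return false;
    this.online.add(userId);
    return true;
  }

  // Run periodically. Returns users whose heartbeats stopped for at least
  // timeoutMs, i.e. those needing an "offline" broadcast.
  sweep(): string[] {
    const cutoff = this.now() - this.timeoutMs;
    const wentOffline: string[] = [];
    this.online.forEach((userId) => {
      const last = this.lastHeartbeat.get(userId);
      if (last === undefined || last <= cutoff) wentOffline.push(userId);
    });
    for (const userId of wentOffline) this.online.delete(userId);
    return wentOffline;
  }

  isOnline(userId: string): boolean {
    return this.online.has(userId);
  }

  // The last heartbeat doubles as the "last seen" time
  lastSeen(userId: string): number | undefined {
    return this.lastHeartbeat.get(userId);
  }
}
```

A client that reconnects within the timeout keeps sending heartbeats, so its contacts never see an offline/online pair for that blip.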

Step 7: Push Notifications

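When a user is offline, there is no open connection to deliver to, so the system alerts them through platform-specific push services (APNs for iOS, FCM for Android). The message itself stays in storage and is fetched when the app reconnects.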

  • Store device tokens per user (users may have multiple devices)
  • Respect user notification preferences (muted conversations, do-not-disturb)
  • Batch notifications for group chats to avoid notification flooding
  • Include enough context for the notification to be useful without revealing full message content
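The preference and privacy points above can be sketched as a single function that decides whether to notify and what to show. All names here are hypothetical, including the `showPreview` setting, and the 100-character limit simply mirrors the truncation used in the routing code earlier.

```typescript
interface NotificationPreferences {
  doNotDisturb: boolean;
  mutedConversationIds: Set<string>;
  showPreview: boolean; // hypothetical setting: include message text
}

interface OfflineMessage {
  conversationId: string;
  senderName: string;
  contentType: "text" | "image" | "file";
  content: string;
}

interface PushPayload {
  title: string;
  body: string;
  data: { conversationId: string };
}

const MAX_PREVIEW_LENGTH = 100;

// Returns null when the user should not be notified at all
function buildPushPayload(
  message: OfflineMessage,
  prefs: NotificationPreferences
): PushPayload | null {
  if (prefs.doNotDisturb) return null;
  if (prefs.mutedConversationIds.has(message.conversationId)) return null;

  let body: string;
  if (message.contentType === "image") {
    body = "Sent an image";
  } else if (message.contentType === "file") {
    body = "Sent a file";
  } else if (!prefs.showPreview) {
    body = "New message"; // Hide the text on the lock screen
  } else if (message.content.length > MAX_PREVIEW_LENGTH) {
    body = message.content.slice(0, MAX_PREVIEW_LENGTH - 3) + "...";
  } else {
    body = message.content;
  }

  return {
    title: message.senderName,
    body,
    data: { conversationId: message.conversationId },
  };
}
```

The resulting payload would then be sent once per stored device token for the user.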

Step 8: End-to-End Encryption Basics

End-to-end encryption (E2EE) ensures that only the sender and recipient can read messages. The server only sees encrypted ciphertext and cannot decrypt it.

Signal Protocol (Used by WhatsApp)

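  • Key Exchange: Each user generates a long-term identity key pair plus a set of prekeys and uploads the public halves to the server, so a sender can establish a shared secret even while the recipient is offline.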
  • Double Ratchet Algorithm: Generates a new encryption key for every message, providing forward secrecy.
  • Forward Secrecy: Even if a key is compromised, past messages cannot be decrypted.
  • Server's Role: The server only stores and forwards encrypted messages. It never has access to plaintext.
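To make the "new key for every message" idea concrete, here is a minimal sketch of the symmetric half of a ratchet: an HMAC-based chain where each step derives one message key and the next chain key, after which the old chain key is discarded. This is an illustration only, not the Signal implementation. It omits the Diffie-Hellman ratchet and the actual encryption, and the class name is hypothetical. It uses Node's built-in `crypto` module.

```typescript
import { createHmac } from "node:crypto";

// One chain step: derive a message key and the next chain key from the
// current chain key, using distinct single-byte inputs for each output.
function ratchetStep(chainKey: Buffer): { messageKey: Buffer; nextChainKey: Buffer } {
  const messageKey = createHmac("sha256", chainKey)
    .update(Buffer.from([0x01]))
    .digest();
  const nextChainKey = createHmac("sha256", chainKey)
    .update(Buffer.from([0x02]))
    .digest();
  return { messageKey, nextChainKey };
}

class SymmetricRatchet {
  private chainKey: Buffer;

  // initialChainKey would come from the key agreement step (assumed here)
  constructor(initialChainKey: Buffer) {
    this.chainKey = Buffer.from(initialChainKey);
  }

  // Returns the key for the next message and advances the chain.
  // The previous chain key is overwritten, so earlier message keys
  // cannot be re-derived from the current state.
  nextMessageKey(): Buffer {
    const { messageKey, nextChainKey } = ratchetStep(this.chainKey);
    this.chainKey = nextChainKey;
    return messageKey;
  }
}
```

Because HMAC is one-way, an attacker who captures the current chain key cannot walk the chain backwards to recover keys for earlier messages, which is the forward secrecy property described above. Sender and receiver stay in sync by starting from the same chain key and stepping once per message.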

E2EE Trade-offs

  • Server-side search is impossible: The server cannot index or search encrypted messages
  • Multi-device is complex: Each device needs its own key pair and messages must be encrypted per device
  • Group chat key management: Adding/removing members requires rekeying the group
  • Backup encryption: Cloud backups must also be encrypted to maintain E2EE guarantees

Architecture Summary

  • Protocol: WebSocket for real-time bidirectional messaging
  • Message Routing: Redis for connection registry, pub/sub for cross-server delivery
  • Storage: Cassandra for message history (high write throughput, time-ordered)
  • Presence: Redis with heartbeat-based detection, broadcast only to recent contacts
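  • Ordering: Kafka partitioning by conversation ID preserves order within a conversation; time-sortable Snowflake IDs provide the sort key in storage (IDs from different generators are only approximately ordered, since they depend on each machine's clock)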
  • Delivery Guarantees: At-least-once via retry + deduplication with message IDs
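
At-least-once delivery means a retry can hand the same message to a client twice, so the receiving side has to drop duplicates by message ID. A minimal sketch of a bounded deduplicator follows. The class name and the capacity default are assumptions, and a receiver would still acknowledge duplicates so that the sender stops retrying.

```typescript
class MessageDeduplicator {
  private seen = new Set<string>();
  private order: string[] = []; // Insertion order, oldest first
  private capacity: number;

  constructor(capacity = 10000) {
    if (capacity < 1) throw new RangeError("capacity must be at least 1");
    this.capacity = capacity;
  }

  // Returns true if the message is new and should be processed,
  // false if it is a redelivery of a message already seen.
  accept(messageId: string): boolean {
    if (this.seen.has(messageId)) return false;

    this.seen.add(messageId);
    this.order.push(messageId);

    // Bound memory by forgetting the oldest ID once over capacity
    if (this.order.length > this.capacity) {
      const evicted = this.order.shift();
      if (evicted !== undefined) this.seen.delete(evicted);
    }
    return true;
  }
}
```

The capacity bounds memory at the cost of forgetting very old IDs, so it should comfortably exceed the number of messages that can arrive within the sender's retry window.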
