Problem Statement
Design a real-time chat application like WhatsApp, Slack, or Facebook Messenger. The system must support one-on-one messaging, group chats, online/offline status, and message delivery guarantees. This problem tests your understanding of real-time communication, persistent connections, and distributed messaging.
Step 1: Requirements
Functional Requirements
- One-on-one messaging between users
- Group chat (up to 500 members)
- Online/offline/last seen status (presence)
- Message delivery status: sent, delivered, read
- Push notifications for offline users
- Media sharing (images, files)
- Message history and search
Non-Functional Requirements
- Real-time delivery (<100ms for online users)
- Message ordering guaranteed within a conversation
- No message loss (at-least-once delivery)
- Support 50 million daily active users
- High availability
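A back-of-envelope estimate makes the 50 million DAU target more concrete. The sketch below is illustrative only: apart from the DAU figure, every input (40 messages per user per day, a 3x peak-to-average ratio, 10% of users connected at peak, 75,000 connections per server, 200 bytes per message) is an assumption chosen for the arithmetic, not a number from the requirements.

```typescript
// Back-of-envelope sizing. Every input except dailyActiveUsers is an assumed figure.
interface CapacityAssumptions {
  dailyActiveUsers: number;
  messagesPerUserPerDay: number; // assumption
  peakToAverageRatio: number; // assumption
  peakConcurrentFraction: number; // assumption
  connectionsPerServer: number; // assumption
  avgMessageBytes: number; // assumption
}

interface CapacityEstimate {
  messagesPerDay: number;
  avgMessagesPerSecond: number;
  peakMessagesPerSecond: number;
  peakConcurrentConnections: number;
  chatServersNeeded: number;
  storageBytesPerDay: number;
}

function estimateCapacity(a: CapacityAssumptions): CapacityEstimate {
  const secondsPerDay = 86400;
  const messagesPerDay = a.dailyActiveUsers * a.messagesPerUserPerDay;
  const avgMessagesPerSecond = messagesPerDay / secondsPerDay;
  const peakConcurrentConnections = Math.round(
    a.dailyActiveUsers * a.peakConcurrentFraction
  );
  return {
    messagesPerDay,
    avgMessagesPerSecond,
    peakMessagesPerSecond: avgMessagesPerSecond * a.peakToAverageRatio,
    peakConcurrentConnections,
    chatServersNeeded: Math.ceil(peakConcurrentConnections / a.connectionsPerServer),
    storageBytesPerDay: messagesPerDay * a.avgMessageBytes,
  };
}

const estimate = estimateCapacity({
  dailyActiveUsers: 50000000,
  messagesPerUserPerDay: 40,
  peakToAverageRatio: 3,
  peakConcurrentFraction: 0.1,
  connectionsPerServer: 75000,
  avgMessageBytes: 200,
});
console.log(estimate);
```

Under these assumed inputs the system handles 2 billion messages per day (roughly 23,000 per second on average), needs 67 chat servers for 5 million concurrent connections, and writes about 400 GB of raw message text per day. Changing any assumption shifts the results proportionally.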
Step 2: Communication Protocol
The communication protocol is the most fundamental decision for a chat system. Plain HTTP request-response is a poor fit for real-time bidirectional communication because the server cannot send data until the client asks for it.
Protocol Comparison
| Protocol | How It Works | Latency | Use Case |
|---|---|---|---|
| HTTP Polling | Client polls server periodically | High (polling interval) | Not suitable for chat |
| Long Polling | Server holds request until data available | Medium | Fallback option |
| WebSocket | Full-duplex persistent connection | Very low | Primary choice for chat |
| Server-Sent Events | Server pushes to client (one-way) | Low | Notifications only |
WebSocket is the clear choice for chat applications. It provides a persistent, bidirectional connection between the client and server, enabling real-time message delivery in both directions with minimal overhead.
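// WebSocket connection management
import type { IncomingMessage } from "node:http";
import { WebSocketServer, WebSocket } from "ws";

interface ConnectedUser {
  userId: string;
  socket: WebSocket;
  serverId: string; // Which chat server this user is connected to
}

// Authentication, presence updates, and message handling are application-specific,
// so they are left abstract here.
abstract class ChatWebSocketServer {
  private connections = new Map<string, WebSocket>();
  private alive = new Set<WebSocket>(); // Sockets that answered the last ping
  private wss: WebSocketServer;

  constructor(port: number, heartbeatIntervalMs = 30000) {
    this.wss = new WebSocketServer({ port });
    this.wss.on("connection", this.handleConnection.bind(this));

    // Heartbeat to detect stale connections: terminate sockets that did not
    // answer the previous ping, then ping the rest
    const timer = setInterval(() => {
      this.wss.clients.forEach((socket) => {
        if (!this.alive.has(socket)) {
          socket.terminate();
          return;
        }
        this.alive.delete(socket);
        socket.ping();
      });
    }, heartbeatIntervalMs);
    this.wss.on("close", () => clearInterval(timer));
  }

  protected abstract authenticateUser(request: IncomingMessage): string | null;
  protected abstract updatePresence(userId: string, status: "online" | "offline"): void;
  protected abstract handleMessage(userId: string, data: WebSocket.RawData): void;

  private handleConnection(socket: WebSocket, request: IncomingMessage) {
    const userId = this.authenticateUser(request);
    if (!userId) {
      socket.close(4001, "Unauthorized");
      return;
    }

    // Register connection
    this.connections.set(userId, socket);
    this.alive.add(socket);
    this.updatePresence(userId, "online");

    socket.on("message", (data) => this.handleMessage(userId, data));
    socket.on("pong", () => this.alive.add(socket)); // Connection is alive
    socket.on("close", () => {
      this.alive.delete(socket);
      // Ignore the close of an old socket if the user has already reconnected
      if (this.connections.get(userId) === socket) {
        this.connections.delete(userId);
        this.updatePresence(userId, "offline");
      }
    });
  }

  async sendToUser(targetUserId: string, message: unknown): Promise<boolean> {
    const socket = this.connections.get(targetUserId);
    if (socket && socket.readyState === WebSocket.OPEN) {
      socket.send(JSON.stringify(message));
      return true; // Written to the socket; not yet acknowledged by the client
    }
    return false; // User not on this server
  }
}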
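Frames arriving over the socket need to be validated before any handler acts on them. The following is a minimal sketch of that step; the frame names (`send_message`, `delivery_ack`, `read_receipt`) and their fields are hypothetical, chosen to mirror the sent/delivered/read statuses in the requirements, and are not a published wire protocol.

```typescript
// Hypothetical client-to-server frames; names and fields are illustrative.
type ClientFrame =
  | { type: "send_message"; clientMessageId: string; conversationId: string; content: string }
  | { type: "delivery_ack"; messageId: string }
  | { type: "read_receipt"; conversationId: string; upToMessageId: string };

const isNonEmptyString = (v: unknown): v is string =>
  typeof v === "string" && v.length > 0;

// Returns a validated frame, or null if the payload is malformed.
function parseClientFrame(raw: string): ClientFrame | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // Not valid JSON
  }
  if (typeof data !== "object" || data === null) return null;

  const { type, clientMessageId, conversationId, content, messageId, upToMessageId } =
    data as Record<string, unknown>;

  switch (type) {
    case "send_message":
      if (
        isNonEmptyString(clientMessageId) &&
        isNonEmptyString(conversationId) &&
        isNonEmptyString(content)
      ) {
        return { type: "send_message", clientMessageId, conversationId, content };
      }
      return null;
    case "delivery_ack":
      return isNonEmptyString(messageId) ? { type: "delivery_ack", messageId } : null;
    case "read_receipt":
      if (isNonEmptyString(conversationId) && isNonEmptyString(upToMessageId)) {
        return { type: "read_receipt", conversationId, upToMessageId };
      }
      return null;
    default:
      return null; // Unknown frame type
  }
}

const parsedAck = parseClientFrame('{"type":"delivery_ack","messageId":"42"}');
const parsedGarbage = parseClientFrame("not json");
```

Returning `null` for anything unrecognized keeps the handler's switch exhaustive over known frame types and lets the server drop or log malformed input in one place.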
Step 3: System Architecture
A chat system must handle millions of concurrent WebSocket connections distributed across multiple servers. The architecture needs a way to route messages between users who may be connected to different servers.
Key Components
- WebSocket Servers (Chat Servers): Maintain persistent connections with clients. Each server handles 50K-100K concurrent connections.
- Connection Registry (Redis): Maps userId to the chat server they are connected to.
- Message Queue (Kafka): Decouples message processing from delivery. Provides durability and preserves ordering within a partition, so messages are partitioned by conversation ID.
- Message Storage (Cassandra): Stores message history with efficient time-range queries.
- Presence Service: Tracks online/offline status of users.
- Push Notification Service: Delivers notifications to offline users via APNs/FCM.
// Message flow: User A sends a message to User B
interface ChatMessage {
  id: string; // Globally unique ID (Snowflake)
  conversationId: string;
  senderId: string;
  content: string;
  contentType: "text" | "image" | "file";
  timestamp: number;
  status: "sent" | "delivered" | "read";
}

// Placeholder interfaces for the collaborators used below (not a specific client library)
interface RedisClient {
  get(key: string): Promise<string | null>;
  publish(channel: string, payload: string): Promise<number>;
}
interface KafkaProducer {
  publish(topic: string, record: { key: string; value: unknown }): Promise<void>;
}
interface PushNotificationService {
  send(
    userId: string,
    notification: { title: string; body: string; data: Record<string, string> }
  ): Promise<void>;
}

class MessageRouter {
  constructor(
    private redis: RedisClient,
    private kafka: KafkaProducer,
    private pushService: PushNotificationService
  ) {}

  async routeMessage(message: ChatMessage, recipientId: string): Promise<void> {
    // Step 1: Publish to the durable log; a downstream consumer writes it to storage
    await this.kafka.publish("messages", {
      key: message.conversationId, // Partition by conversation for ordering
      value: message,
    });

    // Step 2: Find which server the recipient is connected to
    const serverInfo = await this.redis.get(`conn:${recipientId}`);
    if (serverInfo) {
      // User is online - route to their chat server
      const { serverId } = JSON.parse(serverInfo);
      await this.forwardToServer(serverId, recipientId, message);
    } else {
      // User is offline - send push notification
      await this.pushService.send(recipientId, {
        title: `New message from ${message.senderId}`,
        body: message.content.substring(0, 100),
        data: { conversationId: message.conversationId },
      });
    }
  }

  private async forwardToServer(
    serverId: string,
    recipientId: string,
    message: ChatMessage
  ): Promise<void> {
    // Use internal RPC or pub/sub to reach the right server
    await this.redis.publish(`server:${serverId}`, JSON.stringify({
      type: "deliver",
      recipientId,
      message,
    }));
  }
}
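The router above only publishes to a `server:<serverId>` channel; the receiving chat server still has to subscribe and hand the message to the local socket. A minimal sketch of that receiving side follows. `InMemoryPubSub`, `ChatServerNode`, and `LocalSocket` are hypothetical stand-ins so the example runs without Redis or real WebSockets.

```typescript
type Handler = (payload: string) => void;

// In-memory stand-in for Redis pub/sub so the sketch runs without a Redis server.
class InMemoryPubSub {
  private channels = new Map<string, Set<Handler>>();

  subscribe(channel: string, handler: Handler): void {
    const handlers = this.channels.get(channel) ?? new Set<Handler>();
    handlers.add(handler);
    this.channels.set(channel, handlers);
  }

  // Returns the number of subscribers that received the payload.
  publish(channel: string, payload: string): number {
    const handlers = this.channels.get(channel);
    if (!handlers) return 0;
    handlers.forEach((handler) => handler(payload));
    return handlers.size;
  }
}

interface LocalSocket {
  send(data: string): void;
}

interface DeliverEnvelope {
  type: "deliver";
  recipientId: string;
  message: unknown;
}

class ChatServerNode {
  private localSockets = new Map<string, LocalSocket>();
  // Recipients that had disconnected by the time the envelope arrived.
  readonly missed: string[] = [];

  constructor(serverId: string, bus: InMemoryPubSub) {
    bus.subscribe(`server:${serverId}`, (payload) => this.onEnvelope(payload));
  }

  attach(userId: string, socket: LocalSocket): void {
    this.localSockets.set(userId, socket);
  }

  private onEnvelope(payload: string): void {
    const envelope = JSON.parse(payload) as DeliverEnvelope;
    if (envelope.type !== "deliver") return;
    const socket = this.localSockets.get(envelope.recipientId);
    if (!socket) {
      // Registry entry was stale; a real system would fall back to push here
      this.missed.push(envelope.recipientId);
      return;
    }
    socket.send(JSON.stringify(envelope.message));
  }
}

// Usage: bob is connected to chat-2, carol is not connected anywhere.
const bus = new InMemoryPubSub();
const serverA = new ChatServerNode("chat-1", bus);
const serverB = new ChatServerNode("chat-2", bus);
const bobInbox: string[] = [];
serverA.attach("alice", { send: () => undefined });
serverB.attach("bob", { send: (data) => { bobInbox.push(data); } });

bus.publish("server:chat-2", JSON.stringify({
  type: "deliver", recipientId: "bob", message: { id: "1", content: "hi" },
}));
bus.publish("server:chat-2", JSON.stringify({
  type: "deliver", recipientId: "carol", message: { id: "2", content: "hello" },
}));
```

The `missed` list highlights a race worth handling: the registry can say a user is on a server that they have just left, so the receiving server needs a fallback path rather than silently dropping the message.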
Step 4: Message Storage and Retrieval
Chat message storage needs to handle extremely high write throughput and support efficient retrieval of message history within a conversation.
// Message storage schema (Cassandra)
// Partition key: conversation_id
// Clustering key: message_id (Snowflake, so it's time-ordered)
// CREATE TABLE messages (
//   conversation_id TEXT,
//   message_id BIGINT,
//   sender_id TEXT,
//   content TEXT,
//   content_type TEXT,
//   created_at TIMESTAMP,
//   PRIMARY KEY (conversation_id, message_id)
// ) WITH CLUSTERING ORDER BY (message_id DESC);

// Placeholder for a Cassandra client wrapper (not a specific driver API)
interface CassandraClient {
  execute<T = unknown>(query: string, params: unknown[]): Promise<T[]>;
}

class MessageStore {
  constructor(private cassandra: CassandraClient) {}

  // Get message history for a conversation (paginated, newest first)
  async getMessages(
    conversationId: string,
    beforeMessageId?: string,
    limit = 50
  ): Promise<ChatMessage[]> {
    let query = "SELECT * FROM messages WHERE conversation_id = ?";
    const params: unknown[] = [conversationId];
    if (beforeMessageId) {
      query += " AND message_id < ?";
      params.push(beforeMessageId);
    }
    query += " ORDER BY message_id DESC LIMIT ?";
    params.push(limit);
    // Rows come back with snake_case column names; map them to ChatMessage fields in real code
    return this.cassandra.execute<ChatMessage>(query, params);
  }

  // Store a new message
  async saveMessage(message: ChatMessage): Promise<void> {
    // message.id is the decimal string form of the Snowflake ID; convert it to the
    // driver's 64-bit integer type if the driver requires that for BIGINT columns
    await this.cassandra.execute(
      "INSERT INTO messages (conversation_id, message_id, sender_id, content, content_type, created_at) VALUES (?, ?, ?, ?, ?, ?)",
      [message.conversationId, message.id, message.senderId, message.content, message.contentType, new Date(message.timestamp)]
    );
  }
}
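The schema relies on Snowflake-style IDs being time-ordered. As a rough sketch of how such an ID can be built, the generator below packs a millisecond timestamp, a worker ID, and a per-millisecond sequence into 64 bits using the common 41/10/12 bit layout. The class name, the custom epoch, and the choice to advance the logical timestamp instead of waiting when the sequence overflows are assumptions made for this example.

```typescript
const WORKER_BITS = BigInt(10);
const SEQUENCE_BITS = BigInt(12);
const MAX_WORKER_ID = 1023; // 2^10 - 1
const MAX_SEQUENCE = 4095; // 2^12 - 1

class SnowflakeGenerator {
  private lastMs = -1;
  private sequence = 0;
  private readonly workerId: number;
  private readonly epochMs: number;
  private readonly now: () => number;

  constructor(
    workerId: number,
    epochMs = 1577836800000, // 2020-01-01T00:00:00Z, an assumed custom epoch
    now: () => number = () => Date.now()
  ) {
    if (!Number.isInteger(workerId) || workerId < 0 || workerId > MAX_WORKER_ID) {
      throw new RangeError(`workerId must be an integer in [0, ${MAX_WORKER_ID}]`);
    }
    this.workerId = workerId;
    this.epochMs = epochMs;
    this.now = now;
  }

  next(): bigint {
    let ms = this.now();
    // If the clock went backwards, keep using the last timestamp so IDs stay monotonic
    if (ms < this.lastMs) ms = this.lastMs;

    if (ms === this.lastMs) {
      this.sequence += 1;
      if (this.sequence > MAX_SEQUENCE) {
        // Sequence exhausted for this millisecond: advance the logical timestamp
        ms = this.lastMs + 1;
        this.sequence = 0;
      }
    } else {
      this.sequence = 0;
    }
    this.lastMs = ms;

    const timestampPart = BigInt(ms - this.epochMs) << (WORKER_BITS + SEQUENCE_BITS);
    const workerPart = BigInt(this.workerId) << SEQUENCE_BITS;
    return timestampPart | workerPart | BigInt(this.sequence);
  }
}

// Usage: IDs from one generator are strictly increasing.
const idGenerator = new SnowflakeGenerator(1);
const firstId = idGenerator.next();
const secondId = idGenerator.next();
console.log(firstId.toString(), secondId.toString(), secondId > firstId);
```

IDs from a single generator are strictly increasing, but IDs from different workers are only ordered as well as their clocks agree. `BigInt` values are not JSON-serializable, so the ID would typically travel as a decimal string, which matches the `id: string` field on `ChatMessage`.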
Step 5: Group Chat Design
Group chats add complexity because a single message must be delivered to multiple recipients. The approach depends on group size.
interface GroupChat {
  id: string;
  name: string;
  memberIds: string[];
  adminIds: string[];
  createdAt: Date;
}

class GroupMessageHandler {
  constructor(
    private messageStore: MessageStore,
    private router: MessageRouter,
    private getGroupMembers: (groupId: string) => Promise<string[]>
  ) {}

  async sendGroupMessage(
    groupId: string,
    message: ChatMessage
  ): Promise<void> {
    // Persist the message once (not per member)
    await this.messageStore.saveMessage(message);

    // Get group members
    const members = await this.getGroupMembers(groupId);

    // Deliver to each member (except sender)
    const recipients = members.filter((memberId) => memberId !== message.senderId);

    // Fan-out delivery in parallel; one failed delivery does not block the others
    const results = await Promise.allSettled(
      recipients.map((memberId) => this.router.routeMessage(message, memberId))
    );

    // allSettled never rejects, so failed deliveries must be surfaced explicitly
    results.forEach((result, i) => {
      if (result.status === "rejected") {
        console.error(`Delivery to ${recipients[i]} failed`, result.reason);
      }
    });
  }
}

// For large groups (>100 members), consider:
// 1. Batch delivery to reduce per-message overhead
// 2. Use a pub/sub channel per group instead of individual delivery
// 3. Rate limit messages to prevent spam
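One way to approach the batching idea for large groups is to group recipients by the chat server they are connected to, so each server receives a single envelope with a recipient list instead of one envelope per member. The sketch below shows only the planning step; `planGroupFanOut` and `FanOutPlan` are hypothetical names, and the registry is modeled as a plain `Map` from user ID to server ID.

```typescript
interface FanOutPlan {
  perServer: Map<string, string[]>; // serverId -> recipients connected there
  offline: string[]; // recipients with no registry entry (candidates for push)
}

function planGroupFanOut(
  memberIds: readonly string[],
  senderId: string,
  registry: ReadonlyMap<string, string> // userId -> serverId
): FanOutPlan {
  const perServer = new Map<string, string[]>();
  const offline: string[] = [];

  for (const memberId of memberIds) {
    if (memberId === senderId) continue; // Never echo to the sender

    const serverId = registry.get(memberId);
    if (serverId === undefined) {
      offline.push(memberId);
      continue;
    }

    const batch = perServer.get(serverId);
    if (batch) {
      batch.push(memberId);
    } else {
      perServer.set(serverId, [memberId]);
    }
  }
  return { perServer, offline };
}

// Usage: five members spread over two servers, one of them offline.
const registry = new Map<string, string>([
  ["alice", "chat-1"],
  ["bob", "chat-1"],
  ["carol", "chat-1"],
  ["dave", "chat-2"],
]);
const plan = planGroupFanOut(["alice", "bob", "carol", "dave", "erin"], "alice", registry);
console.log(plan.perServer.size, plan.offline);
```

Here four recipients collapse into two cross-server publishes plus one push candidate. For a 500-member group spread over a few dozen servers, the saving in published envelopes grows accordingly.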
Step 6: Online/Offline Status (Presence)
Presence tracking tells users which of their contacts are currently online. This seems simple but is challenging at scale because status changes are frequent and must be propagated to many interested users.
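// Placeholder interface for the Redis commands used below (not a specific client library)
interface RedisClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<unknown>;
  setex(key: string, seconds: number, value: string): Promise<unknown>;
}

class PresenceService {
  constructor(
    private redis: RedisClient,
    private router: { sendToUser(userId: string, message: unknown): Promise<boolean> },
    private getRecentContacts: (userId: string) => Promise<string[]>
  ) {}

  // Called when user connects
  async setOnline(userId: string): Promise<void> {
    // The TTL lets the key expire if the server crashes before setOffline runs
    await this.redis.setex(`presence:${userId}`, 60, "online");
    // Notify friends/contacts about status change
    await this.broadcastStatusChange(userId, "online");
  }

  // Called when user disconnects
  async setOffline(userId: string): Promise<void> {
    // This marks the user offline immediately. To tolerate brief disconnects,
    // delay this call by a short grace period and skip it if the user reconnects.
    await this.redis.set(`presence:${userId}`, "offline");
    await this.redis.set(`last_seen:${userId}`, Date.now().toString());
    await this.broadcastStatusChange(userId, "offline");
  }

  // Heartbeat approach: clients send heartbeats every 30 seconds.
  // The presence key expires after 60 seconds without a heartbeat, so isOnline
  // then returns false. Expiry alone does not broadcast or record last seen.
  async heartbeat(userId: string): Promise<void> {
    await this.redis.setex(`presence:${userId}`, 60, "online");
  }

  async isOnline(userId: string): Promise<boolean> {
    return (await this.redis.get(`presence:${userId}`)) === "online";
  }

  // Only broadcast to users who have a conversation with this user
  // Don't broadcast to ALL users (too expensive)
  private async broadcastStatusChange(
    userId: string,
    status: "online" | "offline"
  ): Promise<void> {
    const recentContacts = await this.getRecentContacts(userId);
    const update = {
      type: "presence_update",
      userId,
      status,
      lastSeen: status === "offline" ? Date.now() : undefined,
    };
    // Send in parallel; one unreachable contact should not delay the others
    await Promise.allSettled(
      recentContacts.map((contactId) => this.router.sendToUser(contactId, update))
    );
  }
}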
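A grace period helps avoid status flapping when a client briefly drops and reconnects. The sketch below tracks pending disconnects in memory and only reports a user as offline once the grace period has elapsed without a reconnect. `GracePeriodPresence`, the 10-second default, and the explicit `nowMs` parameters (used instead of timers so the behaviour is deterministic) are all assumptions for illustration.

```typescript
type PresenceStatus = "online" | "offline";

class GracePeriodPresence {
  private status = new Map<string, PresenceStatus>();
  private pendingOffline = new Map<string, number>(); // userId -> disconnect time (ms)
  private readonly graceMs: number;

  constructor(graceMs = 10000) {
    this.graceMs = graceMs;
  }

  // Returns true when the change should be broadcast to contacts.
  onConnect(userId: string): boolean {
    this.pendingOffline.delete(userId); // Cancel any pending offline transition
    const wasOnline = this.status.get(userId) === "online";
    this.status.set(userId, "online");
    return !wasOnline;
  }

  // Records the disconnect but keeps the user "online" until the grace period expires.
  onDisconnect(userId: string, nowMs: number): void {
    if (this.status.get(userId) === "online") {
      this.pendingOffline.set(userId, nowMs);
    }
  }

  // Call periodically. Returns the users that just transitioned to offline.
  sweep(nowMs: number): string[] {
    const expired: string[] = [];
    this.pendingOffline.forEach((disconnectedAt, userId) => {
      if (nowMs - disconnectedAt >= this.graceMs) expired.push(userId);
    });
    for (const userId of expired) {
      this.pendingOffline.delete(userId);
      this.status.set(userId, "offline");
    }
    return expired;
  }

  getStatus(userId: string): PresenceStatus {
    return this.status.get(userId) ?? "offline";
  }
}

// Usage: a 5-second blip produces no status change; a longer absence does.
const presence = new GracePeriodPresence(10000);
const firstConnectBroadcast = presence.onConnect("u1");
presence.onDisconnect("u1", 0);
const reconnectBroadcast = presence.onConnect("u1"); // reconnected within the grace period
presence.onDisconnect("u1", 6000);
const earlySweep = presence.sweep(10000); // only 4s since the last disconnect
const lateSweep = presence.sweep(16000); // 10s since the last disconnect
```

In a multi-server deployment the pending-disconnect state would need to live somewhere shared, since the reconnect may land on a different chat server than the one that saw the disconnect.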
Step 7: Push Notifications
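When a user is offline, the system alerts them through platform-specific push services (APNs for iOS, FCM for Android). The notification only signals that something is waiting; the messages themselves are fetched from storage when the client reconnects.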
- Store device tokens per user (users may have multiple devices)
- Respect user notification preferences (muted conversations, do-not-disturb)
- Batch notifications for group chats to avoid notification flooding
- Include enough context for the notification to be useful without revealing full message content
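The considerations above can be sketched as a single function that decides whether to notify and what to send. All names here (`NotificationPrefs`, `NotifiableMessage`, `PushPayload`, `buildPushPayloads`) are hypothetical, and `collapseKey` is a generic field standing in for whatever collapse or grouping identifier the push provider supports.

```typescript
interface NotificationPrefs {
  mutedConversationIds: Set<string>;
  doNotDisturb: boolean;
  showPreview: boolean; // false hides message content in the notification
}

interface NotifiableMessage {
  conversationId: string;
  senderName: string;
  content: string;
}

interface PushPayload {
  deviceToken: string;
  title: string;
  body: string;
  collapseKey: string; // lets newer notifications replace older ones per conversation
  data: { conversationId: string };
}

const MAX_BODY_LENGTH = 100;

function truncate(text: string, maxLength: number): string {
  return text.length > maxLength ? text.slice(0, maxLength - 3) + "..." : text;
}

// Returns one payload per registered device, or none if the user should not be notified.
function buildPushPayloads(
  message: NotifiableMessage,
  prefs: NotificationPrefs,
  deviceTokens: readonly string[]
): PushPayload[] {
  if (prefs.doNotDisturb || prefs.mutedConversationIds.has(message.conversationId)) {
    return [];
  }
  const body = prefs.showPreview
    ? truncate(message.content, MAX_BODY_LENGTH)
    : "New message";

  return deviceTokens.map((deviceToken) => ({
    deviceToken,
    title: message.senderName,
    body,
    collapseKey: message.conversationId,
    data: { conversationId: message.conversationId },
  }));
}

// Usage
const incoming: NotifiableMessage = {
  conversationId: "c-1",
  senderName: "Alice",
  content: "x".repeat(150),
};
const payloads = buildPushPayloads(
  incoming,
  { mutedConversationIds: new Set<string>(), doNotDisturb: false, showPreview: true },
  ["token-phone", "token-tablet"]
);
```

Keying the collapse identifier on the conversation means a burst of group messages can surface as one updating notification per conversation rather than one per message, provided the push provider honours it.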
Step 8: End-to-End Encryption Basics
End-to-end encryption (E2EE) ensures that only the sender and recipient can read messages. The server only sees encrypted ciphertext and cannot decrypt it.
Signal Protocol (Used by WhatsApp)
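- Key Exchange: Each user generates long-term identity keys and a set of prekeys. The public halves are published through the server, so a sender can establish a shared secret even while the recipient is offline (the X3DH key agreement).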
- Double Ratchet Algorithm: Generates a new encryption key for every message, providing forward secrecy.
- Forward Secrecy: Even if a key is compromised, past messages cannot be decrypted.
- Server's Role: The server only stores and forwards encrypted messages. It never has access to plaintext.
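The per-message key idea can be illustrated with the symmetric half of the ratchet: a chain key is pushed through a one-way function to produce a message key and the next chain key. The sketch below uses HMAC-SHA256 with the single-byte constants 0x01 and 0x02, as the Double Ratchet specification suggests for this step. It deliberately omits the Diffie-Hellman ratchet, header handling, and the encryption itself, and the fixed starting key is a demo value only. It is an illustration of the concept, not something to deploy in place of a vetted library.

```typescript
import { createHmac } from "node:crypto";

interface RatchetStep {
  messageKey: Buffer; // used to encrypt exactly one message
  nextChainKey: Buffer; // replaces the current chain key
}

// One step of the symmetric-key ratchet.
function ratchetStep(chainKey: Buffer): RatchetStep {
  const messageKey = createHmac("sha256", chainKey)
    .update(Buffer.from([0x01]))
    .digest();
  const nextChainKey = createHmac("sha256", chainKey)
    .update(Buffer.from([0x02]))
    .digest();
  return { messageKey, nextChainKey };
}

// Derives `count` consecutive message keys from an initial chain key.
function deriveMessageKeys(initialChainKey: Buffer, count: number): Buffer[] {
  const keys: Buffer[] = [];
  let chainKey = initialChainKey;
  for (let i = 0; i < count; i++) {
    const step = ratchetStep(chainKey);
    keys.push(step.messageKey);
    // A real implementation erases the old chain key at this point
    chainKey = step.nextChainKey;
  }
  return keys;
}

// Demo value only; real chain keys come out of the key agreement.
const demoChainKey = Buffer.alloc(32, 7);
const demoKeys = deriveMessageKeys(demoChainKey, 3).map((key) => key.toString("hex"));
console.log(demoKeys);
```

Because HMAC is one-way, someone who obtains a later chain key cannot run the chain backwards to recover earlier message keys. That property is what the forward secrecy bullet refers to, as long as old keys are actually deleted after use.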
E2EE Trade-offs
- Server-side search is impossible: The server cannot index or search encrypted messages
- Multi-device is complex: Each device needs its own key pair and messages must be encrypted per device
- Group chat key management: Adding/removing members requires rekeying the group
- Backup encryption: Cloud backups must also be encrypted to maintain E2EE guarantees
Architecture Summary
- Protocol: WebSocket for real-time bidirectional messaging
- Message Routing: Redis for connection registry, pub/sub for cross-server delivery
- Storage: Cassandra for message history (high write throughput, time-ordered)
- Presence: Redis with heartbeat-based detection, broadcast only to recent contacts
- Ordering: Kafka partitioning by conversation ID preserves per-conversation order; time-sortable Snowflake IDs give Cassandra a matching clustering order
- Delivery Guarantees: At-least-once via retry + deduplication with message IDs
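The at-least-once guarantee in the last point can be sketched as two cooperating pieces: a sender that retransmits until it sees an acknowledgement, and a receiver that ignores message IDs it has already processed. The names below (`DedupingReceiver`, `sendWithRetry`, `lossyLink`) are hypothetical, and the link is simulated synchronously; a real client would use timers with backoff and a bounded deduplication window.

```typescript
interface Envelope {
  id: string;
  content: string;
}

class DedupingReceiver {
  private seenIds = new Set<string>();
  readonly delivered: string[] = [];

  // Applies each message ID at most once, however many times it arrives.
  receive(envelope: Envelope): void {
    if (this.seenIds.has(envelope.id)) return; // Duplicate from a retry
    this.seenIds.add(envelope.id);
    this.delivered.push(envelope.content);
  }
}

// Returns true if an ack came back, false if the message or its ack was lost.
type Transmit = (envelope: Envelope) => boolean;

// Retries until acknowledged. Returns the number of attempts used.
function sendWithRetry(envelope: Envelope, transmit: Transmit, maxAttempts: number): number {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (transmit(envelope)) return attempt;
  }
  throw new Error(`No ack for message ${envelope.id} after ${maxAttempts} attempts`);
}

// Simulated link: the message always arrives, but the first two acks are lost,
// so the sender retransmits a message the receiver has already seen.
const receiver = new DedupingReceiver();
let acksToDrop = 2;
const lossyLink: Transmit = (envelope) => {
  receiver.receive(envelope);
  if (acksToDrop > 0) {
    acksToDrop -= 1;
    return false;
  }
  return true;
};

const attemptsUsed = sendWithRetry({ id: "m-1", content: "hello" }, lossyLink, 5);
console.log(attemptsUsed, receiver.delivered);
```

The sender transmits three times, yet the receiver records the message once. Retries alone give at-least-once delivery; the ID-based deduplication is what keeps the user from seeing the same message repeatedly.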