How to Design Systems That Scale to Millions: Lessons from Real-World Architectures
Theoretical system design is easy. Building systems that actually survive millions of users is hard. This guide distills real-world architecture patterns from companies like Uber, Stripe, and Discord into actionable lessons for engineers at any level.
Every engineer studies system design. Fewer engineers have actually built systems that serve millions of concurrent users under real-world constraints — budget limits, on-call pages at 3 AM, data consistency bugs that only manifest at scale, and features that marketing promised for next quarter.
This article is not another "how to design a URL shortener" tutorial. It is a collection of hard-won lessons from real production architectures, organized into the patterns that matter most. If you are preparing for a system design interview or building your next production system, these are the principles that separate toy projects from systems that survive contact with reality.
1. Database Sharding: What Actually Works
Every scaling article mentions sharding. Few discuss the operational nightmare of maintaining it. Here is what real systems teach us:
Hash-Based Sharding
Discord shards its message storage by channel_id. Every message belongs to a channel, and channels map deterministically to database shards. This works because the access pattern is almost always "fetch messages for this channel" — the shard key aligns with the query pattern.
-- Determine shard from channel_id
-- shard_number = channel_id % total_shards
-- Each shard is a separate PostgreSQL instance.
-- With total_shards = 16:
--   Shard 0: channels 0, 16, 32, ...
--   Shard 1: channels 1, 17, 33, ...
--   Shard N: channels N, N+16, N+32, ...

-- 98234 % 16 = 10, so this query touches shard 10 and nothing else.
SELECT * FROM messages
WHERE channel_id = 98234
ORDER BY created_at DESC
LIMIT 50;
The Golden Rule of Sharding
Your shard key must match your primary access pattern. If you shard users by user_id but your most common query is "find all users in organization X," every single query becomes a scatter-gather across all shards. This is worse than no sharding at all.
| Company | Shard Key | Primary Access Pattern | Why It Works |
|---|---|---|---|
| Discord | channel_id | Messages by channel | Single-shard reads for 99% of queries |
| Stripe | merchant_id | Transactions by merchant | All merchant data co-located |
| Uber | geo_region | Rides/drivers by city | Geographic locality matches real-world access |
| Slack | workspace_id | Messages by workspace | Workspace data is self-contained |
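Here is what the golden rule looks like in code: a minimal TypeScript sketch, assuming one pg connection pool per shard (the pool setup, table names, and the org_id query are illustrative, not taken from any of the systems above).

import { Pool } from "pg";

const TOTAL_SHARDS = 16;
// One connection pool per shard; connection details are illustrative.
const shardPools = Array.from(
  { length: TOTAL_SHARDS },
  (_, i) => new Pool({ database: `app_shard_${i}` }),
);

function shardFor(channelId: number): number {
  return channelId % TOTAL_SHARDS;
}

// Aligned with the shard key: touches exactly one shard.
async function fetchMessages(channelId: number, limit = 50) {
  const pool = shardPools[shardFor(channelId)];
  const res = await pool.query(
    "SELECT * FROM messages WHERE channel_id = $1 ORDER BY created_at DESC LIMIT $2",
    [channelId, limit],
  );
  return res.rows;
}

// Misaligned: org_id is not the shard key, so every shard must be
// queried and the results merged (the scatter-gather to avoid).
async function findUsersInOrg(orgId: string) {
  const perShard = await Promise.all(
    shardPools.map((pool) => pool.query("SELECT * FROM users WHERE org_id = $1", [orgId])),
  );
  return perShard.flatMap((res) => res.rows);
}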
For a complete breakdown of sharding strategies, see our system design curriculum.
2. Caching Hierarchies: The L1/L2/CDN Pattern
Caching is not "just add Redis." Production systems use layered caching strategies where each layer serves a different purpose:
| Layer | Technology | Latency | Hit Rate Target | What It Caches |
|---|---|---|---|---|
| L1 (In-Process) | In-memory map/LRU | <1ms | 60-80% | Hot config, session data, feature flags |
| L2 (Distributed) | Redis / Memcached | 1-5ms | 85-95% | User profiles, computed results, API responses |
| L3 (CDN Edge) | CloudFront / Cloudflare | 5-50ms | 95-99% | Static assets, rendered pages, API responses |
| Origin | Database | 10-100ms | N/A | Source of truth |
Cache Invalidation Strategy
The hardest problem in caching is invalidation. Here are three patterns that work at scale:
- TTL-based expiry: Set a time-to-live and accept stale data within that window. Simple and effective for data that does not need to be real-time (product catalogs, user profiles).
- Write-through: On every write, update the cache simultaneously. Used when read-after-write consistency is critical (shopping carts, account balances).
- Event-driven invalidation: Publish a cache invalidation event to a message queue when data changes. All services that cache that data subscribe and purge their local copies. This is what Uber uses for driver location data; a sketch of the pattern follows the code example below.
// Layered caching example with write-through and TTL (using ioredis)
import Redis from "ioredis";

class CacheHierarchy {
  // L1 entries expire after at most 30s so all nodes converge quickly
  private l1: Map<string, { data: unknown; expiry: number }> = new Map();

  constructor(private redis: Redis) {}

  async get(key: string): Promise<unknown> {
    // L1: in-process check
    const l1Entry = this.l1.get(key);
    if (l1Entry && l1Entry.expiry > Date.now()) {
      return l1Entry.data;
    }
    // L2: Redis check
    const l2Data = await this.redis.get(key);
    if (l2Data !== null) {
      const data = JSON.parse(l2Data);
      // Backfill L1 so the next read skips the network hop
      this.l1.set(key, { data, expiry: Date.now() + 30_000 });
      return data;
    }
    return null; // Cache miss: caller fetches from DB
  }

  async set(key: string, data: unknown, ttlMs: number): Promise<void> {
    // Write-through: update both layers
    this.l1.set(key, { data, expiry: Date.now() + Math.min(ttlMs, 30_000) });
    await this.redis.set(key, JSON.stringify(data), "PX", ttlMs);
  }
}
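The third pattern, event-driven invalidation, deserves its own sketch. Here is a minimal version using Redis pub/sub (a durable log such as Kafka fills the same role at scale, which is closer to what the Uber example implies); writeDriverLocation and localCache are hypothetical stand-ins for the persistence layer and the L1 map above.

import Redis from "ioredis";

declare function writeDriverLocation(driverId: string, lat: number, lng: number): Promise<void>; // hypothetical DB write

// Writer side: persist the change, then broadcast an invalidation event.
const publisher = new Redis();

async function updateDriverLocation(driverId: string, lat: number, lng: number) {
  await writeDriverLocation(driverId, lat, lng);
  await publisher.publish("cache.invalidate", `driver:${driverId}`);
}

// Reader side: every service instance subscribes and purges its local copy.
const localCache = new Map<string, unknown>(); // stands in for the L1 layer
const subscriber = new Redis(); // pub/sub needs a dedicated connection

await subscriber.subscribe("cache.invalidate");
subscriber.on("message", (_channel, key) => {
  localCache.delete(key);
});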
3. Event-Driven vs. Request-Driven at Scale
At low scale, request-driven (synchronous REST/gRPC) architecture works fine. At high scale, event-driven architecture becomes essential for three reasons:
- Decoupling: Services do not need to know about each other. The payment service emits a "payment.completed" event; the notification service, analytics service, and inventory service each consume it independently.
- Resilience: If the notification service goes down, events queue up in Kafka. When it recovers, it processes the backlog. In a synchronous system, the payment service would either fail or need complex retry logic.
- Throughput: Kafka partitions allow parallel processing. Stripe processes millions of webhook events per day using partitioned event streams.
// Event-driven order processing (kafkajs; reserveInventory and
// sendConfirmationEmail are application-level helpers)
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "shop", brokers: ["localhost:9092"] });

// Producer: Order Service
const producer = kafka.producer();
await producer.connect();
await producer.send({
  topic: "order.events",
  messages: [{
    key: "ord_abc123", // keying by order keeps one order's events in order
    value: JSON.stringify({
      type: "order.placed",
      orderId: "ord_abc123",
      userId: "usr_xyz789",
      items: [{ sku: "WIDGET-01", quantity: 2, price: 29.99 }],
      timestamp: new Date().toISOString(),
    }),
  }],
});

// Each service runs in its own consumer group, so every service
// independently receives the full event stream.
async function consume(groupId: string, handler: (event: any) => Promise<void>) {
  const consumer = kafka.consumer({ groupId });
  await consumer.connect();
  await consumer.subscribe({ topic: "order.events" });
  await consumer.run({
    eachMessage: async ({ message }) => handler(JSON.parse(message.value!.toString())),
  });
}

// Consumer: Inventory Service (independent)
await consume("inventory-service", async (event) => {
  if (event.type === "order.placed") await reserveInventory(event.items);
});

// Consumer: Notification Service (independent)
await consume("notification-service", async (event) => {
  if (event.type === "order.placed") await sendConfirmationEmail(event.userId, event.orderId);
});
The key architectural decision is which operations should be synchronous and which should be asynchronous. A good heuristic: if the user is waiting for the result, make it synchronous. If the user does not need immediate confirmation, make it asynchronous.
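In code, that heuristic looks like this: a sketch of a checkout endpoint using Express and the kafkajs producer from the example above (paymentService is a hypothetical synchronous dependency).

import express from "express";

declare const paymentService: { charge(body: unknown): Promise<{ orderId: string }> }; // hypothetical payment client

const app = express();
app.use(express.json());

app.post("/checkout", async (req, res) => {
  // Synchronous: the user is waiting to learn whether the payment succeeded.
  const charge = await paymentService.charge(req.body);
  // Asynchronous: email, analytics, and inventory react to this event later.
  await producer.send({
    topic: "order.events",
    messages: [{ value: JSON.stringify({ type: "order.placed", orderId: charge.orderId }) }],
  });
  res.json({ orderId: charge.orderId });
});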
4. Observability-Driven Development
At scale, you cannot debug by reading logs. You need structured observability built into the architecture from day one. The three pillars are well-known — metrics, logs, traces — but the implementation details matter:
The Observability Stack
- Structured logging: Every log line is JSON with a correlation ID that traces a request across services. Never use unstructured string logs in production.
- Distributed tracing: OpenTelemetry is now the industry standard. Every inbound request creates a trace ID that propagates through every service call, database query, and cache lookup (see the span sketch after the log example below).
- SLOs over alerts: Instead of alerting on individual metrics (CPU > 80%), define Service Level Objectives (e.g., "99.9% of checkout requests complete within 500ms") and alert when the error budget is burning too fast.
// Structured log with trace context (pino-style logger; the trace ID
// comes from the active OpenTelemetry span)
import { trace } from "@opentelemetry/api";
import pino from "pino";

const logger = pino();

logger.info({
  event: "order.processed",
  traceId: trace.getActiveSpan()?.spanContext().traceId,
  orderId: "ord_abc123",
  duration_ms: 142,
  cache_hit: true,
  shard: 7,
});
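The traceId above comes from whichever span is currently active. Opening that span with the OpenTelemetry API looks roughly like this (chargePayment is a hypothetical downstream call):

import { trace } from "@opentelemetry/api";

declare function chargePayment(orderId: string): Promise<void>; // hypothetical downstream call

const tracer = trace.getTracer("order-service");

async function processOrder(orderId: string) {
  // startActiveSpan makes this span the ambient context, so instrumented
  // HTTP, database, and cache calls inside it attach to the same trace.
  return tracer.startActiveSpan("processOrder", async (span) => {
    try {
      span.setAttribute("order.id", orderId);
      await chargePayment(orderId);
    } finally {
      span.end();
    }
  });
}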
For a complete guide to observability patterns, explore our software architecture learning path.
5. Cost-Aware Architecture
In 2026, cloud costs are a first-class architectural concern. The era of "just scale up" is over. FinOps (Financial Operations) is now a discipline that sits alongside DevOps and SRE.
Cost Optimization Patterns
- Right-size your compute: 40% of cloud instances are overprovisioned. Use auto-scaling with aggressive scale-down policies. Spot/preemptible instances can reduce compute costs by 60-70% for fault-tolerant workloads.
- Tiered storage: Move data older than 30 days to cold storage and archive data older than 90 days. This alone can cut storage costs by 50% (see the lifecycle sketch after this list).
- Cache before compute: Every cache hit is a database query you did not pay for. At scale, a well-tuned caching layer lets you run a database tier that is 5-10x smaller (and correspondingly cheaper).
- Edge computing: Run compute at CDN edges for latency-sensitive operations. This reduces origin traffic and improves user experience simultaneously.
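The tiered-storage rule translates directly into an object-store lifecycle policy. Here is a sketch using the AWS SDK for S3 (the bucket name and key prefix are hypothetical):

import { S3Client, PutBucketLifecycleConfigurationCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

await s3.send(new PutBucketLifecycleConfigurationCommand({
  Bucket: "my-app-data", // hypothetical bucket
  LifecycleConfiguration: {
    Rules: [{
      ID: "tiered-storage",
      Status: "Enabled",
      Filter: { Prefix: "events/" }, // hypothetical key prefix
      Transitions: [
        { Days: 30, StorageClass: "STANDARD_IA" },  // cold storage after 30 days
        { Days: 90, StorageClass: "DEEP_ARCHIVE" }, // archive after 90 days
      ],
    }],
  },
}));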
| Optimization | Typical Savings | Effort | Risk |
|---|---|---|---|
| Right-sizing instances | 20-40% | Low | Low |
| Spot instances for batch jobs | 60-70% | Medium | Medium |
| Tiered storage migration | 40-60% | Medium | Low |
| Caching layer optimization | 30-50% on DB costs | High | Medium |
| Reserved instances (1yr commit) | 30-40% | Low | Low (commitment risk) |
6. Putting It All Together
Here is the architecture pattern that emerges from these lessons, applicable whether you are building a fintech platform, a social network, or a SaaS product:
- Edge layer: CDN + edge functions handle static assets and cacheable API responses.
- API gateway: Rate limiting, authentication, request routing. Exposes a unified API surface.
- Service layer: Domain-specific services communicate via gRPC (synchronous) and Kafka (asynchronous).
- Data layer: Sharded PostgreSQL for transactional data, Redis for caching, object storage for blobs.
- Observability layer: OpenTelemetry traces, structured logs, SLO-based alerting across all layers.
The systems that scale are not the ones with the most clever code. They are the ones built on sound principles, rigorously observed, and continuously refined. Start with the fundamentals in our system design and cloud architecture courses, and build from there.