How to Design Systems That Scale to Millions: Lessons from Real-World Architectures
Theoretical system design is easy. Building systems that actually survive millions of users is hard. This guide distills real-world architecture patterns from companies like Uber, Stripe, and Discord into actionable lessons for engineers at any level.
Every engineer studies system design. Fewer engineers have actually built systems that serve millions of concurrent users under real-world constraints — budget limits, on-call pages at 3 AM, data consistency bugs that only manifest at scale, and features that marketing promised for next quarter.
This article is not another "how to design a URL shortener" tutorial. It is a collection of hard-won lessons from real production architectures, organized into the patterns that matter most. If you are preparing for a system design interview or building your next production system, these are the principles that separate toy projects from systems that survive contact with reality.
1. Database Sharding: What Actually Works
Every scaling article mentions sharding. Few discuss the operational nightmare of maintaining it. Here is what real systems teach us:
Hash-Based Sharding
Discord shards its message storage by channel_id. Every message belongs to a channel, and channels map deterministically to database shards. This works because the access pattern is almost always "fetch messages for this channel" — the shard key aligns with the query pattern.
-- Determine shard from channel_id
-- shard_number = channel_id % total_shards
-- Each shard is a separate PostgreSQL instance.
-- With total_shards = 16:
--   Shard 0: channels 0, 16, 32, ...
--   Shard 1: channels 1, 17, 33, ...
--   Shard N: channels N, N+16, N+32, ...

-- 98234 % 16 = 10, so this query touches shard 10 and nothing else.
SELECT * FROM messages
WHERE channel_id = 98234
ORDER BY created_at DESC
LIMIT 50;
The Golden Rule of Sharding
Your shard key must match your primary access pattern. If you shard users by user_id but your most common query is "find all users in organization X," every single query becomes a scatter-gather across all shards. This is worse than no sharding at all.
| Company | Shard Key | Primary Access Pattern | Why It Works |
|---|---|---|---|
| Discord | channel_id | Messages by channel | Single-shard reads for 99% of queries |
| Stripe | merchant_id | Transactions by merchant | All merchant data co-located |
| Uber | geo_region | Rides/drivers by city | Geographic locality matches real-world access |
| Slack | workspace_id | Messages by workspace | Workspace data is self-contained |
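Here is what the golden rule looks like in code: a minimal TypeScript sketch, assuming one pg connection pool per shard (the pool setup, table names, and the org_id query are illustrative, not taken from any of the systems above).

import { Pool } from "pg";

const TOTAL_SHARDS = 16;
// One connection pool per shard; connection details are illustrative.
const shardPools = Array.from(
  { length: TOTAL_SHARDS },
  (_, i) => new Pool({ database: `app_shard_${i}` }),
);

function shardFor(channelId: number): number {
  return channelId % TOTAL_SHARDS;
}

// Aligned with the shard key: touches exactly one shard.
async function fetchMessages(channelId: number, limit = 50) {
  const pool = shardPools[shardFor(channelId)];
  const res = await pool.query(
    "SELECT * FROM messages WHERE channel_id = $1 ORDER BY created_at DESC LIMIT $2",
    [channelId, limit],
  );
  return res.rows;
}

// Misaligned: org_id is not the shard key, so every shard must be
// queried and the results merged (the scatter-gather to avoid).
async function findUsersInOrg(orgId: string) {
  const perShard = await Promise.all(
    shardPools.map((pool) => pool.query("SELECT * FROM users WHERE org_id = $1", [orgId])),
  );
  return perShard.flatMap((res) => res.rows);
}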
For a complete breakdown of sharding strategies, see our system design curriculum.
2. Caching Hierarchies: The L1/L2/CDN Pattern
Caching is not "just add Redis." Production systems use layered caching strategies where each layer serves a different purpose:
| Layer | Technology | Latency | Hit Rate Target | What It Caches |
|---|---|---|---|---|
| L1 (In-Process) | In-memory map/LRU | <1ms | 60-80% | Hot config, session data, feature flags |
| L2 (Distributed) | Redis / Memcached | 1-5ms | 85-95% | User profiles, computed results, API responses |
| L3 (CDN Edge) | CloudFront / Cloudflare | 5-50ms | 95-99% | Static assets, rendered pages, API responses |
| Origin | Database | 10-100ms | N/A | Source of truth |
Cache Invalidation Strategy
The hardest problem in caching is invalidation. Here are three patterns that work at scale:
- TTL-based expiry: Set a time-to-live and accept stale data within that window. Simple and effective for data that does not need to be real-time (product catalogs, user profiles).
- Write-through: On every write, update the cache simultaneously. Used when read-after-write consistency is critical (shopping carts, account balances).
- Event-driven invalidation: Publish a cache invalidation event to a message queue when data changes. All services that cache that data subscribe and purge their local copies. This is what Uber uses for driver location data; a sketch of the pattern follows the code example below.
// Layered caching example with write-through and TTL (using ioredis)
import Redis from "ioredis";

class CacheHierarchy {
  // L1 entries expire after at most 30s so all nodes converge quickly
  private l1: Map<string, { data: unknown; expiry: number }> = new Map();

  constructor(private redis: Redis) {}

  async get(key: string): Promise<unknown> {
    // L1: in-process check
    const l1Entry = this.l1.get(key);
    if (l1Entry && l1Entry.expiry > Date.now()) {
      return l1Entry.data;
    }
    // L2: Redis check
    const l2Data = await this.redis.get(key);
    if (l2Data !== null) {
      const data = JSON.parse(l2Data);
      // Backfill L1 so the next read skips the network hop
      this.l1.set(key, { data, expiry: Date.now() + 30_000 });
      return data;
    }
    return null; // Cache miss: caller fetches from DB
  }

  async set(key: string, data: unknown, ttlMs: number): Promise<void> {
    // Write-through: update both layers
    this.l1.set(key, { data, expiry: Date.now() + Math.min(ttlMs, 30_000) });
    await this.redis.set(key, JSON.stringify(data), "PX", ttlMs);
  }
}
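The third pattern, event-driven invalidation, deserves its own sketch. Here is a minimal version using Redis pub/sub (a durable log such as Kafka fills the same role at scale, which is closer to what the Uber example implies); writeDriverLocation and localCache are hypothetical stand-ins for the persistence layer and the L1 map above.

import Redis from "ioredis";

declare function writeDriverLocation(driverId: string, lat: number, lng: number): Promise<void>; // hypothetical DB write

// Writer side: persist the change, then broadcast an invalidation event.
const publisher = new Redis();

async function updateDriverLocation(driverId: string, lat: number, lng: number) {
  await writeDriverLocation(driverId, lat, lng);
  await publisher.publish("cache.invalidate", `driver:${driverId}`);
}

// Reader side: every service instance subscribes and purges its local copy.
const localCache = new Map<string, unknown>(); // stands in for the L1 layer
const subscriber = new Redis(); // pub/sub needs a dedicated connection

await subscriber.subscribe("cache.invalidate");
subscriber.on("message", (_channel, key) => {
  localCache.delete(key);
});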
3. Event-Driven vs. Request-Driven at Scale
At low scale, request-driven (synchronous REST/gRPC) architecture works fine. At high scale, event-driven architecture becomes essential for three reasons:
- Decoupling: Services do not need to know about each other. The payment service emits a "payment.completed" event; the notification service, analytics service, and inventory service each consume it independently.
- Resilience: If the notification service goes down, events queue up in Kafka. When it recovers, it processes the backlog. In a synchronous system, the payment service would either fail or need complex retry logic.
- Throughput: Kafka partitions allow parallel processing. Stripe processes millions of webhook events per day using partitioned event streams.
// Event-driven order processing (kafkajs; reserveInventory and
// sendConfirmationEmail are application-level helpers)
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "shop", brokers: ["localhost:9092"] });

// Producer: Order Service
const producer = kafka.producer();
await producer.connect();
await producer.send({
  topic: "order.events",
  messages: [{
    key: "ord_abc123", // keying by order keeps one order's events in order
    value: JSON.stringify({
      type: "order.placed",
      orderId: "ord_abc123",
      userId: "usr_xyz789",
      items: [{ sku: "WIDGET-01", quantity: 2, price: 29.99 }],
      timestamp: new Date().toISOString(),
    }),
  }],
});

// Each service runs in its own consumer group, so every service
// independently receives the full event stream.
async function consume(groupId: string, handler: (event: any) => Promise<void>) {
  const consumer = kafka.consumer({ groupId });
  await consumer.connect();
  await consumer.subscribe({ topic: "order.events" });
  await consumer.run({
    eachMessage: async ({ message }) => handler(JSON.parse(message.value!.toString())),
  });
}

// Consumer: Inventory Service (independent)
await consume("inventory-service", async (event) => {
  if (event.type === "order.placed") await reserveInventory(event.items);
});

// Consumer: Notification Service (independent)
await consume("notification-service", async (event) => {
  if (event.type === "order.placed") await sendConfirmationEmail(event.userId, event.orderId);
});
The key architectural decision is which operations should be synchronous and which should be asynchronous. A good heuristic: if the user is waiting for the result, make it synchronous. If the user does not need immediate confirmation, make it asynchronous.
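In code, that heuristic looks like this: a sketch of a checkout endpoint using Express and the kafkajs producer from the example above (paymentService is a hypothetical synchronous dependency).

import express from "express";

declare const paymentService: { charge(body: unknown): Promise<{ orderId: string }> }; // hypothetical payment client

const app = express();
app.use(express.json());

app.post("/checkout", async (req, res) => {
  // Synchronous: the user is waiting to learn whether the payment succeeded.
  const charge = await paymentService.charge(req.body);
  // Asynchronous: email, analytics, and inventory react to this event later.
  await producer.send({
    topic: "order.events",
    messages: [{ value: JSON.stringify({ type: "order.placed", orderId: charge.orderId }) }],
  });
  res.json({ orderId: charge.orderId });
});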
4. Observability-Driven Development
At scale, you cannot debug by reading logs. You need structured observability built into the architecture from day one. The three pillars are well-known — metrics, logs, traces — but the implementation details matter:
The Observability Stack
- Structured logging: Every log line is JSON with a correlation ID that traces a request across services. Never use unstructured string logs in production.
- Distributed tracing: OpenTelemetry is now the industry standard. Every inbound request creates a trace ID that propagates through every service call, database query, and cache lookup (see the span sketch after the log example below).
- SLOs over alerts: Instead of alerting on individual metrics (CPU > 80%), define Service Level Objectives (e.g., "99.9% of checkout requests complete within 500ms") and alert when the error budget is burning too fast.
// Structured log with trace context (pino-style logger; the trace ID
// comes from the active OpenTelemetry span)
import { trace } from "@opentelemetry/api";
import pino from "pino";

const logger = pino();

logger.info({
  event: "order.processed",
  traceId: trace.getActiveSpan()?.spanContext().traceId,
  orderId: "ord_abc123",
  duration_ms: 142,
  cache_hit: true,
  shard: 7,
});
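The traceId above comes from whichever span is currently active. Opening that span with the OpenTelemetry API looks roughly like this (chargePayment is a hypothetical downstream call):

import { trace } from "@opentelemetry/api";

declare function chargePayment(orderId: string): Promise<void>; // hypothetical downstream call

const tracer = trace.getTracer("order-service");

async function processOrder(orderId: string) {
  // startActiveSpan makes this span the ambient context, so instrumented
  // HTTP, database, and cache calls inside it attach to the same trace.
  return tracer.startActiveSpan("processOrder", async (span) => {
    try {
      span.setAttribute("order.id", orderId);
      await chargePayment(orderId);
    } finally {
      span.end();
    }
  });
}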
For a complete guide to observability patterns, explore our software architecture learning path.
5. Cost-Aware Architecture
In 2026, cloud costs are a first-class architectural concern. The era of "just scale up" is over. FinOps (Financial Operations) is now a discipline that sits alongside DevOps and SRE.
Cost Optimization Patterns
- Right-size your compute: 40% of cloud instances are overprovisioned. Use auto-scaling with aggressive scale-down policies. Spot/preemptible instances can reduce compute costs by 60-70% for fault-tolerant workloads.
- Tiered storage: Move data older than 30 days to cold storage and archive data older than 90 days. This alone can cut storage costs by 50% (see the lifecycle sketch after this list).
- Cache before compute: Every cache hit is a database query you did not pay for. At scale, a well-tuned caching layer lets you run a database tier that is 5-10x smaller (and correspondingly cheaper).
- Edge computing: Run compute at CDN edges for latency-sensitive operations. This reduces origin traffic and improves user experience simultaneously.
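The tiered-storage rule translates directly into an object-store lifecycle policy. Here is a sketch using the AWS SDK for S3 (the bucket name and key prefix are hypothetical):

import { S3Client, PutBucketLifecycleConfigurationCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

await s3.send(new PutBucketLifecycleConfigurationCommand({
  Bucket: "my-app-data", // hypothetical bucket
  LifecycleConfiguration: {
    Rules: [{
      ID: "tiered-storage",
      Status: "Enabled",
      Filter: { Prefix: "events/" }, // hypothetical key prefix
      Transitions: [
        { Days: 30, StorageClass: "STANDARD_IA" },  // cold storage after 30 days
        { Days: 90, StorageClass: "DEEP_ARCHIVE" }, // archive after 90 days
      ],
    }],
  },
}));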
| Optimization | Typical Savings | Effort | Risk |
|---|---|---|---|
| Right-sizing instances | 20-40% | Low | Low |
| Spot instances for batch jobs | 60-70% | Medium | Medium |
| Tiered storage migration | 40-60% | Medium | Low |
| Caching layer optimization | 30-50% on DB costs | High | Medium |
| Reserved instances (1yr commit) | 30-40% | Low | Low (commitment risk) |
6. Putting It All Together
Here is the architecture pattern that emerges from these lessons, applicable whether you are building a fintech platform, a social network, or a SaaS product:
- Edge layer: CDN + edge functions handle static assets and cacheable API responses.
- API gateway: Rate limiting, authentication, request routing. Exposes a unified API surface.
- Service layer: Domain-specific services communicate via gRPC (synchronous) and Kafka (asynchronous).
- Data layer: Sharded PostgreSQL for transactional data, Redis for caching, object storage for blobs.
- Observability layer: OpenTelemetry traces, structured logs, SLO-based alerting across all layers.
The systems that scale are not the ones with the most clever code. They are the ones built on sound principles, rigorously observed, and continuously refined. Start with the fundamentals in our system design and cloud architecture courses, and build from there.