What Is System Design?
System design is the process of defining the architecture, components, modules, interfaces, and data flows of a system to satisfy specified requirements. It sits at the intersection of software engineering and infrastructure engineering, requiring you to think about how individual pieces of software work together at scale to deliver a reliable, performant product.
Whether you are building a URL shortener that handles a few thousand requests per day or a social media feed that serves billions of users, system design thinking helps you make informed trade-off decisions that keep your system running smoothly as demand grows.
Why System Design Matters
- Scale: Modern applications serve millions of users across the globe. Without proper design, a system will collapse under load.
- Reliability: Downtime costs companies millions. Good design anticipates failures and handles them gracefully.
- Maintainability: Systems outlive the engineers who build them. Clear design makes future development faster and safer.
- Cost Efficiency: Poor design wastes compute, storage, and bandwidth. Smart architecture saves real money.
- Career Growth: System design skills are tested in senior engineering interviews and are essential for tech leads and architects.
Core Concepts of System Design
Before diving into specific patterns and technologies, you need a solid grasp of the foundational properties every distributed system should strive for. These are the pillars against which every design decision is evaluated.
1. Scalability
Scalability is the ability of a system to handle a growing amount of work by adding resources. A scalable system can accommodate increased demand without significant changes to its architecture.
- Vertical Scaling (Scale Up): Adding more CPU, RAM, or storage to an existing machine. Simple but limited by hardware ceilings.
- Horizontal Scaling (Scale Out): Adding more machines to distribute the load. More complex but virtually unlimited and more fault tolerant.
The key insight is that horizontal scaling is almost always preferred for large-scale systems because no single machine, no matter how powerful, can handle the traffic of services like Netflix or Google.
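As a minimal sketch of how horizontal scaling distributes work, here is hash-based partitioning in Python; the server names are hypothetical. Note that naive modulo hashing remaps most keys when the server count changes, which is why consistent hashing is the standard refinement at scale.
import hashlib

servers = ["app-1", "app-2", "app-3"]  # hypothetical server pool

def pick_server(key: str) -> str:
    # Hash the key so the same key always routes to the same server,
    # while different keys spread roughly evenly across the pool.
    digest = hashlib.md5(key.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

print(pick_server("user:42"))  # deterministic choice among the three servers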
2. Reliability
A reliable system continues to function correctly even when components fail. In distributed systems, failures are not exceptions — they are the norm. Hard drives fail, network packets get lost, and entire data centers can go offline. A reliable system is designed to tolerate these faults without losing data or becoming unavailable.
Techniques for improving reliability include redundancy (running multiple copies of services), replication (keeping copies of data in multiple locations), and fault isolation (ensuring a failure in one component does not cascade to others).
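A minimal sketch of redundancy in code, assuming each replica client exposes a hypothetical fetch() that raises ConnectionError when its node is down:
def read_with_failover(key, replicas):
    # Try each replica in turn so one failed node does not fail the read.
    last_error = None
    for replica in replicas:
        try:
            return replica.fetch(key)  # hypothetical per-replica read API
        except ConnectionError as err:
            last_error = err           # this replica is down; try the next
    if last_error is None:
        raise RuntimeError("no replicas configured")
    raise last_error                   # every replica failed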
3. Availability
Availability measures the proportion of time a system is operational and accessible. It is typically expressed as a percentage, often referred to as "nines" of availability.
Availability Levels
| Availability | Downtime per Year | Downtime per Month |
|---|---|---|
| 99% (two nines) | 3.65 days | 7.31 hours |
| 99.9% (three nines) | 8.77 hours | 43.83 minutes |
| 99.99% (four nines) | 52.60 minutes | 4.38 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.30 seconds |
Moving from three nines to four nines is dramatically harder and more expensive than going from two nines to three. Each additional nine typically requires significant engineering investment in redundancy, monitoring, and automated failover.
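The figures in the table fall out of simple arithmetic. Here is the calculation in Python, using the 365.25-day year the table assumes:
def downtime_hours_per_year(availability: float) -> float:
    # Fraction of the year the system is allowed to be down.
    return (1 - availability) * 365.25 * 24

for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target}: {downtime_hours_per_year(target):.2f} hours/year")
# 0.999 -> 8.77 hours/year, matching the three-nines row above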
4. Maintainability
Maintainability encompasses how easy it is to fix bugs, add new features, and operate the system day to day. A maintainable system has clean abstractions, good documentation, comprehensive monitoring, and automated deployment pipelines.
The three aspects of maintainability are operability (making it easy for operations teams to keep the system running), simplicity (removing unnecessary complexity so new engineers can understand the system), and evolvability (making it easy to change the system as requirements evolve).
5. Latency and Throughput
Latency is the time it takes for a single request to travel from the client to the server and back. Throughput is the number of requests or operations a system can handle per unit of time. The two often trade off against each other: techniques that raise throughput, such as batching, tend to add per-request latency, and pushing a system close to its throughput limit makes queueing delays, and therefore latency, climb.
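One way to make the relationship concrete is Little's Law, which says the average number of requests in flight equals throughput multiplied by latency. A back-of-the-envelope example (the worker count and latency here are illustrative):
workers = 10                           # requests the server can hold in flight
latency_s = 0.100                      # 100 ms per request
max_throughput = workers / latency_s   # Little's Law rearranged
print(f"{max_throughput:.0f} req/sec") # 100 req/sec ceiling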
Latency Numbers Every Engineer Should Know
- L1 cache reference: 0.5 ns
- L2 cache reference: 7 ns
- Main memory reference: 100 ns
- SSD random read: 150 µs
- HDD seek: 10 ms
- Round trip within same datacenter: 0.5 ms
- Round trip California to Netherlands: 150 ms
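A quick worked comparison shows why these orders of magnitude matter: a chatty service that makes a few sequential network calls spends vastly more time on the wire than on computation.
memory_ref_ns = 100                     # main memory reference, from the list
datacenter_rtt_ns = 500_000             # 0.5 ms round trip, from the list
chatty_call_ns = 3 * datacenter_rtt_ns  # three sequential in-datacenter calls
print(chatty_call_ns / memory_ref_ns)   # 15000.0: 1.5 ms vs one memory read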
System Design Interview Framework
System design interviews evaluate your ability to design large-scale systems under ambiguity. Having a structured framework helps you stay organized and cover all critical aspects within the time limit (typically 35-45 minutes).
Step 1: Clarify Requirements (5 minutes)
Never jump straight into the solution. Start by asking questions to understand the scope. There are two types of requirements to identify:
- Functional Requirements: What should the system do? What are the core features? For example, for a URL shortener: create short URLs, redirect to original URLs, custom aliases, analytics.
- Non-Functional Requirements: How should the system perform? This includes latency targets, availability requirements, consistency needs, and expected scale.
Step 2: Estimate Scale (3 minutes)
Back-of-the-envelope calculations help you understand the order of magnitude you are dealing with. Key numbers to estimate:
- Daily active users (DAU)
- Read-to-write ratio
- Queries per second (QPS)
- Storage requirements over time
- Bandwidth needs
# Example: estimating scale for a URL shortener.
# Assumptions: 100 million URLs created per month, a 100:1 read:write ratio,
# roughly 500 bytes per URL record, and five years of retention.
urls_per_month = 100_000_000
seconds_per_month = 30 * 24 * 3600                  # ≈ 2.6 million seconds
writes_per_sec = urls_per_month / seconds_per_month # ≈ 40 URLs/sec
reads_per_sec = writes_per_sec * 100                # ≈ 4,000 redirects/sec
bytes_per_record = 500
storage_bytes = urls_per_month * 12 * 5 * bytes_per_record  # 3e12 bytes = 3 TB
print(f"{writes_per_sec:.0f} writes/sec, {reads_per_sec:.0f} reads/sec, "
      f"{storage_bytes / 1e12:.0f} TB over five years")
Step 3: High-Level Design (10 minutes)
Sketch the major components on a whiteboard or diagram. Start with clients, load balancers, application servers, databases, and caches. Draw the data flow between them. This gives your interviewer a bird's eye view of your architecture before you dive into details.
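For illustration, a first-pass sketch for the URL shortener from Step 2 might look like this (simplified; arrows show the request path):
Clients --> Load Balancer --> App Servers --> Cache (hot redirects)
                                          \-> Database (source of truth)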
Step 4: Deep Dive (15 minutes)
The interviewer will ask you to zoom into specific components. This is where your knowledge of databases, caching, message queues, and other infrastructure comes in. Be prepared to discuss the points below (a concrete sketch follows the list):
- Database schema and choice of SQL vs NoSQL
- API design (REST, GraphQL, gRPC)
- Data partitioning and replication strategies
- Caching layers and invalidation
- How the system handles failures
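For example, a classic deep-dive detail for the URL shortener is how the short code itself is generated. A minimal sketch, assuming codes come from base62-encoding an auto-incrementing database ID:
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n: int) -> str:
    # Repeatedly take the remainder mod 62 to produce digits, least
    # significant first, then reverse to get the final code.
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

print(encode_base62(125_000_000))  # "8sud2"; seven chars cover ~3.5 trillion IDs
Base62 keeps codes short and URL-safe; the trade-off is that sequential IDs are guessable, which matters if short links should be unlisted.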
Step 5: Wrap Up (5 minutes)
Summarize your design, discuss trade-offs you made, identify potential bottlenecks, and suggest future improvements. This demonstrates maturity and awareness that no design is perfect.
Key Metrics and SLAs
Service Level Agreements (SLAs) define the expected behavior of your system. They are contracts between you and your users. Four related terms matter here:
- SLI (Service Level Indicator): A quantitative measure of service, such as request latency, error rate, or throughput.
- SLO (Service Level Objective): A target value for an SLI, such as "99th percentile latency under 200ms."
- SLA (Service Level Agreement): A formal contract with consequences (usually financial) if SLOs are not met.
- Error Budget: The amount of allowable downtime or errors before an SLO is violated. Teams use error budgets to balance reliability with feature velocity.
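As a quick worked example of the error-budget idea (the SLO and the downtime figure are hypothetical):
slo = 0.999                            # 99.9% monthly availability target
period_min = 30 * 24 * 60              # 43,200 minutes in a 30-day month
budget_min = (1 - slo) * period_min    # 43.2 minutes of tolerable downtime
spent_min = 12.0                       # hypothetical downtime so far this month
print(f"budget {budget_min:.1f} min, remaining {budget_min - spent_min:.1f} min")
When the remaining budget approaches zero, teams typically pause risky launches and spend the time on reliability work instead.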
How to Approach Any System Design Problem
Beyond the interview framework, here is a mental model for approaching real-world design challenges:
- Start simple. Begin with a single-server architecture that handles everything. This is your baseline.
- Identify bottlenecks. As you scale, find the component that will fail first — usually the database or a single point of failure.
- Apply known patterns. Use load balancers, caches, message queues, CDNs, and database replication to address specific bottlenecks (a cache-aside sketch follows this list).
- Think about failure modes. What happens if a server crashes? If the database goes down? If a network partition occurs?
- Consider operational complexity. A system that requires a PhD to operate is not a good system. Simpler architectures are easier to debug and maintain.
- Make trade-offs explicit. Every decision has pros and cons. Document them and be prepared to justify your choices.
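To make the "apply known patterns" step concrete, here is a minimal cache-aside sketch; db_read and db_write stand in for hypothetical database calls, and a plain dict stands in for the cache:
cache: dict[str, str] = {}

def db_read(key: str) -> str:           # hypothetical database read
    return f"value-for-{key}"

def db_write(key: str, value: str) -> None:  # hypothetical database write
    pass

def get(key: str) -> str:
    if key in cache:                    # hit: serve from memory
        return cache[key]
    value = db_read(key)                # miss: fall through to the database
    cache[key] = value                  # populate for subsequent reads
    return value

def put(key: str, value: str) -> None:
    db_write(key, value)                # write to the source of truth first
    cache.pop(key, None)                # then invalidate the stale cached copy
Deleting the cached entry on writes, rather than updating it in place, avoids one class of stale-data bugs, though concurrent readers can still race; that is exactly the "caching layers and invalidation" discussion a deep dive probes.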
Common Mistakes in System Design
- Over-engineering: Designing for Google-scale when you have 100 users. Start with what you need today and plan for what you might need tomorrow.
- Ignoring requirements: Jumping into the solution before understanding the problem leads to building the wrong thing.
- Single point of failure: If the failure of any single component can take the entire system offline, your design is fragile.
- Neglecting monitoring: If you cannot observe your system, you cannot fix it when things go wrong.
- Forgetting about data: Systems live and die by their data. How data is stored, replicated, partitioned, and accessed determines system behavior.