Trade-offs in System Design
Every system design decision involves trade-offs. There is no perfect architecture, only architectures that are optimized for specific constraints. Senior engineers distinguish themselves not by knowing more technologies, but by their ability to evaluate trade-offs systematically and make defensible decisions. This topic covers the most important trade-offs you will encounter and frameworks for evaluating them.
Consistency vs. Availability
The CAP theorem states that in the presence of a network partition, a distributed system must choose between consistency and availability. In practice, partitions are inevitable, so the real question is: when a partition occurs, do you return stale data (availability) or an error (consistency)?
| Choose Consistency (CP) | Choose Availability (AP) |
|---|---|
| Financial transactions | Social media feeds |
| Inventory management | Product catalog browsing |
| User authentication | Content recommendations |
| Leader election | Analytics dashboards |
| Distributed locks | DNS resolution |
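The partition-time choice in the table above can be made concrete with a small sketch of a read path. The `Replica`, `Mode`, and `read` names below are hypothetical, not a real API:

```typescript
type Mode = "CP" | "AP";

interface Replica {
  value: string;    // local copy, possibly stale
  isStale: boolean; // true while cut off from the leader by a partition
}

function read(replica: Replica, mode: Mode): string {
  if (replica.isStale && mode === "CP") {
    // CP: refuse to answer rather than risk serving stale data
    throw new Error("unavailable: cannot guarantee the latest value");
  }
  // AP (or no partition): answer from the local replica
  return replica.value;
}
```

During a partition, the same stale replica yields an error under CP and a (possibly stale) value under AP; when there is no partition, both modes behave identically.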
The Real-World Nuance
Most real systems are not purely CP or AP. They use different consistency levels for different operations. An e-commerce platform might use strong consistency for payment processing (CP) but eventual consistency for product reviews (AP). The key insight is that consistency is not a system-wide property; it is a per-operation decision.
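The per-operation view can be sketched as a policy map. The operation names and the `replicaFor` router below are illustrative assumptions for the e-commerce example, not a real framework:

```typescript
type Consistency = "strong" | "eventual";

// Hypothetical per-operation consistency policy: the choice is made
// per operation, not once for the whole system.
const consistencyPolicy: Record<string, Consistency> = {
  chargePayment: "strong",     // CP: never double-charge
  reserveInventory: "strong",  // CP: never oversell
  listReviews: "eventual",     // AP: stale reviews are harmless
  browseCatalog: "eventual",   // AP: availability matters most here
};

// A read router could use the policy to pick where to read from.
function replicaFor(operation: string): "primary" | "nearest-replica" {
  return consistencyPolicy[operation] === "strong"
    ? "primary"
    : "nearest-replica";
}
```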
Strong vs. Eventual Consistency
This is a spectrum, not a binary choice. Understanding the options between the two extremes is crucial.
```typescript
// Consistency spectrum from strongest to weakest
enum ConsistencyLevel {
  // Linearizability: reads always see the most recent write
  // Slowest, requires consensus (Raft/Paxos)
  LINEARIZABLE = "LINEARIZABLE",

  // Sequential consistency: operations appear in some total order
  // consistent with each client's program order
  SEQUENTIAL = "SEQUENTIAL",

  // Causal consistency: causally related operations are seen in order
  // Concurrent operations may be seen in any order
  CAUSAL = "CAUSAL",

  // Read-your-writes: a client always sees its own writes
  // Other clients may see stale data
  READ_YOUR_WRITES = "READ_YOUR_WRITES",

  // Eventual consistency: all replicas converge eventually
  // No ordering guarantees in the meantime
  EVENTUAL = "EVENTUAL",
}
```
```typescript
// Practical example: user profile updates
interface UserProfileService {
  updateProfile(userId: string, data: ProfileData): Promise<void>;

  // Strong consistency: a user always sees their own updates immediately
  // Implementation: read from the primary database
  getOwnProfile(userId: string): Promise<ProfileData>;

  // Eventual consistency: other users may see a stale profile
  // Implementation: read from the nearest replica
  getPublicProfile(userId: string): Promise<ProfileData>;
}
```

Latency vs. Throughput
Optimizing for latency (how fast a single request completes) often conflicts with optimizing for throughput (how many requests per second the system handles).
| Optimize for Latency | Optimize for Throughput |
|---|---|
| Process each request immediately | Batch requests for processing |
| Keep data in memory | Sequential disk writes (append-only) |
| Use caching aggressively | Use queues to smooth traffic spikes |
| Fewer network hops | Pipeline operations across services |
| Real-time processing | Batch / stream processing |
```typescript
// Example: database write strategies

// Optimize for latency: write-through
async function writeThrough(key: string, value: string): Promise<void> {
  // Write to the DB synchronously, return when confirmed
  await database.write(key, value); // ~5ms
  await cache.set(key, value);      // ~1ms
  // Total: ~6ms per write, but only ~160 writes/sec per connection
}
```
```typescript
// Optimize for throughput: write-behind (buffering)
class WriteBuffer {
  private buffer: Map<string, string> = new Map();
  private flushInterval = 100; // ms

  constructor() {
    // Periodically flush buffered writes to the DB in one batch
    setInterval(() => this.flush(), this.flushInterval);
  }

  async write(key: string, value: string): Promise<void> {
    this.buffer.set(key, value);
    await cache.set(key, value); // Immediate cache update
    // Return immediately: data is flushed to the DB asynchronously
    // Total: ~1ms per write, can handle thousands of writes/sec
    // Trade-off: data loss risk if the process crashes before a flush
  }

  private async flush(): Promise<void> {
    if (this.buffer.size === 0) return;
    const batch = Array.from(this.buffer.entries());
    this.buffer.clear();
    await database.batchWrite(batch); // One DB call for many writes
  }
}
```

Monolith vs. Microservices
This is perhaps the most debated trade-off in software architecture. The right answer depends heavily on team size, system complexity, and organizational structure.
| Dimension | Monolith | Microservices |
|---|---|---|
| Deployment | Single deployable unit | Independent deployment per service |
| Development speed (small team) | Faster - less overhead | Slower - infrastructure complexity |
| Development speed (large team) | Slower - merge conflicts, coordination | Faster - teams work independently |
| Debugging | Easier - single process, local stack traces | Harder - distributed tracing required |
| Scaling | Scale the entire application | Scale individual services independently |
| Data consistency | Easy - single database, ACID transactions | Hard - distributed transactions, eventual consistency |
| Technology flexibility | One tech stack | Best tool for each service |
| Best for team size | 1-20 engineers | 50+ engineers with clear domain boundaries |
The Pragmatic Middle Ground
Start with a modular monolith: a single deployable unit with well-defined internal module boundaries. When a specific module needs independent scaling or a separate team, extract it into a service. This approach gives you the simplicity of a monolith with a path to microservices when the need arises. Premature decomposition into microservices is one of the most common and expensive architectural mistakes.
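A sketch of what a modular-monolith boundary can look like in code. The `BillingModule` interface and its implementations below are hypothetical; the point is that callers depend only on the interface, so extracting the module into a service later swaps the implementation, not the callers:

```typescript
// A narrow boundary: other modules depend on this interface,
// never on billing internals.
interface BillingModule {
  charge(customerId: string, cents: number): string; // returns an invoice id
}

// Today: a direct in-process call inside the monolith.
class InProcessBilling implements BillingModule {
  charge(customerId: string, cents: number): string {
    return `inv_${customerId}_${cents}`;
  }
}

// Later: same interface, different transport (sketch only; a real
// version would call the extracted billing service over HTTP/gRPC).
class RemoteBilling implements BillingModule {
  charge(customerId: string, cents: number): string {
    return `inv_remote_${customerId}_${cents}`;
  }
}

class OrderModule {
  constructor(private billing: BillingModule) {} // depends on the boundary
  placeOrder(customerId: string): string {
    return this.billing.charge(customerId, 4999);
  }
}
```

Because `OrderModule` only knows the interface, moving billing out of process is a constructor-argument change, not a rewrite.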
SQL vs. NoSQL
| Factor | SQL (PostgreSQL, MySQL) | NoSQL (MongoDB, Cassandra, DynamoDB) |
|---|---|---|
| Data model | Rigid schema, relational | Flexible schema, document/key-value/column |
| Query flexibility | Rich queries with JOINs, aggregations | Limited queries, optimized for specific access patterns |
| Transactions | Full ACID across multiple tables | Limited (single document or partition) |
| Horizontal scaling | Possible but complex | Built-in, often automatic |
| Best for | Complex relationships, ad-hoc queries, data integrity | High write throughput, flexible schemas, known access patterns |
```typescript
// Decision guide
function chooseDatabaseType(requirements: {
  needsJoins: boolean;
  needsACID: boolean;
  schemaIsStable: boolean;
  writeVolume: "low" | "medium" | "high" | "extreme";
  queryPatterns: "varied" | "known";
  dataRelationships: "simple" | "complex";
}): string {
  // Strong signals for SQL
  if (requirements.needsACID && requirements.needsJoins) {
    return "SQL (PostgreSQL recommended)";
  }
  // Strong signals for NoSQL
  if (
    requirements.writeVolume === "extreme" &&
    requirements.queryPatterns === "known" &&
    !requirements.needsJoins
  ) {
    return "NoSQL (Cassandra for wide-column, DynamoDB for key-value)";
  }
  // Default recommendation
  if (requirements.dataRelationships === "complex") {
    return "SQL - complex relationships are hard to model in NoSQL";
  }
  return "Either works - choose based on team experience";
}
```

Push vs. Pull Architectures
| Aspect | Push (Fan-out on write) | Pull (Fan-out on read) |
|---|---|---|
| When work happens | When data is written | When data is read |
| Read latency | Very fast (pre-computed) | Slower (computed on demand) |
| Write cost | High (fan-out to all followers) | Low (just store the item) |
| Storage | Higher (duplicated data in each feed) | Lower (single copy) |
| Best for | Users with few followers, read-heavy | Celebrity users with millions of followers |
Hybrid Approach (What Twitter Uses)
Use push for regular users (fan-out their tweets to followers' feeds on write) and pull for celebrities (fetch their tweets at read time and merge). This avoids the "celebrity problem" where a single tweet from a user with 50 million followers would require 50 million write operations.
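The hybrid read path amounts to merging a precomputed (push) feed with tweets pulled for followed celebrities at read time. All names below are illustrative, not Twitter's actual API:

```typescript
interface Tweet {
  authorId: string;
  ts: number; // timestamp, higher = newer
  text: string;
}

// Hypothetical hybrid read: regular authors were fanned out at write time
// into `precomputedFeed`; celebrity tweets are fetched now and merged.
function readFeed(
  precomputedFeed: Tweet[],            // push side, built on write
  followedCelebrities: string[],       // pull side, resolved on read
  tweetsByAuthor: Map<string, Tweet[]>,
): Tweet[] {
  const pulled = followedCelebrities.flatMap(
    (id) => tweetsByAuthor.get(id) ?? [],
  );
  return [...precomputedFeed, ...pulled].sort((a, b) => b.ts - a.ts);
}
```

The celebrity's tweet is written once and merged into each follower's timeline only when that follower actually reads, so the 50-million-write fan-out never happens.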
How to Evaluate Trade-offs Systematically
When facing an architectural decision, use this structured approach to avoid gut-feel decisions and bias.
```typescript
interface ArchitecturalDecision {
  title: string;
  context: string;        // What is the situation?
  options: Option[];
  decision: string;       // What did we choose?
  rationale: string;      // Why?
  consequences: string[]; // What are the implications?
}

interface Option {
  name: string;
  pros: string[];
  cons: string[];
  score: {
    complexity: number;     // 1 (simple) to 5 (complex)
    scalability: number;    // 1 (poor) to 5 (excellent)
    reliability: number;    // 1 (poor) to 5 (excellent)
    cost: number;           // 1 (cheap) to 5 (expensive)
    teamExperience: number; // 1 (none) to 5 (expert)
  };
}
```
```typescript
// Example ADR (Architecture Decision Record)
const example: ArchitecturalDecision = {
  title: "Database for user sessions",
  context:
    "We need to store user sessions with ~1M concurrent users, " +
    "sub-10ms read latency, and automatic expiration.",
  options: [
    {
      name: "Redis",
      pros: [
        "Sub-millisecond reads",
        "Built-in TTL for expiration",
        "Team has experience",
      ],
      cons: [
        "Data loss on restart (unless persistence is enabled)",
        "Memory cost for 1M sessions",
      ],
      score: { complexity: 1, scalability: 4, reliability: 3, cost: 3, teamExperience: 5 },
    },
    {
      name: "PostgreSQL",
      pros: [
        "Durable storage",
        "Rich querying for analytics",
        "Already in our stack",
      ],
      cons: [
        "Higher latency (~5ms)",
        "Need to implement expiration (cron job or pg_cron)",
      ],
      score: { complexity: 2, scalability: 3, reliability: 5, cost: 2, teamExperience: 5 },
    },
  ],
  decision: "Redis",
  rationale:
    "Session data is ephemeral - durability is not required. " +
    "The sub-ms latency is critical for user experience. " +
    "Memory cost for 1M sessions (~500MB) is acceptable.",
  consequences: [
    "Must handle Redis failover (use Redis Sentinel or Cluster)",
    "Cannot perform complex queries on session data",
    "Need monitoring for memory usage",
  ],
};
```

Decision Framework for Architects
The STAR Framework for Technical Decisions
- S - Situation: What are the constraints? Scale, team size, timeline, budget, existing systems
- T - Trade-offs: What are you gaining and what are you giving up with each option?
- A - Action: What is the recommendation and why? Make the decision reversible if possible
- R - Review: Set a date to review the decision. Was it correct? What would you change?
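To compare scored options side by side, one possibility is a weighted sum over the five dimensions used in the ADR example. The weights and the inversion of complexity and cost (where a lower raw score is better) are illustrative assumptions, not part of any standard ADR format:

```typescript
interface Scores {
  complexity: number;     // 1 (simple) to 5 (complex)
  scalability: number;    // 1 (poor) to 5 (excellent)
  reliability: number;    // 1 (poor) to 5 (excellent)
  cost: number;           // 1 (cheap) to 5 (expensive)
  teamExperience: number; // 1 (none) to 5 (expert)
}

// Hypothetical helper: higher result = more attractive option.
// Complexity and cost are inverted (6 - s) so that "simple" and
// "cheap" contribute the most points.
function weightedScore(s: Scores, w: Scores): number {
  return (
    w.complexity * (6 - s.complexity) +
    w.scalability * s.scalability +
    w.reliability * s.reliability +
    w.cost * (6 - s.cost) +
    w.teamExperience * s.teamExperience
  );
}
```

The weights encode what the situation (the S in STAR) says matters most; a latency-critical system might weight scalability and team experience above reliability, and the arithmetic makes that priority explicit rather than implicit.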
Key Principles
- Optimize for the common case. Design for the 95th percentile of your workload, not the edge cases
- Make decisions reversible. Prefer options that can be changed later over options that lock you in
- Boring technology is good technology. Use well-understood tools unless you have a compelling reason not to
- Measure, do not assume. Benchmark before optimizing. Profile before refactoring. Load test before scaling
- Document your decisions. Future engineers (including future you) will need to understand not just what you built, but why you built it that way
- There is no best architecture. There is only the best architecture for your specific constraints, requirements, and team. Context is everything