TechLead
Lesson 30 of 30
System Design

Trade-offs in System Design

Master trade-off analysis in system design covering consistency vs availability, SQL vs NoSQL, monolith vs microservices, and decision frameworks.

Every system design decision involves trade-offs. There is no perfect architecture, only architectures that are optimized for specific constraints. Senior engineers distinguish themselves not by knowing more technologies, but by their ability to evaluate trade-offs systematically and make defensible decisions. This topic covers the most important trade-offs you will encounter and frameworks for evaluating them.

Consistency vs. Availability

The CAP theorem states that in the presence of a network partition, a distributed system must choose between consistency and availability. In practice, partitions are inevitable, so the real question is: when a partition occurs, do you return stale data (availability) or an error (consistency)?

| Choose Consistency (CP) | Choose Availability (AP) |
| --- | --- |
| Financial transactions | Social media feeds |
| Inventory management | Product catalog browsing |
| User authentication | Content recommendations |
| Leader election | Analytics dashboards |
| Distributed locks | DNS resolution |
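For stores with tunable consistency (Cassandra- and DynamoDB-style quorum systems), the CP/AP lean is a numeric knob: with N replicas, a read quorum R and a write quorum W overlap on at least one replica whenever R + W > N, which guarantees that reads observe the latest acknowledged write. A minimal sketch of that condition (the helper name is ours, not a real API):

```typescript
// With N replicas, a read quorum R and write quorum W overlap on at least
// one replica whenever R + W > N, so reads see the latest acknowledged
// write (CP-leaning). Smaller quorums favor availability and latency
// (AP-leaning) at the cost of possibly stale reads.
function isStronglyConsistent(n: number, r: number, w: number): boolean {
  return r + w > n;
}

// N = 3 replicas:
isStronglyConsistent(3, 2, 2); // QUORUM reads + QUORUM writes: strong
isStronglyConsistent(3, 1, 1); // ONE + ONE: fast, but reads may be stale
```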

The Real-World Nuance

Most real systems are not purely CP or AP. They use different consistency levels for different operations. An e-commerce platform might use strong consistency for payment processing (CP) but eventual consistency for product reviews (AP). The key insight is that consistency is not a system-wide property; it is a per-operation decision.
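That per-operation split can be made explicit in code. A minimal sketch, assuming a simple routing table (the operation names and the `readTarget` helper are illustrative, not a real framework API):

```typescript
// Hypothetical per-operation consistency routing for an e-commerce platform.
type Consistency = "strong" | "eventual";

const operationConsistency: Record<string, Consistency> = {
  chargePayment: "strong",          // CP: never double-charge
  reserveInventory: "strong",       // CP: never oversell
  submitReview: "eventual",         // AP: a delayed review is harmless
  fetchRecommendations: "eventual", // AP: staleness is acceptable
};

// Route reads accordingly: strong -> primary, eventual -> nearest replica.
function readTarget(operation: string): "primary" | "replica" {
  return operationConsistency[operation] === "strong" ? "primary" : "replica";
}
```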

Strong vs. Eventual Consistency

This is a spectrum, not a binary choice. Understanding the options between the two extremes is crucial.

// Consistency spectrum from strongest to weakest

enum ConsistencyLevel {
  // Linearizability: reads always see the most recent write
  // Slowest, requires consensus (Raft/Paxos)
  LINEARIZABLE = "LINEARIZABLE",

  // Sequential consistency: operations appear in some total order
  // consistent with each client's program order
  SEQUENTIAL = "SEQUENTIAL",

  // Causal consistency: causally related operations are seen in order
  // Concurrent operations may be seen in any order
  CAUSAL = "CAUSAL",

  // Read-your-writes: a client always sees its own writes
  // Other clients may see stale data
  READ_YOUR_WRITES = "READ_YOUR_WRITES",

  // Eventual consistency: all replicas converge eventually
  // No ordering guarantees in the meantime
  EVENTUAL = "EVENTUAL",
}

// Practical example: user profile update
interface UserProfileService {
  // Strong consistency: user always sees their own updates immediately
  // Implementation: read from primary database
  updateProfile(userId: string, data: ProfileData): Promise<void>;
  getOwnProfile(userId: string): Promise<ProfileData>; // Read from primary

  // Eventual consistency: other users may see stale profile
  // Implementation: read from nearest replica
  getPublicProfile(userId: string): Promise<ProfileData>; // Read from replica
}

Latency vs. Throughput

Optimizing for latency (how fast a single request completes) often conflicts with optimizing for throughput (how many requests per second the system handles).

| Optimize for Latency | Optimize for Throughput |
| --- | --- |
| Process each request immediately | Batch requests for processing |
| Keep data in memory | Sequential disk writes (append-only) |
| Use caching aggressively | Use queues to smooth traffic spikes |
| Fewer network hops | Pipeline operations across services |
| Real-time processing | Batch / stream processing |

// Example: database write strategies
// (assumes `database` and `cache` clients with write/set/batchWrite methods are in scope)

// Write-through: write to the DB synchronously, return when confirmed.
// Durable and simple, but every write pays full DB latency, capping
// per-connection throughput.
async function writeThrough(key: string, value: string): Promise<void> {
  await database.write(key, value); // ~5ms
  await cache.set(key, value);      // ~1ms
  // Total: ~6ms per write, so only ~160 writes/sec per connection
}

// Write-behind (buffering): acknowledge immediately, flush to the DB in batches.
// Much higher throughput and lower write latency, at the cost of a data-loss window.
class WriteBuffer {
  private buffer: Map<string, string> = new Map();
  private flushIntervalMs = 100;

  constructor() {
    // Periodically flush buffered writes to the database in one batch.
    setInterval(() => void this.flush(), this.flushIntervalMs);
  }

  async write(key: string, value: string): Promise<void> {
    this.buffer.set(key, value);
    await cache.set(key, value); // Immediate cache update
    // Returns in ~1ms; the DB write happens asynchronously, so the buffer
    // can absorb thousands of writes/sec.
    // Trade-off: writes buffered since the last flush are lost if the
    // process crashes.
  }

  private async flush(): Promise<void> {
    if (this.buffer.size === 0) return;
    const batch = Array.from(this.buffer.entries());
    this.buffer.clear();
    await database.batchWrite(batch); // One DB call for many buffered writes
  }
}

Monolith vs. Microservices

This is perhaps the most debated trade-off in software architecture. The right answer depends heavily on team size, system complexity, and organizational structure.

| Dimension | Monolith | Microservices |
| --- | --- | --- |
| Deployment | Single deployable unit | Independent deployment per service |
| Development speed (small team) | Faster - less overhead | Slower - infrastructure complexity |
| Development speed (large team) | Slower - merge conflicts, coordination | Faster - teams work independently |
| Debugging | Easier - single process, local stack traces | Harder - distributed tracing required |
| Scaling | Scale the entire application | Scale individual services independently |
| Data consistency | Easy - single database, ACID transactions | Hard - distributed transactions, eventual consistency |
| Technology flexibility | One tech stack | Best tool for each service |
| Best for team size | 1-20 engineers | 50+ engineers with clear domain boundaries |

The Pragmatic Middle Ground

Start with a modular monolith: a single deployable unit with well-defined internal module boundaries. When a specific module needs independent scaling or a separate team, extract it into a service. This approach gives you the simplicity of a monolith with a path to microservices when the need arises. Premature decomposition into microservices is one of the most common and expensive architectural mistakes.
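One way to keep those module boundaries honest is to code against explicit interfaces from day one, so extraction later only swaps the implementation. A hypothetical sketch (all names are illustrative):

```typescript
// Hypothetical module boundary inside a modular monolith.
// Other modules depend only on this interface, never on internals, so
// BillingModule can later be extracted into a separate service by
// replacing the in-process implementation with an HTTP/gRPC client.
interface BillingModule {
  charge(customerId: string, amountCents: number): Promise<string>; // returns invoice id
}

class InProcessBilling implements BillingModule {
  async charge(customerId: string, amountCents: number): Promise<string> {
    // A direct function call today; a network call after extraction,
    // with no change required in calling modules.
    return `inv_${customerId}_${amountCents}`;
  }
}
```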

SQL vs. NoSQL

| Factor | SQL (PostgreSQL, MySQL) | NoSQL (MongoDB, Cassandra, DynamoDB) |
| --- | --- | --- |
| Data model | Rigid schema, relational | Flexible schema, document/key-value/column |
| Query flexibility | Rich queries with JOINs, aggregations | Limited queries, optimized for specific access patterns |
| Transactions | Full ACID across multiple tables | Limited (single document or partition) |
| Horizontal scaling | Possible but complex | Built-in, often automatic |
| Best for | Complex relationships, ad-hoc queries, data integrity | High write throughput, flexible schemas, known access patterns |

// Decision guide
function chooseDatabaseType(requirements: {
  needsJoins: boolean;
  needsACID: boolean;
  schemaIsStable: boolean;
  writeVolume: "low" | "medium" | "high" | "extreme";
  queryPatterns: "varied" | "known";
  dataRelationships: "simple" | "complex";
}): string {
  // Strong signals for SQL
  if (requirements.needsACID && requirements.needsJoins) {
    return "SQL (PostgreSQL recommended)";
  }

  // Strong signals for NoSQL
  if (requirements.writeVolume === "extreme" &&
      requirements.queryPatterns === "known" &&
      !requirements.needsJoins) {
    return "NoSQL (Cassandra for wide-column, DynamoDB for key-value)";
  }

  if (!requirements.schemaIsStable && !requirements.needsJoins) {
    return "NoSQL (document store such as MongoDB) - the schema can evolve freely";
  }

  // Default recommendation
  if (requirements.dataRelationships === "complex") {
    return "SQL - complex relationships are hard to model in NoSQL";
  }

  return "Either works - choose based on team experience";
}

Push vs. Pull Architectures

| Aspect | Push (fan-out on write) | Pull (fan-out on read) |
| --- | --- | --- |
| When work happens | When data is written | When data is read |
| Read latency | Very fast (pre-computed) | Slower (computed on demand) |
| Write cost | High (fan-out to all followers) | Low (just store the item) |
| Storage | Higher (duplicated data in each feed) | Lower (single copy) |
| Best for | Users with few followers, read-heavy workloads | Celebrity users with millions of followers |

Hybrid Approach (What Twitter Uses)

Use push for regular users (fan-out their tweets to followers' feeds on write) and pull for celebrities (fetch their tweets at read time and merge). This avoids the "celebrity problem" where a single tweet from a user with 50 million followers would require 50 million write operations.
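The hybrid read path can be sketched as follows. `feedStore`, `tweetStore`, the follower threshold, and all other names here are assumptions for illustration, not Twitter's actual API:

```typescript
// Hypothetical hybrid feed read: pre-computed feed entries (push side)
// merged with celebrity tweets fetched at read time (pull side).
interface Tweet { id: string; authorId: string; timestamp: number }

const CELEBRITY_THRESHOLD = 1_000_000; // followers above which we stop fanning out

async function getFeed(
  userId: string,
  feedStore: { getPrecomputed(userId: string): Promise<Tweet[]> },
  tweetStore: { recentBy(authorIds: string[]): Promise<Tweet[]> },
  celebritiesFollowedBy: (userId: string) => Promise<string[]>,
): Promise<Tweet[]> {
  // Push side: tweets already fanned out to this user's feed on write.
  const precomputed = await feedStore.getPrecomputed(userId);
  // Pull side: celebrity tweets are fetched on read to avoid huge fan-out.
  const celebrities = await celebritiesFollowedBy(userId);
  const pulled = await tweetStore.recentBy(celebrities);
  // Merge both sources by recency.
  return [...precomputed, ...pulled].sort((a, b) => b.timestamp - a.timestamp);
}
```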

How to Evaluate Trade-offs Systematically

When facing an architectural decision, use this structured approach to avoid gut-feel decisions and bias.

interface ArchitecturalDecision {
  title: string;
  context: string;          // What is the situation?
  options: Option[];
  decision: string;         // What did we choose?
  rationale: string;        // Why?
  consequences: string[];   // What are the implications?
}

interface Option {
  name: string;
  pros: string[];
  cons: string[];
  score: {
    complexity: number;     // 1 (simple) to 5 (complex)
    scalability: number;    // 1 (poor) to 5 (excellent)
    reliability: number;    // 1 (poor) to 5 (excellent)
    cost: number;           // 1 (cheap) to 5 (expensive)
    teamExperience: number; // 1 (none) to 5 (expert)
  };
}

// Example ADR (Architecture Decision Record)
const example: ArchitecturalDecision = {
  title: "Database for user sessions",
  context: "We need to store user sessions with ~1M concurrent users, " +
           "sub-10ms read latency, and automatic expiration.",
  options: [
    {
      name: "Redis",
      pros: [
        "Sub-millisecond reads",
        "Built-in TTL for expiration",
        "Team has experience",
      ],
      cons: [
        "Data loss on restart (unless persistence enabled)",
        "Memory cost for 1M sessions",
      ],
      score: { complexity: 1, scalability: 4, reliability: 3, cost: 3, teamExperience: 5 },
    },
    {
      name: "PostgreSQL",
      pros: [
        "Durable storage",
        "Rich querying for analytics",
        "Already in our stack",
      ],
      cons: [
        "Higher latency (~5ms)",
        "Need to implement expiration (cron job or pg_cron)",
      ],
      score: { complexity: 2, scalability: 3, reliability: 5, cost: 2, teamExperience: 5 },
    },
  ],
  decision: "Redis",
  rationale: "Session data is ephemeral - durability is not required. " +
             "The sub-ms latency is critical for user experience. " +
             "Memory cost for 1M sessions (~500MB) is acceptable.",
  consequences: [
    "Must handle Redis failover (use Redis Sentinel or Cluster)",
    "Cannot perform complex queries on session data",
    "Need monitoring for memory usage",
  ],
};
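The per-dimension scores in an ADR like this can be compared with a simple weighted sum. A hypothetical helper, where the weights encode which dimensions matter for this particular decision (complexity and cost are inverted so that a higher total always means a better fit):

```typescript
// Hypothetical weighted scoring over the Option.score shape above.
// complexity and cost are lower-is-better, so they are inverted (6 - x).
type Score = {
  complexity: number; scalability: number; reliability: number;
  cost: number; teamExperience: number;
};

function weightedScore(s: Score, w: Score): number {
  return (
    (6 - s.complexity) * w.complexity +
    s.scalability * w.scalability +
    s.reliability * w.reliability +
    (6 - s.cost) * w.cost +
    s.teamExperience * w.teamExperience
  );
}

// Session-store decision: latency-driven simplicity and team familiarity
// matter most here; durability matters less for ephemeral data.
const weights: Score = { complexity: 2, scalability: 2, reliability: 1, cost: 1, teamExperience: 3 };
const redis: Score = { complexity: 1, scalability: 4, reliability: 3, cost: 3, teamExperience: 5 };
const postgres: Score = { complexity: 2, scalability: 3, reliability: 5, cost: 2, teamExperience: 5 };

weightedScore(redis, weights);    // edges out PostgreSQL under these weights,
weightedScore(postgres, weights); // consistent with the "Redis" decision above
```

Note that the weights are themselves a judgment call; the value of the exercise is forcing that judgment to be explicit and recorded.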

Decision Framework for Architects

The STAR Framework for Technical Decisions

  • S - Situation: What are the constraints? Scale, team size, timeline, budget, existing systems
  • T - Trade-offs: What are you gaining and what are you giving up with each option?
  • A - Action: What is the recommendation and why? Make the decision reversible if possible
  • R - Review: Set a date to review the decision. Was it correct? What would you change?

Key Principles

  • Optimize for the common case. Design for the 95th percentile of your workload, not the edge cases
  • Make decisions reversible. Prefer options that can be changed later over options that lock you in
  • Boring technology is good technology. Use well-understood tools unless you have a compelling reason not to
  • Measure, do not assume. Benchmark before optimizing. Profile before refactoring. Load test before scaling
  • Document your decisions. Future engineers (including future you) will need to understand not just what you built, but why you built it that way
  • There is no best architecture. There is only the best architecture for your specific constraints, requirements, and team. Context is everything
