TechLead
Lesson 12 of 30
6 min read
System Design

Database Replication Strategies

Master database replication including master-slave, multi-master, synchronous vs async replication, consistency models, and conflict resolution

What is Database Replication?

Database replication is the process of copying and maintaining database objects across multiple database servers. Replication provides redundancy, improves read performance, and ensures high availability. In a distributed system, replication is not optional -- it is fundamental to building reliable, scalable applications.

The core challenge of replication is keeping data consistent across nodes while maintaining performance. Different replication strategies make different trade-offs between consistency, availability, and latency.

Why Replicate?

  • High Availability: If one node fails, others can continue serving requests
  • Read Scalability: Distribute read queries across multiple replicas
  • Geographic Distribution: Place data closer to users in different regions
  • Disaster Recovery: Maintain copies in different data centers or availability zones
  • Analytics Isolation: Run heavy analytical queries on replicas without impacting production

Master-Slave (Primary-Replica) Replication

The most common replication topology. One node is designated as the master (primary) and handles all write operations. One or more slave (replica) nodes receive copies of the data and serve read queries.

How It Works

  • All writes go to the master node
  • The master writes changes to a replication log (binlog in MySQL, WAL in PostgreSQL)
  • Replicas consume the log and apply changes to their local copies
  • Read queries can be served by any replica or the master
// Master-Slave routing logic
interface DatabasePool {
  master: DatabaseConnection;
  replicas: DatabaseConnection[];
}

class ReplicaRouter {
  private pool: DatabasePool;
  private currentReplica = 0;

  constructor(pool: DatabasePool) {
    this.pool = pool;
  }

  // All writes go to master
  async write(query: string, params: any[]): Promise<any> {
    return this.pool.master.execute(query, params);
  }

  // Reads are distributed across replicas (round-robin)
  async read(query: string, params: any[]): Promise<any> {
    const replica = this.pool.replicas[this.currentReplica];
    this.currentReplica = (this.currentReplica + 1) % this.pool.replicas.length;
    return replica.execute(query, params);
  }

  // For reads that must see latest writes (read-after-write consistency)
  async readFromMaster(query: string, params: any[]): Promise<any> {
    return this.pool.master.execute(query, params);
  }
}

Master-Slave Limitations

  • Single Write Bottleneck: All writes must go through one node, limiting write throughput
  • Replication Lag: Replicas may serve stale data during high write loads
  • Failover Complexity: Promoting a replica to master requires coordination and may cause brief downtime

Master-Master (Multi-Master) Replication

In multi-master replication, two or more nodes accept write operations. Each master replicates its changes to the other masters. This eliminates the single write bottleneck but introduces the challenge of write conflicts.

When to Use Multi-Master

  • Multi-region deployments: Each region has a local master to minimize write latency
  • High write throughput: When a single master cannot handle the write volume
  • Active-active setups: Both data centers actively serve traffic simultaneously

Replication Topology Comparison

Aspect Master-Slave Multi-Master
Write NodesSingleMultiple
Write ScalabilityLimited by one nodeScales horizontally
Conflict HandlingNo conflicts (single writer)Must handle conflicts
ConsistencyEasier to maintainEventually consistent by default
FailoverRequires promotionAutomatic (other masters available)
ComplexitySimplerSignificantly more complex

Synchronous vs Asynchronous Replication

The timing of when data is replicated to secondary nodes is a critical design decision that affects both consistency and performance.

Synchronous Replication

The master waits for at least one replica to confirm it has written the data before acknowledging the write to the client. This guarantees that the data exists on multiple nodes but adds latency to every write operation.

Asynchronous Replication

The master acknowledges the write immediately after writing locally and replicates to followers in the background. This provides lower write latency but risks data loss if the master fails before replication completes.

Semi-Synchronous Replication

A practical middle ground: the master waits for at least one replica to confirm, but not all replicas. MySQL supports this natively with rpl_semi_sync_master_wait_for_slave_count.

// Illustrating sync vs async replication behavior

// Synchronous: write is only confirmed after replica acknowledges
async function syncWrite(master: DB, replicas: DB[], data: Record<string, any>): Promise<void> {
  // Write to master
  await master.insert(data);

  // Wait for at least one replica to confirm
  await Promise.any(
    replicas.map((replica) => replica.applyReplicationEvent(data))
  );

  // Only now confirm to the client
  return; // Write confirmed
}

// Asynchronous: write is confirmed immediately, replication is background
async function asyncWrite(master: DB, replicas: DB[], data: Record<string, any>): Promise<void> {
  // Write to master
  await master.insert(data);

  // Confirm to client immediately
  // Replication happens in background
  Promise.allSettled(
    replicas.map((replica) => replica.applyReplicationEvent(data))
  ).catch((err) => console.error("Replication failed:", err));

  return; // Write confirmed before replication
}

Replication Lag and Eventual Consistency

Replication lag is the delay between a write on the master and when that write becomes visible on replicas. In asynchronous replication, lag can range from milliseconds to seconds (or even minutes under heavy load). This means replicas serve eventually consistent data.

Common Problems Caused by Replication Lag

  • Read-after-write inconsistency: A user writes data and immediately reads it back from a replica that has not yet received the update
  • Monotonic read violations: A user sees a newer value, then on the next request hits a different, more lagged replica and sees an older value
  • Causality violations: A reply to a comment appears before the original comment on a lagged replica

Mitigating Replication Lag

class ConsistentReader {
  private pool: DatabasePool;

  // Strategy 1: Read-after-write consistency
  // After a write, read from master for a short window
  async readAfterWrite(
    userId: string,
    query: string,
    params: any[],
    lastWriteTimestamp: number
  ): Promise<any> {
    const lagWindow = 5000; // 5 seconds
    const timeSinceWrite = Date.now() - lastWriteTimestamp;

    if (timeSinceWrite < lagWindow) {
      // Recent write, read from master
      return this.pool.master.execute(query, params);
    }

    // Safe to read from replica
    return this.readFromReplica(query, params);
  }

  // Strategy 2: Monotonic reads
  // Always route a user to the same replica within a session
  async monotonicRead(
    sessionId: string,
    query: string,
    params: any[]
  ): Promise<any> {
    const replicaIndex = this.hashToReplica(sessionId);
    return this.pool.replicas[replicaIndex].execute(query, params);
  }

  private hashToReplica(key: string): number {
    let hash = 0;
    for (let i = 0; i < key.length; i++) {
      hash = (hash * 31 + key.charCodeAt(i)) % this.pool.replicas.length;
    }
    return hash;
  }
}

Conflict Resolution Strategies

In multi-master replication, two nodes can modify the same row simultaneously, creating a write conflict. There are several strategies to resolve conflicts.

  • Last Write Wins (LWW): The write with the latest timestamp wins. Simple but can lose data. Used by Cassandra.
  • Merge/Application-level Resolution: The application defines custom merge logic. For example, a shopping cart could take the union of items from both versions.
  • CRDTs (Conflict-free Replicated Data Types): Data structures designed to be merged automatically without conflicts. Used in collaborative editing tools.
  • Version Vectors: Track the causal history of each update to detect and resolve concurrent modifications. Used by DynamoDB and Riak.
// Last Write Wins conflict resolution
interface VersionedRecord {
  data: Record<string, any>;
  timestamp: number;
  nodeId: string;
}

function resolveConflictLWW(a: VersionedRecord, b: VersionedRecord): VersionedRecord {
  if (a.timestamp > b.timestamp) return a;
  if (b.timestamp > a.timestamp) return b;
  // Tiebreaker: higher node ID wins
  return a.nodeId > b.nodeId ? a : b;
}

// Application-level merge for a shopping cart
interface CartItem { productId: string; quantity: number; }

function mergeCart(cartA: CartItem[], cartB: CartItem[]): CartItem[] {
  const merged = new Map<string, number>();
  for (const item of [...cartA, ...cartB]) {
    const current = merged.get(item.productId) || 0;
    merged.set(item.productId, Math.max(current, item.quantity));
  }
  return Array.from(merged.entries()).map(([productId, quantity]) => ({
    productId,
    quantity,
  }));
}

Replication Tools and Real-World Examples

Database Replication Method Notes
PostgreSQLStreaming replication (WAL)Supports sync, async, and logical replication
MySQLBinlog replicationRow-based, statement-based, or mixed; supports Group Replication for multi-master
MongoDBReplica setsAutomatic failover with elections; oplog-based replication
CassandraPeer-to-peer (masterless)All nodes are equal; tunable consistency with quorum reads/writes
CockroachDBRaft consensusStrongly consistent, distributed SQL with automatic replication

Interview Tips

  • Default to master-slave for most interview problems unless write scaling is a stated requirement
  • Always mention replication lag and how your design handles read-after-write consistency
  • Know the CAP theorem implications: replication forces you to choose between consistency and availability during partitions
  • Discuss failover: what happens when the master goes down? Manual vs automatic promotion

Continue Learning