What is Database Replication Strategies?

Master database replication including master-slave, multi-master, synchronous vs async replication, consistency models, and conflict resolution

Database Replication Strategies - System Design Tutorial | TechLead

Q: Master-Master (Multi-Master) Replication

In multi-master replication, two or more nodes accept write operations. Each master replicates its changes to the other masters. This eliminates the single write bottleneck but introduces the challenge of write conflicts .

What is Database Replication?

Database replication is the process of copying and maintaining database objects across multiple database servers. Replication provides redundancy, improves read performance, and ensures high availability. In a distributed system, replication is not optional -- it is fundamental to building reliable, scalable applications.

The core challenge of replication is keeping data consistent across nodes while maintaining performance. Different replication strategies make different trade-offs between consistency, availability, and latency.

Why Replicate?

High Availability: If one node fails, others can continue serving requests
Read Scalability: Distribute read queries across multiple replicas
Geographic Distribution: Place data closer to users in different regions
Disaster Recovery: Maintain copies in different data centers or availability zones
Analytics Isolation: Run heavy analytical queries on replicas without impacting production

Master-Slave (Primary-Replica) Replication

The most common replication topology. One node is designated as the master (primary) and handles all write operations. One or more slave (replica) nodes receive copies of the data and serve read queries.

How It Works

All writes go to the master node
The master writes changes to a replication log (binlog in MySQL, WAL in PostgreSQL)
Replicas consume the log and apply changes to their local copies
Read queries can be served by any replica or the master

// Master-Slave routing logic
interface DatabasePool {
  master: DatabaseConnection;
  replicas: DatabaseConnection[];
}

class ReplicaRouter {
  private pool: DatabasePool;
  private currentReplica = 0;

  constructor(pool: DatabasePool) {
    this.pool = pool;
  }

  // All writes go to master
  async write(query: string, params: any[]): Promise<any> {
    return this.pool.master.execute(query, params);
  }

  // Reads are distributed across replicas (round-robin)
  async read(query: string, params: any[]): Promise<any> {
    const replica = this.pool.replicas[this.currentReplica];
    this.currentReplica = (this.currentReplica + 1) % this.pool.replicas.length;
    return replica.execute(query, params);
  }

  // For reads that must see latest writes (read-after-write consistency)
  async readFromMaster(query: string, params: any[]): Promise<any> {
    return this.pool.master.execute(query, params);
  }
}

Master-Slave Limitations

Single Write Bottleneck: All writes must go through one node, limiting write throughput
Replication Lag: Replicas may serve stale data during high write loads
Failover Complexity: Promoting a replica to master requires coordination and may cause brief downtime

Master-Master (Multi-Master) Replication

In multi-master replication, two or more nodes accept write operations. Each master replicates its changes to the other masters. This eliminates the single write bottleneck but introduces the challenge of write conflicts.

When to Use Multi-Master

Multi-region deployments: Each region has a local master to minimize write latency
High write throughput: When a single master cannot handle the write volume
Active-active setups: Both data centers actively serve traffic simultaneously

Replication Topology Comparison

Aspect	Master-Slave	Multi-Master
Write Nodes	Single	Multiple
Write Scalability	Limited by one node	Scales horizontally
Conflict Handling	No conflicts (single writer)	Must handle conflicts
Consistency	Easier to maintain	Eventually consistent by default
Failover	Requires promotion	Automatic (other masters available)
Complexity	Simpler	Significantly more complex

Synchronous vs Asynchronous Replication

The timing of when data is replicated to secondary nodes is a critical design decision that affects both consistency and performance.

Synchronous Replication

The master waits for at least one replica to confirm it has written the data before acknowledging the write to the client. This guarantees that the data exists on multiple nodes but adds latency to every write operation.

Asynchronous Replication

The master acknowledges the write immediately after writing locally and replicates to followers in the background. This provides lower write latency but risks data loss if the master fails before replication completes.

Semi-Synchronous Replication

A practical middle ground: the master waits for at least one replica to confirm, but not all replicas. MySQL supports this natively with rpl_semi_sync_master_wait_for_slave_count.

// Illustrating sync vs async replication behavior

// Synchronous: write is only confirmed after replica acknowledges
async function syncWrite(master: DB, replicas: DB[], data: Record<string, any>): Promise<void> {
  // Write to master
  await master.insert(data);

  // Wait for at least one replica to confirm
  await Promise.any(
    replicas.map((replica) => replica.applyReplicationEvent(data))
  );

  // Only now confirm to the client
  return; // Write confirmed
}

// Asynchronous: write is confirmed immediately, replication is background
async function asyncWrite(master: DB, replicas: DB[], data: Record<string, any>): Promise<void> {
  // Write to master
  await master.insert(data);

  // Confirm to client immediately
  // Replication happens in background
  Promise.allSettled(
    replicas.map((replica) => replica.applyReplicationEvent(data))
  ).catch((err) => console.error("Replication failed:", err));

  return; // Write confirmed before replication
}

Replication Lag and Eventual Consistency

Replication lag is the delay between a write on the master and when that write becomes visible on replicas. In asynchronous replication, lag can range from milliseconds to seconds (or even minutes under heavy load). This means replicas serve eventually consistent data.

Common Problems Caused by Replication Lag

Read-after-write inconsistency: A user writes data and immediately reads it back from a replica that has not yet received the update
Monotonic read violations: A user sees a newer value, then on the next request hits a different, more lagged replica and sees an older value
Causality violations: A reply to a comment appears before the original comment on a lagged replica

Mitigating Replication Lag

class ConsistentReader {
  private pool: DatabasePool;

  // Strategy 1: Read-after-write consistency
  // After a write, read from master for a short window
  async readAfterWrite(
    userId: string,
    query: string,
    params: any[],
    lastWriteTimestamp: number
  ): Promise<any> {
    const lagWindow = 5000; // 5 seconds
    const timeSinceWrite = Date.now() - lastWriteTimestamp;

    if (timeSinceWrite < lagWindow) {
      // Recent write, read from master
      return this.pool.master.execute(query, params);
    }

    // Safe to read from replica
    return this.readFromReplica(query, params);
  }

  // Strategy 2: Monotonic reads
  // Always route a user to the same replica within a session
  async monotonicRead(
    sessionId: string,
    query: string,
    params: any[]
  ): Promise<any> {
    const replicaIndex = this.hashToReplica(sessionId);
    return this.pool.replicas[replicaIndex].execute(query, params);
  }

  private hashToReplica(key: string): number {
    let hash = 0;
    for (let i = 0; i < key.length; i++) {
      hash = (hash * 31 + key.charCodeAt(i)) % this.pool.replicas.length;
    }
    return hash;
  }
}

Conflict Resolution Strategies

In multi-master replication, two nodes can modify the same row simultaneously, creating a write conflict. There are several strategies to resolve conflicts.

Last Write Wins (LWW): The write with the latest timestamp wins. Simple but can lose data. Used by Cassandra.
Merge/Application-level Resolution: The application defines custom merge logic. For example, a shopping cart could take the union of items from both versions.
CRDTs (Conflict-free Replicated Data Types): Data structures designed to be merged automatically without conflicts. Used in collaborative editing tools.
Version Vectors: Track the causal history of each update to detect and resolve concurrent modifications. Used by DynamoDB and Riak.

// Last Write Wins conflict resolution
interface VersionedRecord {
  data: Record<string, any>;
  timestamp: number;
  nodeId: string;
}

function resolveConflictLWW(a: VersionedRecord, b: VersionedRecord): VersionedRecord {
  if (a.timestamp > b.timestamp) return a;
  if (b.timestamp > a.timestamp) return b;
  // Tiebreaker: higher node ID wins
  return a.nodeId > b.nodeId ? a : b;
}

// Application-level merge for a shopping cart
interface CartItem { productId: string; quantity: number; }

function mergeCart(cartA: CartItem[], cartB: CartItem[]): CartItem[] {
  const merged = new Map<string, number>();
  for (const item of [...cartA, ...cartB]) {
    const current = merged.get(item.productId) || 0;
    merged.set(item.productId, Math.max(current, item.quantity));
  }
  return Array.from(merged.entries()).map(([productId, quantity]) => ({
    productId,
    quantity,
  }));
}

Replication Tools and Real-World Examples

Database	Replication Method	Notes
PostgreSQL	Streaming replication (WAL)	Supports sync, async, and logical replication
MySQL	Binlog replication	Row-based, statement-based, or mixed; supports Group Replication for multi-master
MongoDB	Replica sets	Automatic failover with elections; oplog-based replication
Cassandra	Peer-to-peer (masterless)	All nodes are equal; tunable consistency with quorum reads/writes
CockroachDB	Raft consensus	Strongly consistent, distributed SQL with automatic replication

Interview Tips

Default to master-slave for most interview problems unless write scaling is a stated requirement
Always mention replication lag and how your design handles read-after-write consistency
Know the CAP theorem implications: replication forces you to choose between consistency and availability during partitions
Discuss failover: what happens when the master goes down? Manual vs automatic promotion

Database Replication Strategies