What is Database Replication?
Database replication is the process of copying and maintaining database objects across multiple database servers. Replication provides redundancy, improves read performance, and ensures high availability. In a distributed system, replication is not optional -- it is fundamental to building reliable, scalable applications.
The core challenge of replication is keeping data consistent across nodes while maintaining performance. Different replication strategies make different trade-offs between consistency, availability, and latency.
Why Replicate?
- High Availability: If one node fails, others can continue serving requests
- Read Scalability: Distribute read queries across multiple replicas
- Geographic Distribution: Place data closer to users in different regions
- Disaster Recovery: Maintain copies in different data centers or availability zones
- Analytics Isolation: Run heavy analytical queries on replicas without impacting production
Master-Slave (Primary-Replica) Replication
Master-slave is the most common replication topology: one node is designated as the master (primary) and handles all write operations, while one or more slave (replica) nodes receive copies of the data and serve read queries.
How It Works
- All writes go to the master node
- The master writes changes to a replication log (binlog in MySQL, WAL in PostgreSQL)
- Replicas consume the log and apply changes to their local copies
- Read queries can be served by any replica or the master
```typescript
// Master-slave routing logic

// Minimal connection interface assumed by the examples in this article
interface DatabaseConnection {
  execute(query: string, params: any[]): Promise<any>;
}

interface DatabasePool {
  master: DatabaseConnection;
  replicas: DatabaseConnection[];
}

class ReplicaRouter {
  private pool: DatabasePool;
  private currentReplica = 0;

  constructor(pool: DatabasePool) {
    this.pool = pool;
  }

  // All writes go to the master
  async write(query: string, params: any[]): Promise<any> {
    return this.pool.master.execute(query, params);
  }

  // Reads are distributed across replicas (round-robin)
  async read(query: string, params: any[]): Promise<any> {
    const replica = this.pool.replicas[this.currentReplica];
    this.currentReplica = (this.currentReplica + 1) % this.pool.replicas.length;
    return replica.execute(query, params);
  }

  // For reads that must see the latest writes (read-after-write consistency)
  async readFromMaster(query: string, params: any[]): Promise<any> {
    return this.pool.master.execute(query, params);
  }
}
```
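The router above handles query routing; the log-shipping steps themselves can be sketched as a replica-side apply loop. This is an illustrative sketch, not a real driver API: `ReplicationEvent`, the LSN field, and the fetch callback are all assumed names standing in for binlog/WAL machinery.

```typescript
// Sketch of a replica consuming a replication log (binlog/WAL-style).
// Event shape and fetch API are hypothetical.
interface ReplicationEvent {
  lsn: number; // log sequence number / binlog position
  sql: string; // change to apply
}

class ReplicaApplier {
  private lastAppliedLsn = 0;

  constructor(private fetchEvents: (afterLsn: number) => ReplicationEvent[]) {}

  // Apply all events newer than the last applied position, in log order
  applyPending(apply: (sql: string) => void): number {
    const events = this.fetchEvents(this.lastAppliedLsn).sort(
      (a, b) => a.lsn - b.lsn
    );
    for (const event of events) {
      apply(event.sql);
      this.lastAppliedLsn = event.lsn; // checkpoint progress
    }
    return this.lastAppliedLsn;
  }
}
```

Tracking the last applied position is what makes the loop safe to re-run: on restart, the replica resumes from its checkpoint instead of re-applying old events.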
Master-Slave Limitations
- Single Write Bottleneck: All writes must go through one node, limiting write throughput
- Replication Lag: Replicas may serve stale data during high write loads
- Failover Complexity: Promoting a replica to master requires coordination and may cause brief downtime
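One common promotion heuristic is to elect the healthy replica that has applied the most of the replication log, since it loses the fewest acknowledged writes. The sketch below illustrates only the selection step (field names are assumptions); real failover also involves fencing the old master and repointing clients.

```typescript
// Hypothetical sketch of choosing a replica to promote after master failure.
interface ReplicaStatus {
  id: string;
  appliedLsn: number; // how far this replica has caught up in the log
  healthy: boolean;
}

function chooseNewMaster(replicas: ReplicaStatus[]): ReplicaStatus | null {
  const candidates = replicas.filter((r) => r.healthy);
  if (candidates.length === 0) return null;
  // The most caught-up replica minimizes lost writes
  return candidates.reduce((best, r) =>
    r.appliedLsn > best.appliedLsn ? r : best
  );
}
```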
Master-Master (Multi-Master) Replication
In multi-master replication, two or more nodes accept write operations. Each master replicates its changes to the other masters. This eliminates the single write bottleneck but introduces the challenge of write conflicts.
When to Use Multi-Master
- Multi-region deployments: Each region has a local master to minimize write latency
- High write throughput: When a single master cannot handle the write volume
- Active-active setups: Both data centers actively serve traffic simultaneously
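A multi-region deployment like the one described above typically routes each write to the master in the caller's region, falling back to another region if no local master exists. This is a minimal sketch under assumed names (the region keys, pool shape, and fallback policy are illustrative); cross-master replication is not shown.

```typescript
// Sketch: route each write to the master in the caller's region.
// Each region's master replicates its changes to its peers (not shown).
interface RegionalMasters {
  [region: string]: { execute(sql: string): void };
}

function routeWrite(
  masters: RegionalMasters,
  region: string,
  sql: string,
  fallbackRegion: string
): string {
  const target = masters[region] ? region : fallbackRegion;
  masters[target].execute(sql);
  return target; // which master accepted the write
}
```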
Replication Topology Comparison
| Aspect | Master-Slave | Multi-Master |
|---|---|---|
| Write Nodes | Single | Multiple |
| Write Scalability | Limited by one node | Scales horizontally |
| Conflict Handling | No conflicts (single writer) | Must handle conflicts |
| Consistency | Easier to maintain | Eventually consistent by default |
| Failover | Requires promotion | Automatic (other masters available) |
| Complexity | Simpler | Significantly more complex |
Synchronous vs Asynchronous Replication
The timing of when data is replicated to secondary nodes is a critical design decision that affects both consistency and performance.
Synchronous Replication
The master waits for at least one replica to confirm it has written the data before acknowledging the write to the client. This guarantees that the data exists on multiple nodes but adds latency to every write operation.
Asynchronous Replication
The master acknowledges the write immediately after writing locally and replicates to followers in the background. This provides lower write latency but risks data loss if the master fails before replication completes.
Semi-Synchronous Replication
A practical middle ground: the master waits for at least one replica to confirm, but not all replicas. MySQL supports this natively with rpl_semi_sync_master_wait_for_slave_count.
```typescript
// Illustrating sync vs async replication behavior

// Minimal interface assumed by these examples
interface DB {
  insert(data: Record<string, any>): Promise<void>;
  applyReplicationEvent(data: Record<string, any>): Promise<void>;
}

// Synchronous: the write is only confirmed after a replica acknowledges
async function syncWrite(master: DB, replicas: DB[], data: Record<string, any>): Promise<void> {
  // Write to the master
  await master.insert(data);
  // Wait for at least one replica to confirm
  await Promise.any(
    replicas.map((replica) => replica.applyReplicationEvent(data))
  );
  // Only now confirm to the client
}

// Asynchronous: the write is confirmed immediately; replication runs in the background
async function asyncWrite(master: DB, replicas: DB[], data: Record<string, any>): Promise<void> {
  // Write to the master
  await master.insert(data);
  // Replication happens in the background. Note: Promise.allSettled never
  // rejects, so failures must be read from the results, not from .catch
  Promise.allSettled(
    replicas.map((replica) => replica.applyReplicationEvent(data))
  ).then((results) => {
    for (const result of results) {
      if (result.status === "rejected") {
        console.error("Replication failed:", result.reason);
      }
    }
  });
  // Write is confirmed before replication completes
}
```
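The semi-synchronous middle ground can be sketched as waiting for the first k of n acknowledgements rather than one or all. This helper is an illustration (assuming k >= 1), not a real database API: it resolves once k replica promises settle successfully and rejects once k confirmations become impossible.

```typescript
// Sketch: acknowledge a write after the first k of n replicas confirm.
function waitForAcks(acks: Promise<void>[], k: number): Promise<void> {
  return new Promise((resolve, reject) => {
    let confirmed = 0;
    let failed = 0;
    for (const ack of acks) {
      ack
        .then(() => {
          confirmed++;
          if (confirmed === k) resolve();
        })
        .catch(() => {
          failed++;
          // If too many replicas fail, k confirmations can never arrive
          if (acks.length - failed < k) {
            reject(new Error("quorum unreachable"));
          }
        });
    }
  });
}
```

Usage mirrors syncWrite above: write to the master first, then `await waitForAcks(replicas.map((r) => r.applyReplicationEvent(data)), 1)` before confirming to the client.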
Replication Lag and Eventual Consistency
Replication lag is the delay between a write on the master and when that write becomes visible on replicas. In asynchronous replication, lag can range from milliseconds to seconds (or even minutes under heavy load). This means replicas serve eventually consistent data.
Common Problems Caused by Replication Lag
- Read-after-write inconsistency: A user writes data and immediately reads it back from a replica that has not yet received the update
- Monotonic read violations: A user sees a newer value, then on the next request hits a different, more lagged replica and sees an older value
- Causality violations: A reply to a comment appears before the original comment on a lagged replica
Mitigating Replication Lag
```typescript
class ConsistentReader {
  private pool: DatabasePool;
  private currentReplica = 0;

  constructor(pool: DatabasePool) {
    this.pool = pool;
  }

  // Strategy 1: Read-after-write consistency
  // After a write, read from the master for a short window
  async readAfterWrite(
    userId: string,
    query: string,
    params: any[],
    lastWriteTimestamp: number
  ): Promise<any> {
    const lagWindow = 5000; // 5 seconds
    const timeSinceWrite = Date.now() - lastWriteTimestamp;
    if (timeSinceWrite < lagWindow) {
      // Recent write: read from the master
      return this.pool.master.execute(query, params);
    }
    // Safe to read from a replica
    return this.readFromReplica(query, params);
  }

  // Strategy 2: Monotonic reads
  // Always route a user to the same replica within a session
  async monotonicRead(
    sessionId: string,
    query: string,
    params: any[]
  ): Promise<any> {
    const replicaIndex = this.hashToReplica(sessionId);
    return this.pool.replicas[replicaIndex].execute(query, params);
  }

  // Round-robin read across replicas
  private async readFromReplica(query: string, params: any[]): Promise<any> {
    const replica = this.pool.replicas[this.currentReplica];
    this.currentReplica = (this.currentReplica + 1) % this.pool.replicas.length;
    return replica.execute(query, params);
  }

  private hashToReplica(key: string): number {
    let hash = 0;
    for (let i = 0; i < key.length; i++) {
      hash = (hash * 31 + key.charCodeAt(i)) % this.pool.replicas.length;
    }
    return hash;
  }
}
```
Conflict Resolution Strategies
In multi-master replication, two nodes can modify the same row simultaneously, creating a write conflict. There are several strategies to resolve conflicts.
- Last Write Wins (LWW): The write with the latest timestamp wins. Simple but can lose data. Used by Cassandra.
- Merge/Application-level Resolution: The application defines custom merge logic. For example, a shopping cart could take the union of items from both versions.
- CRDTs (Conflict-free Replicated Data Types): Data structures designed to be merged automatically without conflicts. Used in collaborative editing tools.
- Version Vectors: Track the causal history of each update to detect and resolve concurrent modifications. Used by Riak and described in Amazon's Dynamo paper.
```typescript
// Last Write Wins conflict resolution
interface VersionedRecord {
  data: Record<string, any>;
  timestamp: number;
  nodeId: string;
}

function resolveConflictLWW(a: VersionedRecord, b: VersionedRecord): VersionedRecord {
  if (a.timestamp > b.timestamp) return a;
  if (b.timestamp > a.timestamp) return b;
  // Tiebreaker: higher node ID wins
  return a.nodeId > b.nodeId ? a : b;
}

// Application-level merge for a shopping cart
interface CartItem {
  productId: string;
  quantity: number;
}

function mergeCart(cartA: CartItem[], cartB: CartItem[]): CartItem[] {
  const merged = new Map<string, number>();
  for (const item of [...cartA, ...cartB]) {
    // Take the max quantity per product rather than summing,
    // to avoid double-counting items present in both versions
    const current = merged.get(item.productId) || 0;
    merged.set(item.productId, Math.max(current, item.quantity));
  }
  return Array.from(merged.entries()).map(([productId, quantity]) => ({
    productId,
    quantity,
  }));
}
```
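The version-vector strategy from the list above can be sketched as comparing per-node update counters: if each vector has a node the other hasn't seen, the updates were concurrent and need one of the resolution strategies. This is a minimal illustration of the comparison step only, not a full implementation.

```typescript
// Sketch: compare two version vectors (node ID -> update count).
type VersionVector = Record<string, number>;

type Ordering = "before" | "after" | "equal" | "concurrent";

function compareVectors(a: VersionVector, b: VersionVector): Ordering {
  let aAhead = false;
  let bAhead = false;
  const nodes = new Set([...Object.keys(a), ...Object.keys(b)]);
  for (const node of nodes) {
    const ca = a[node] ?? 0;
    const cb = b[node] ?? 0;
    if (ca > cb) aAhead = true;
    if (cb > ca) bAhead = true;
  }
  if (aAhead && bAhead) return "concurrent"; // true conflict: must resolve
  if (aAhead) return "after";   // a causally follows b
  if (bAhead) return "before";  // b causally follows a
  return "equal";
}
```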
Replication Tools and Real-World Examples
| Database | Replication Method | Notes |
|---|---|---|
| PostgreSQL | Streaming replication (WAL) | Supports sync, async, and logical replication |
| MySQL | Binlog replication | Row-based, statement-based, or mixed; supports Group Replication for multi-master |
| MongoDB | Replica sets | Automatic failover with elections; oplog-based replication |
| Cassandra | Peer-to-peer (masterless) | All nodes are equal; tunable consistency with quorum reads/writes |
| CockroachDB | Raft consensus | Strongly consistent, distributed SQL with automatic replication |
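The "tunable consistency" row for Cassandra refers to quorum intersection: with N copies of each row, reading R replicas is guaranteed to see the latest write to W replicas whenever R + W > N, because the read and write sets must overlap in at least one node. A one-line check:

```typescript
// Quorum overlap: a read of R copies sees the latest write of W copies
// iff R + W > N. E.g. with N = 3, QUORUM reads and writes use R = W = 2.
function quorumOverlaps(n: number, r: number, w: number): boolean {
  return r + w > n;
}
```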
Interview Tips
- Default to master-slave for most interview problems unless write scaling is a stated requirement
- Always mention replication lag and how your design handles read-after-write consistency
- Know the CAP theorem implications: replication forces you to choose between consistency and availability during partitions
- Discuss failover: what happens when the master goes down? Manual vs automatic promotion