
Distributed Consensus Algorithms

Understand distributed consensus algorithms including Paxos, Raft, leader election, and log replication used in ZooKeeper and etcd.

Distributed consensus is one of the hardest problems in computer science. It asks: how can a group of machines agree on a value, even when some machines crash or messages are delayed? Consensus is the foundation of replicated state machines, distributed databases, and coordination services. Without consensus, you cannot build reliable distributed systems.

The Consensus Problem

A consensus algorithm must satisfy three properties:

  • Agreement: All non-faulty nodes decide on the same value
  • Validity: The decided value must have been proposed by some node
  • Termination: All non-faulty nodes eventually decide on a value
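
One way to make these properties concrete is to think of consensus as an API: every node may propose a value, and every node eventually learns a single decided value. The interface below is only an illustrative sketch (the names are not from any standard library); Paxos and Raft are two ways of implementing such a contract.

// Illustrative contract for single-value consensus (hypothetical names)
interface ConsensusInstance {
  // Propose a value; resolves with the value the cluster decided on.
  // Agreement: every node's call resolves with the same decided value.
  // Validity: that value was proposed by some node, never invented.
  // Termination: on a non-faulty node, the promise eventually resolves.
  propose(value: string): Promise<string>;

  // Learn the decided value without proposing one of our own
  decide(): Promise<string>;
}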

FLP Impossibility

Fischer, Lynch, and Paterson proved in 1985 that no deterministic consensus algorithm can guarantee all three properties in an asynchronous system where even one process can crash. In practice, algorithms circumvent this impossibility by using timeouts (partial synchrony) or randomization to ensure progress.
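
In Raft, for example, the randomization lives in the election timeout: each node waits a random interval before starting an election, so repeated split votes become increasingly unlikely and the protocol keeps making progress despite FLP. A minimal sketch (the range matches the 150-300ms figure used later in this lesson):

// Randomized election timeout: different nodes almost never time out at the
// same moment, so some candidate usually wins before the others start campaigning.
function randomElectionTimeout(minMs: number = 150, maxMs: number = 300): number {
  return minMs + Math.floor(Math.random() * (maxMs - minMs));
}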

Paxos Algorithm (Simplified)

Paxos, described by Leslie Lamport in 1989 and formally published in 1998, is the classic consensus algorithm: its safety is rigorously proven, but it is notoriously difficult to understand and implement. Here is the simplified single-decree version, which reaches consensus on a single value.

Paxos Roles

  • Proposer: proposes values and drives the protocol
  • Acceptor: votes on proposals and stores accepted values
  • Learner: learns the decided value once consensus is reached

Paxos Protocol Phases

// Phase 1: Prepare
// Proposer selects a proposal number n and sends Prepare(n) to a majority of acceptors

interface PrepareRequest {
  proposalNumber: number;
}

interface PrepareResponse {
  promise: boolean;           // Will the acceptor promise not to accept lower proposals?
  acceptedProposal?: number;  // Highest proposal number previously accepted
  acceptedValue?: string;     // Value from that proposal
}

// Phase 2: Accept
// If proposer receives promises from a majority:
// - If any acceptor already accepted a value, proposer must use the value
//   from the highest-numbered accepted proposal
// - Otherwise, proposer can use its own value
// Proposer sends Accept(n, value) to a majority of acceptors

interface AcceptRequest {
  proposalNumber: number;
  value: string;
}

interface AcceptResponse {
  accepted: boolean;
}

// The proposer talks to acceptors through these two RPCs
interface Acceptor {
  prepare(request: PrepareRequest): Promise<PrepareResponse>;
  accept(request: AcceptRequest): Promise<AcceptResponse>;
}

// Simplified Paxos Proposer
class PaxosProposer {
  // In a real deployment, proposal numbers must be globally unique across
  // proposers (e.g., a round counter combined with the proposer's ID)
  private proposalNumber: number = 0;

  async propose(value: string, acceptors: Acceptor[]): Promise<string> {
    const majority = Math.floor(acceptors.length / 2) + 1;

    // Phase 1: Prepare
    this.proposalNumber++;
    const prepareResponses = await Promise.all(
      acceptors.map((a) => a.prepare({ proposalNumber: this.proposalNumber }))
    );

    const promises = prepareResponses.filter((r) => r.promise);
    if (promises.length < majority) {
      throw new Error("Failed to get majority promises");
    }

    // Use the value from the highest accepted proposal, or our own
    const highestAccepted = promises
      .filter((r) => r.acceptedProposal !== undefined)
      .sort((a, b) => (b.acceptedProposal || 0) - (a.acceptedProposal || 0))[0];

    const finalValue = highestAccepted?.acceptedValue || value;

    // Phase 2: Accept
    const acceptResponses = await Promise.all(
      acceptors.map((a) =>
        a.accept({ proposalNumber: this.proposalNumber, value: finalValue })
      )
    );

    const accepted = acceptResponses.filter((r) => r.accepted);
    if (accepted.length < majority) {
      throw new Error("Failed to get majority acceptance");
    }

    return finalValue; // Consensus reached!
  }
}
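
The lesson above shows only the proposer. For completeness, here is a minimal sketch of the acceptor side under the same request and response shapes; the field names (promisedProposal, acceptedProposal, acceptedValue) are illustrative, and a real acceptor must persist this state to stable storage before replying.

// Simplified single-decree Paxos Acceptor
class PaxosAcceptor implements Acceptor {
  private promisedProposal: number = 0; // Highest proposal number promised so far
  private acceptedProposal?: number;    // Highest proposal number accepted so far
  private acceptedValue?: string;       // Value of that accepted proposal

  async prepare(request: PrepareRequest): Promise<PrepareResponse> {
    if (request.proposalNumber > this.promisedProposal) {
      // Promise to ignore any proposal numbered lower than this one
      this.promisedProposal = request.proposalNumber;
      return {
        promise: true,
        acceptedProposal: this.acceptedProposal,
        acceptedValue: this.acceptedValue,
      };
    }
    return { promise: false };
  }

  async accept(request: AcceptRequest): Promise<AcceptResponse> {
    // Accept unless a higher-numbered proposal has already been promised
    if (request.proposalNumber >= this.promisedProposal) {
      this.promisedProposal = request.proposalNumber;
      this.acceptedProposal = request.proposalNumber;
      this.acceptedValue = request.value;
      return { accepted: true };
    }
    return { accepted: false };
  }
}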

Raft Algorithm (Detailed)

Raft, published by Diego Ongaro and John Ousterhout in 2014, was designed as an understandable alternative to Paxos. It decomposes consensus into three sub-problems: leader election, log replication, and safety.

Raft Node States

Every node in a Raft cluster is in one of three states: Follower, Candidate, or Leader. All nodes start as followers.

enum RaftState {
  FOLLOWER = "FOLLOWER",
  CANDIDATE = "CANDIDATE",
  LEADER = "LEADER",
}

interface LogEntry {
  term: number;
  index: number;
  command: string;
}

interface RaftNode {
  id: string;
  state: RaftState;
  currentTerm: number;
  votedFor: string | null;
  log: LogEntry[];
  commitIndex: number;
  lastApplied: number;

  // Leader-only state
  nextIndex: Map<string, number>;  // For each follower: next log index to send
  matchIndex: Map<string, number>; // For each follower: highest replicated index
}
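
When a node wins an election, it initializes this leader-only state optimistically: nextIndex for every follower starts just past the leader's last log entry, and matchIndex starts at zero until the follower confirms replication. A plausible sketch of that initialization (the function name mirrors the initializeLeaderState call used in the election code below):

// Initialize per-follower bookkeeping after winning an election
function initializeLeaderState(leader: RaftNode, peers: string[]): void {
  leader.nextIndex = new Map<string, number>();
  leader.matchIndex = new Map<string, number>();
  for (const peerId of peers) {
    leader.nextIndex.set(peerId, leader.log.length); // Optimistic: just past the leader's last entry
    leader.matchIndex.set(peerId, 0);                // Pessimistic: nothing confirmed replicated yet
  }
}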

Leader Election

// RequestVote RPC shapes
interface RequestVoteRequest {
  term: number;
  candidateId: string;
  lastLogIndex: number;
  lastLogTerm: number;
}

interface RequestVoteResponse {
  term: number;
  voteGranted: boolean;
}

// Note: sendRequestVote, initializeLeaderState, and sendHeartbeats are the
// RPC/bookkeeping helpers of this class, omitted here for brevity.
class RaftElection {
  private electionTimeout: number = 150 + Math.floor(Math.random() * 150); // Random between 150-300ms
  private heartbeatInterval: number = 50; // Leader sends heartbeats at this interval (ms)

  startElection(node: RaftNode, peers: string[]): void {
    // 1. Increment current term
    node.currentTerm++;
    node.state = RaftState.CANDIDATE;
    node.votedFor = node.id; // Vote for self

    let votesReceived = 1; // Self-vote
    // Majority of the full cluster (this node plus its peers)
    const majority = Math.floor((peers.length + 1) / 2) + 1;

    // 2. Send RequestVote to all peers
    for (const peerId of peers) {
      const response = this.sendRequestVote(peerId, {
        term: node.currentTerm,
        candidateId: node.id,
        lastLogIndex: node.log.length - 1,
        lastLogTerm: node.log.length > 0
          ? node.log[node.log.length - 1].term
          : 0,
      });

      if (response.voteGranted) {
        votesReceived++;
      }
      if (response.term > node.currentTerm) {
        // Discovered higher term: step down
        node.currentTerm = response.term;
        node.state = RaftState.FOLLOWER;
        return;
      }
    }

    // 3. Check if we won
    if (votesReceived >= majority) {
      node.state = RaftState.LEADER;
      this.initializeLeaderState(node, peers);
      this.sendHeartbeats(node, peers);
    }
  }

  // A node grants a vote if:
  // 1. The candidate's term >= the voter's term
  // 2. The voter hasn't voted for someone else in this term
  // 3. The candidate's log is at least as up-to-date as the voter's
  handleRequestVote(node: RaftNode, request: RequestVoteRequest): RequestVoteResponse {
    if (request.term < node.currentTerm) {
      return { term: node.currentTerm, voteGranted: false };
    }

    if (request.term > node.currentTerm) {
      node.currentTerm = request.term;
      node.state = RaftState.FOLLOWER;
      node.votedFor = null;
    }

    const logOk = request.lastLogTerm > this.getLastLogTerm(node) ||
      (request.lastLogTerm === this.getLastLogTerm(node) &&
       request.lastLogIndex >= node.log.length - 1);

    if ((node.votedFor === null || node.votedFor === request.candidateId) && logOk) {
      node.votedFor = request.candidateId;
      return { term: node.currentTerm, voteGranted: true };
    }

    return { term: node.currentTerm, voteGranted: false };
  }

  private getLastLogTerm(node: RaftNode): number {
    return node.log.length > 0 ? node.log[node.log.length - 1].term : 0;
  }
}
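
What actually triggers startElection is the election timeout: a follower that hears no heartbeat (empty AppendEntries) from a leader for one timeout period converts to candidate. A minimal sketch of that trigger, assuming lastHeartbeat is updated whenever a heartbeat arrives (the timer class itself is illustrative, not part of the Raft paper):

// Follower-side timer: if no heartbeat arrives within the election timeout,
// convert to candidate and start an election.
class ElectionTimer {
  private lastHeartbeat: number = Date.now();

  constructor(
    private node: RaftNode,
    private peers: string[],
    private election: RaftElection,
    private timeoutMs: number = 150 + Math.floor(Math.random() * 150)
  ) {}

  // Called whenever an AppendEntries heartbeat from the current leader arrives
  onHeartbeat(): void {
    this.lastHeartbeat = Date.now();
  }

  // Checked periodically, e.g. every few milliseconds
  tick(): void {
    const silence = Date.now() - this.lastHeartbeat;
    if (this.node.state !== RaftState.LEADER && silence >= this.timeoutMs) {
      this.election.startElection(this.node, this.peers);
      this.lastHeartbeat = Date.now(); // Reset so a new timeout period begins
    }
  }
}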

Log Replication

Once a leader is elected, it accepts client requests, appends them to its log, and replicates the log entries to followers. An entry is considered committed once a majority of nodes have replicated it.

// AppendEntries RPC shapes
interface AppendEntriesRequest {
  term: number;
  leaderId: string;
  prevLogIndex: number;
  prevLogTerm: number;
  entries: LogEntry[];
  leaderCommit: number;
}

interface AppendEntriesResponse {
  term: number;
  success: boolean;
}

// Note: sendAppendEntries is the RPC transport of this class, omitted here for brevity.
class RaftLogReplication {
  async replicateEntry(
    leader: RaftNode,
    command: string,
    peers: string[]
  ): Promise<boolean> {
    // 1. Append to leader's log
    const entry: LogEntry = {
      term: leader.currentTerm,
      index: leader.log.length,
      command,
    };
    leader.log.push(entry);

    // 2. Send AppendEntries to all followers
    let replicatedCount = 1; // Leader counts as one
    // Majority of the full cluster (leader plus followers)
    const majority = Math.floor((peers.length + 1) / 2) + 1;

    const results = await Promise.all(
      peers.map((peerId) =>
        this.sendAppendEntries(peerId, {
          term: leader.currentTerm,
          leaderId: leader.id,
          prevLogIndex: entry.index - 1,
          prevLogTerm: entry.index > 0 ? leader.log[entry.index - 1].term : 0,
          entries: [entry],
          leaderCommit: leader.commitIndex,
        })
      )
    );

    for (const result of results) {
      if (result.success) replicatedCount++;
    }

    // 3. If replicated to majority, commit
    if (replicatedCount >= majority) {
      leader.commitIndex = entry.index;
      return true; // Entry is committed and safe
    }

    return false;
  }
}
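
On the follower side, the AppendEntries handler enforces Raft's log-matching property: it rejects requests whose prevLogIndex and prevLogTerm do not match its own log, forcing the leader to decrement nextIndex and retry with earlier entries. A minimal sketch of that handler (conflict handling is simplified relative to a production implementation):

// Follower-side handling of AppendEntries (also serves as the heartbeat handler)
class RaftFollower {
  handleAppendEntries(node: RaftNode, request: AppendEntriesRequest): AppendEntriesResponse {
    // Reject requests from stale leaders
    if (request.term < node.currentTerm) {
      return { term: node.currentTerm, success: false };
    }

    // A valid AppendEntries means a current leader exists: stay (or become) a follower
    node.currentTerm = request.term;
    node.state = RaftState.FOLLOWER;

    // Consistency check: the log must contain an entry at prevLogIndex with prevLogTerm
    if (request.prevLogIndex >= 0) {
      const prev = node.log[request.prevLogIndex];
      if (!prev || prev.term !== request.prevLogTerm) {
        return { term: node.currentTerm, success: false }; // Leader will back up and retry
      }
    }

    // Append new entries, deleting any conflicting suffix first
    for (let i = 0; i < request.entries.length; i++) {
      const index = request.prevLogIndex + 1 + i;
      const existing = node.log[index];
      if (existing && existing.term !== request.entries[i].term) {
        node.log = node.log.slice(0, index); // Remove the conflicting entry and everything after it
      }
      if (!node.log[index]) {
        node.log[index] = request.entries[i];
      }
    }

    // Advance the commit index up to what the leader has committed
    if (request.leaderCommit > node.commitIndex) {
      node.commitIndex = Math.min(request.leaderCommit, node.log.length - 1);
    }

    return { term: node.currentTerm, success: true };
  }
}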

ZooKeeper and etcd

ZooKeeper and etcd are the two most widely used consensus-based coordination services.

  • Consensus algorithm: ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast); etcd uses Raft
  • Language: ZooKeeper is written in Java; etcd in Go
  • Data model: ZooKeeper exposes a hierarchical tree of znodes; etcd a flat key-value store
  • Watch mechanism: ZooKeeper watches are one-time triggers that must be re-registered; etcd watches are persistent and streaming
  • Primary users: ZooKeeper backs Kafka, Hadoop, and HBase; etcd backs Kubernetes and CoreDNS

Practical Applications

  • Leader election: Ensure only one instance of a service acts as leader (e.g., one scheduler, one primary database); a lease-based sketch follows this list
  • Configuration management: Store and distribute configuration with strong consistency guarantees
  • Distributed locking: Coordinate access to shared resources across multiple services
  • Service discovery: Register and discover service instances with health checking
  • Replicated state machines: Build fault-tolerant services that survive node failures by replicating the operation log
  • Metadata management: Store partition maps, cluster membership, and schema information
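
As an example of the leader election and distributed locking bullets, coordination services are typically used through a lease: a key with a TTL plus an atomic create-if-absent write. The sketch below uses a hypothetical LeaseStore interface rather than a real client library (the etcd and ZooKeeper SDKs differ in detail), so only the shape of the idea carries over.

// Hypothetical consensus-backed store offering atomic create-if-absent with a TTL lease
interface LeaseStore {
  // Returns true only for the single caller that created the key; false if it already exists
  putIfAbsent(key: string, value: string, ttlSeconds: number): Promise<boolean>;
  // Keep the lease alive so the key does not expire while the holder is healthy
  refresh(key: string, ttlSeconds: number): Promise<void>;
}

// Whoever creates the key becomes leader and must keep refreshing the lease;
// if it crashes, the key expires and another instance can take over.
async function tryBecomeLeader(store: LeaseStore, instanceId: string): Promise<boolean> {
  const isLeader = await store.putIfAbsent("/service/leader", instanceId, 10);
  if (isLeader) {
    setInterval(() => store.refresh("/service/leader", 10), 3000); // Refresh well before the TTL
  }
  return isLeader;
}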

When NOT to Use Consensus

Consensus algorithms add latency (every write requires a majority quorum) and complexity. Do not use them for high-throughput data storage or when eventual consistency is acceptable. Use consensus for coordination metadata (who is the leader? what is the partition map?) and use simpler replication for bulk data.
