
Distributed Consensus Algorithms

Understand distributed consensus algorithms including Paxos, Raft, leader election, and log replication used in ZooKeeper and etcd.

Distributed consensus is one of the hardest problems in computer science. It asks: how can a group of machines agree on a value, even when some machines crash or messages are delayed? Consensus is the foundation of replicated state machines, distributed databases, and coordination services. Without consensus, you cannot build reliable distributed systems.

The Consensus Problem

A consensus algorithm must satisfy three properties:

  • Agreement: All non-faulty nodes decide on the same value
  • Validity: The decided value must have been proposed by some node
  • Termination: All non-faulty nodes eventually decide on a value
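
One way to make these properties concrete is to think of consensus as an API: every node may propose a value, and every node eventually learns a single decided value. The interface below is only an illustrative sketch (the names are not from any standard library); Paxos and Raft are two ways of implementing such a contract.

// Illustrative contract for single-value consensus (hypothetical names)
interface ConsensusInstance {
  // Propose a value; resolves with the value the cluster decided on.
  // Agreement: every node's call resolves with the same decided value.
  // Validity: that value was proposed by some node, never invented.
  // Termination: on a non-faulty node, the promise eventually resolves.
  propose(value: string): Promise<string>;

  // Learn the decided value without proposing one of our own
  decide(): Promise<string>;
}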

FLP Impossibility

Fischer, Lynch, and Paterson proved in 1985 that no deterministic consensus algorithm can guarantee all three properties in an asynchronous system where even one process can crash. In practice, algorithms circumvent this impossibility by using timeouts (partial synchrony) or randomization to ensure progress.
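
In Raft, for example, the randomization lives in the election timeout: each node waits a random interval before starting an election, so repeated split votes become increasingly unlikely and the protocol keeps making progress despite FLP. A minimal sketch (the range matches the 150-300ms figure used later in this lesson):

// Randomized election timeout: different nodes almost never time out at the
// same moment, so some candidate usually wins before the others start campaigning.
function randomElectionTimeout(minMs: number = 150, maxMs: number = 300): number {
  return minMs + Math.floor(Math.random() * (maxMs - minMs));
}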

Paxos Algorithm (Simplified)

Paxos, described by Leslie Lamport in 1989 and formally published in 1998, is the classic consensus algorithm: its safety is rigorously proven, but it is notoriously difficult to understand and implement. Here is the simplified single-decree version, which reaches consensus on a single value.

Paxos Roles

  • Proposer: proposes values and drives the protocol
  • Acceptor: votes on proposals and stores accepted values
  • Learner: learns the decided value once consensus is reached

Paxos Protocol Phases

// Phase 1: Prepare
// Proposer selects a proposal number n and sends Prepare(n) to a majority of acceptors

interface PrepareRequest {
  proposalNumber: number;
}

interface PrepareResponse {
  promise: boolean;           // Will the acceptor promise not to accept lower proposals?
  acceptedProposal?: number;  // Highest proposal number previously accepted
  acceptedValue?: string;     // Value from that proposal
}

// Phase 2: Accept
// If proposer receives promises from a majority:
// - If any acceptor already accepted a value, proposer must use the value
//   from the highest-numbered accepted proposal
// - Otherwise, proposer can use its own value
// Proposer sends Accept(n, value) to a majority of acceptors

interface AcceptRequest {
  proposalNumber: number;
  value: string;
}

interface AcceptResponse {
  accepted: boolean;
}

// The proposer talks to acceptors through these two RPCs
interface Acceptor {
  prepare(request: PrepareRequest): Promise<PrepareResponse>;
  accept(request: AcceptRequest): Promise<AcceptResponse>;
}

// Simplified Paxos Proposer
class PaxosProposer {
  // In a real deployment, proposal numbers must be globally unique across
  // proposers (e.g., a round counter combined with the proposer's ID)
  private proposalNumber: number = 0;

  async propose(value: string, acceptors: Acceptor[]): Promise<string> {
    const majority = Math.floor(acceptors.length / 2) + 1;

    // Phase 1: Prepare
    this.proposalNumber++;
    const prepareResponses = await Promise.all(
      acceptors.map((a) => a.prepare({ proposalNumber: this.proposalNumber }))
    );

    const promises = prepareResponses.filter((r) => r.promise);
    if (promises.length < majority) {
      throw new Error("Failed to get majority promises");
    }

    // Use the value from the highest accepted proposal, or our own
    const highestAccepted = promises
      .filter((r) => r.acceptedProposal !== undefined)
      .sort((a, b) => (b.acceptedProposal || 0) - (a.acceptedProposal || 0))[0];

    const finalValue = highestAccepted?.acceptedValue || value;

    // Phase 2: Accept
    const acceptResponses = await Promise.all(
      acceptors.map((a) =>
        a.accept({ proposalNumber: this.proposalNumber, value: finalValue })
      )
    );

    const accepted = acceptResponses.filter((r) => r.accepted);
    if (accepted.length < majority) {
      throw new Error("Failed to get majority acceptance");
    }

    return finalValue; // Consensus reached!
  }
}
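
The lesson above shows only the proposer. For completeness, here is a minimal sketch of the acceptor side under the same request and response shapes; the field names (promisedProposal, acceptedProposal, acceptedValue) are illustrative, and a real acceptor must persist this state to stable storage before replying.

// Simplified single-decree Paxos Acceptor
class PaxosAcceptor implements Acceptor {
  private promisedProposal: number = 0; // Highest proposal number promised so far
  private acceptedProposal?: number;    // Highest proposal number accepted so far
  private acceptedValue?: string;       // Value of that accepted proposal

  async prepare(request: PrepareRequest): Promise<PrepareResponse> {
    if (request.proposalNumber > this.promisedProposal) {
      // Promise to ignore any proposal numbered lower than this one
      this.promisedProposal = request.proposalNumber;
      return {
        promise: true,
        acceptedProposal: this.acceptedProposal,
        acceptedValue: this.acceptedValue,
      };
    }
    return { promise: false };
  }

  async accept(request: AcceptRequest): Promise<AcceptResponse> {
    // Accept unless a higher-numbered proposal has already been promised
    if (request.proposalNumber >= this.promisedProposal) {
      this.promisedProposal = request.proposalNumber;
      this.acceptedProposal = request.proposalNumber;
      this.acceptedValue = request.value;
      return { accepted: true };
    }
    return { accepted: false };
  }
}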

Raft Algorithm (Detailed)

Raft, published by Diego Ongaro and John Ousterhout in 2014, was designed as an understandable alternative to Paxos. It decomposes consensus into three sub-problems: leader election, log replication, and safety.

Raft Node States

Every node in a Raft cluster is in one of three states: Follower, Candidate, or Leader. All nodes start as followers.

enum RaftState {
  FOLLOWER = "FOLLOWER",
  CANDIDATE = "CANDIDATE",
  LEADER = "LEADER",
}

interface LogEntry {
  term: number;
  index: number;
  command: string;
}

interface RaftNode {
  id: string;
  state: RaftState;
  currentTerm: number;
  votedFor: string | null;
  log: LogEntry[];
  commitIndex: number;
  lastApplied: number;

  // Leader-only state
  nextIndex: Map<string, number>;  // For each follower: next log index to send
  matchIndex: Map<string, number>; // For each follower: highest replicated index
}
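
When a node wins an election, it initializes this leader-only state optimistically: nextIndex for every follower starts just past the leader's last log entry, and matchIndex starts at zero until the follower confirms replication. A plausible sketch of that initialization (the function name mirrors the initializeLeaderState call used in the election code below):

// Initialize per-follower bookkeeping after winning an election
function initializeLeaderState(leader: RaftNode, peers: string[]): void {
  leader.nextIndex = new Map<string, number>();
  leader.matchIndex = new Map<string, number>();
  for (const peerId of peers) {
    leader.nextIndex.set(peerId, leader.log.length); // Optimistic: just past the leader's last entry
    leader.matchIndex.set(peerId, 0);                // Pessimistic: nothing confirmed replicated yet
  }
}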

Leader Election

// RequestVote RPC shapes
interface RequestVoteRequest {
  term: number;
  candidateId: string;
  lastLogIndex: number;
  lastLogTerm: number;
}

interface RequestVoteResponse {
  term: number;
  voteGranted: boolean;
}

// Note: sendRequestVote, initializeLeaderState, and sendHeartbeats are the
// RPC/bookkeeping helpers of this class, omitted here for brevity.
class RaftElection {
  private electionTimeout: number = 150 + Math.floor(Math.random() * 150); // Random between 150-300ms
  private heartbeatInterval: number = 50; // Leader sends heartbeats at this interval (ms)

  startElection(node: RaftNode, peers: string[]): void {
    // 1. Increment current term
    node.currentTerm++;
    node.state = RaftState.CANDIDATE;
    node.votedFor = node.id; // Vote for self

    let votesReceived = 1; // Self-vote
    // Majority of the full cluster (this node plus its peers)
    const majority = Math.floor((peers.length + 1) / 2) + 1;

    // 2. Send RequestVote to all peers
    for (const peerId of peers) {
      const response = this.sendRequestVote(peerId, {
        term: node.currentTerm,
        candidateId: node.id,
        lastLogIndex: node.log.length - 1,
        lastLogTerm: node.log.length > 0
          ? node.log[node.log.length - 1].term
          : 0,
      });

      if (response.voteGranted) {
        votesReceived++;
      }
      if (response.term > node.currentTerm) {
        // Discovered higher term: step down
        node.currentTerm = response.term;
        node.state = RaftState.FOLLOWER;
        return;
      }
    }

    // 3. Check if we won
    if (votesReceived >= majority) {
      node.state = RaftState.LEADER;
      this.initializeLeaderState(node, peers);
      this.sendHeartbeats(node, peers);
    }
  }

  // A node grants a vote if:
  // 1. The candidate's term >= the voter's term
  // 2. The voter hasn't voted for someone else in this term
  // 3. The candidate's log is at least as up-to-date as the voter's
  handleRequestVote(node: RaftNode, request: RequestVoteRequest): RequestVoteResponse {
    if (request.term < node.currentTerm) {
      return { term: node.currentTerm, voteGranted: false };
    }

    if (request.term > node.currentTerm) {
      node.currentTerm = request.term;
      node.state = RaftState.FOLLOWER;
      node.votedFor = null;
    }

    const logOk = request.lastLogTerm > this.getLastLogTerm(node) ||
      (request.lastLogTerm === this.getLastLogTerm(node) &&
       request.lastLogIndex >= node.log.length - 1);

    if ((node.votedFor === null || node.votedFor === request.candidateId) && logOk) {
      node.votedFor = request.candidateId;
      return { term: node.currentTerm, voteGranted: true };
    }

    return { term: node.currentTerm, voteGranted: false };
  }

  private getLastLogTerm(node: RaftNode): number {
    return node.log.length > 0 ? node.log[node.log.length - 1].term : 0;
  }
}
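
What actually triggers startElection is the election timeout: a follower that hears no heartbeat (empty AppendEntries) from a leader for one timeout period converts to candidate. A minimal sketch of that trigger, assuming lastHeartbeat is updated whenever a heartbeat arrives (the timer class itself is illustrative, not part of the Raft paper):

// Follower-side timer: if no heartbeat arrives within the election timeout,
// convert to candidate and start an election.
class ElectionTimer {
  private lastHeartbeat: number = Date.now();

  constructor(
    private node: RaftNode,
    private peers: string[],
    private election: RaftElection,
    private timeoutMs: number = 150 + Math.floor(Math.random() * 150)
  ) {}

  // Called whenever an AppendEntries heartbeat from the current leader arrives
  onHeartbeat(): void {
    this.lastHeartbeat = Date.now();
  }

  // Checked periodically, e.g. every few milliseconds
  tick(): void {
    const silence = Date.now() - this.lastHeartbeat;
    if (this.node.state !== RaftState.LEADER && silence >= this.timeoutMs) {
      this.election.startElection(this.node, this.peers);
      this.lastHeartbeat = Date.now(); // Reset so a new timeout period begins
    }
  }
}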

Log Replication

Once a leader is elected, it accepts client requests, appends them to its log, and replicates the log entries to followers. An entry is considered committed once a majority of nodes have replicated it.

// AppendEntries RPC shapes
interface AppendEntriesRequest {
  term: number;
  leaderId: string;
  prevLogIndex: number;
  prevLogTerm: number;
  entries: LogEntry[];
  leaderCommit: number;
}

interface AppendEntriesResponse {
  term: number;
  success: boolean;
}

// Note: sendAppendEntries is the RPC transport of this class, omitted here for brevity.
class RaftLogReplication {
  async replicateEntry(
    leader: RaftNode,
    command: string,
    peers: string[]
  ): Promise<boolean> {
    // 1. Append to leader's log
    const entry: LogEntry = {
      term: leader.currentTerm,
      index: leader.log.length,
      command,
    };
    leader.log.push(entry);

    // 2. Send AppendEntries to all followers
    let replicatedCount = 1; // Leader counts as one
    // Majority of the full cluster (leader plus followers)
    const majority = Math.floor((peers.length + 1) / 2) + 1;

    const results = await Promise.all(
      peers.map((peerId) =>
        this.sendAppendEntries(peerId, {
          term: leader.currentTerm,
          leaderId: leader.id,
          prevLogIndex: entry.index - 1,
          prevLogTerm: entry.index > 0 ? leader.log[entry.index - 1].term : 0,
          entries: [entry],
          leaderCommit: leader.commitIndex,
        })
      )
    );

    for (const result of results) {
      if (result.success) replicatedCount++;
    }

    // 3. If replicated to majority, commit
    if (replicatedCount >= majority) {
      leader.commitIndex = entry.index;
      return true; // Entry is committed and safe
    }

    return false;
  }
}
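
On the follower side, the AppendEntries handler enforces Raft's log-matching property: it rejects requests whose prevLogIndex and prevLogTerm do not match its own log, forcing the leader to decrement nextIndex and retry with earlier entries. A minimal sketch of that handler (conflict handling is simplified relative to a production implementation):

// Follower-side handling of AppendEntries (also serves as the heartbeat handler)
class RaftFollower {
  handleAppendEntries(node: RaftNode, request: AppendEntriesRequest): AppendEntriesResponse {
    // Reject requests from stale leaders
    if (request.term < node.currentTerm) {
      return { term: node.currentTerm, success: false };
    }

    // A valid AppendEntries means a current leader exists: stay (or become) a follower
    node.currentTerm = request.term;
    node.state = RaftState.FOLLOWER;

    // Consistency check: the log must contain an entry at prevLogIndex with prevLogTerm
    if (request.prevLogIndex >= 0) {
      const prev = node.log[request.prevLogIndex];
      if (!prev || prev.term !== request.prevLogTerm) {
        return { term: node.currentTerm, success: false }; // Leader will back up and retry
      }
    }

    // Append new entries, deleting any conflicting suffix first
    for (let i = 0; i < request.entries.length; i++) {
      const index = request.prevLogIndex + 1 + i;
      const existing = node.log[index];
      if (existing && existing.term !== request.entries[i].term) {
        node.log = node.log.slice(0, index); // Remove the conflicting entry and everything after it
      }
      if (!node.log[index]) {
        node.log[index] = request.entries[i];
      }
    }

    // Advance the commit index up to what the leader has committed
    if (request.leaderCommit > node.commitIndex) {
      node.commitIndex = Math.min(request.leaderCommit, node.log.length - 1);
    }

    return { term: node.currentTerm, success: true };
  }
}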

ZooKeeper and etcd

ZooKeeper and etcd are the two most widely used consensus-based coordination services.

  • Consensus algorithm: ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast); etcd uses Raft
  • Language: ZooKeeper is written in Java; etcd in Go
  • Data model: ZooKeeper exposes a hierarchical tree of znodes; etcd a flat key-value store
  • Watch mechanism: ZooKeeper watches are one-time triggers that must be re-registered; etcd watches are persistent and streaming
  • Primary users: ZooKeeper backs Kafka, Hadoop, and HBase; etcd backs Kubernetes and CoreDNS

Practical Applications

  • Leader election: Ensure only one instance of a service acts as leader (e.g., one scheduler, one primary database); a lease-based sketch follows this list
  • Configuration management: Store and distribute configuration with strong consistency guarantees
  • Distributed locking: Coordinate access to shared resources across multiple services
  • Service discovery: Register and discover service instances with health checking
  • Replicated state machines: Build fault-tolerant services that survive node failures by replicating the operation log
  • Metadata management: Store partition maps, cluster membership, and schema information
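
As an example of the leader election and distributed locking bullets, coordination services are typically used through a lease: a key with a TTL plus an atomic create-if-absent write. The sketch below uses a hypothetical LeaseStore interface rather than a real client library (the etcd and ZooKeeper SDKs differ in detail), so only the shape of the idea carries over.

// Hypothetical consensus-backed store offering atomic create-if-absent with a TTL lease
interface LeaseStore {
  // Returns true only for the single caller that created the key; false if it already exists
  putIfAbsent(key: string, value: string, ttlSeconds: number): Promise<boolean>;
  // Keep the lease alive so the key does not expire while the holder is healthy
  refresh(key: string, ttlSeconds: number): Promise<void>;
}

// Whoever creates the key becomes leader and must keep refreshing the lease;
// if it crashes, the key expires and another instance can take over.
async function tryBecomeLeader(store: LeaseStore, instanceId: string): Promise<boolean> {
  const isLeader = await store.putIfAbsent("/service/leader", instanceId, 10);
  if (isLeader) {
    setInterval(() => store.refresh("/service/leader", 10), 3000); // Refresh well before the TTL
  }
  return isLeader;
}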

When NOT to Use Consensus

Consensus algorithms add latency (every write requires a majority quorum) and complexity. Do not use them for high-throughput data storage or when eventual consistency is acceptable. Use consensus for coordination metadata (who is the leader? what is the partition map?) and use simpler replication for bulk data.
