
Apache Kafka Fundamentals

Learn Kafka's architecture, core concepts like topics, partitions, and brokers, and understand how distributed event streaming works

What is Apache Kafka?

Apache Kafka is a distributed event streaming platform capable of handling trillions of events per day. Originally developed at LinkedIn and open-sourced in 2011, Kafka has become the backbone of real-time data infrastructure at companies like Netflix, Uber, Airbnb, and thousands of others. It serves as a central nervous system for data, enabling applications to publish, subscribe to, store, and process streams of events in real time and at scale.

Unlike traditional message queues that delete messages after consumption, Kafka persists events to disk in an ordered, immutable log. This fundamental design choice enables multiple consumers to read the same data independently, replay historical events, and build diverse applications on top of a single stream of truth. Kafka is not just a message broker — it is a distributed commit log and the foundation for event-driven architectures.

Kafka Architecture

Kafka's architecture is built around a few core concepts that work together to provide durability, scalability, and high throughput. Understanding these concepts is essential before writing any Kafka code.

Core Components

  • Broker: A Kafka server that stores data and serves client requests. A Kafka cluster consists of multiple brokers for redundancy and scalability. Each broker can handle hundreds of thousands of reads and writes per second.
  • Topic: A named category or feed of events. Topics are the primary abstraction in Kafka — producers write to topics, consumers read from topics. Think of a topic as a table in a database, but append-only.
  • Partition: Topics are divided into partitions for parallelism. Each partition is an ordered, immutable sequence of events. Partitions are distributed across brokers and can be consumed in parallel.
  • Offset: A unique, sequential ID assigned to each event within a partition. Consumers track their position using offsets, which enables replay and is the foundation for Kafka's delivery guarantees, including exactly-once processing.
  • Producer: A client application that publishes events to Kafka topics.
  • Consumer: A client application that subscribes to topics and processes events.
  • Consumer Group: A group of consumers that cooperatively consume a topic. Each partition is assigned to exactly one consumer in the group, enabling parallel processing (a minimal consumer sketch follows this list).
  • ZooKeeper / KRaft: Manages cluster metadata and leader election. Newer Kafka versions replace ZooKeeper with the built-in KRaft consensus protocol, which is the default for new clusters and the only option from Kafka 4.0 onward, simplifying operations.
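
The consumer group behavior is easiest to see in code. Below is a minimal consumer sketch using the confluent_kafka client (the same library as the producer example later in this lesson); the broker address kafka:9092, the orders topic, and the group id order-processors are illustrative assumptions. Run two copies with the same group.id and Kafka splits the topic's partitions between them.

from confluent_kafka import Consumer

# Minimal consumer-group sketch (broker address, topic, and group id are
# illustrative assumptions)
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "order-processors",   # consumers sharing this id split the partitions
    "auto.offset.reset": "earliest",  # start from the oldest event if no offset is committed
})
consumer.subscribe(["orders"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)  # block up to 1s waiting for an event
        if msg is None:
            continue                      # no event within the timeout
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        print(f"partition={msg.partition()} offset={msg.offset()} "
              f"value={msg.value().decode('utf-8')}")
finally:
    consumer.close()  # commit final offsets and leave the group cleanly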

Topics and Partitions

A topic is a logical channel for events. When you create a topic, you specify the number of partitions. Each partition is stored on a broker and contains an ordered sequence of events. The partition count sets the ceiling on parallelism: a consumer group can have at most as many active consumers as partitions, and any extra consumers sit idle.

# Create a topic with 6 partitions and replication factor of 3
kafka-topics.sh --create \
    --bootstrap-server kafka1:9092 \
    --topic orders \
    --partitions 6 \
    --replication-factor 3

# Describe topic to see partition assignment
kafka-topics.sh --describe \
    --bootstrap-server kafka1:9092 \
    --topic orders

# Output:
# Topic: orders  PartitionCount: 6  ReplicationFactor: 3
# Partition: 0   Leader: 1   Replicas: 1,2,3   Isr: 1,2,3
# Partition: 1   Leader: 2   Replicas: 2,3,1   Isr: 2,3,1
# Partition: 2   Leader: 3   Replicas: 3,1,2   Isr: 3,1,2
# ...

# List all topics
kafka-topics.sh --list --bootstrap-server kafka1:9092

# Produce a test message
echo '{"order_id": 1, "amount": 29.99}' | kafka-console-producer.sh     --bootstrap-server kafka1:9092     --topic orders

# Consume messages from the beginning
kafka-console-consumer.sh \
    --bootstrap-server kafka1:9092 \
    --topic orders \
    --from-beginning
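
Once a consumer group is reading the topic, the kafka-consumer-groups.sh tool shows each member's committed offsets and lag (the group name order-processors below is a placeholder, and the output is illustrative):

# List consumer groups known to the cluster
kafka-consumer-groups.sh --list --bootstrap-server kafka1:9092

# Show committed offsets and lag per partition
kafka-consumer-groups.sh --describe \
    --bootstrap-server kafka1:9092 \
    --group order-processors

# Illustrative output (LAG = LOG-END-OFFSET minus CURRENT-OFFSET):
# GROUP             TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# order-processors  orders  0          120             125             5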

How Partitioning Works

Partitioning is Kafka's primary mechanism for scalability and parallelism. Events are distributed across partitions based on a partition key (typically a business identifier like customer_id or order_id). Events with the same key always go to the same partition, preserving ordering for that key.

from confluent_kafka import Producer
import json

producer = Producer({'bootstrap.servers': 'kafka:9092'})

# Events with the same key go to the same partition
# This ensures ordering per customer
events = [
    {"key": "customer-101", "value": {"order_id": 1, "amount": 29.99}},
    {"key": "customer-102", "value": {"order_id": 2, "amount": 49.99}},
    {"key": "customer-101", "value": {"order_id": 3, "amount": 19.99}},
    # order 1 and order 3 go to the same partition (same customer)
    # order 2 goes to a potentially different partition
]

for event in events:
    producer.produce(
        topic="orders",
        key=event["key"].encode("utf-8"),
        value=json.dumps(event["value"]).encode("utf-8"),
    )

producer.flush()
print(f"Produced {len(events)} events")

Replication and Fault Tolerance

Kafka replicates each partition across multiple brokers to ensure data durability. The replication factor determines how many copies of each partition exist. One replica is designated as the leader (handles all reads and writes), while others are followers that replicate the leader's data. If the leader fails, one of the in-sync replicas (ISR) is automatically promoted to leader.

Replication Concepts

  • Replication Factor: Number of copies of each partition. A factor of 3 means 3 brokers each hold a copy, so the data survives up to 2 broker failures.
  • Leader: The broker that handles all produce and consume requests for a partition.
  • In-Sync Replicas (ISR): Replicas that are fully caught up with the leader. Only ISR members can become the new leader.
  • Acknowledgment (acks): Producers can choose acks=0 (fire and forget), acks=1 (leader confirms), or acks=all (all ISR confirm) for different durability guarantees.
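
To see how acks shows up in client code, here is a durability-oriented variant of the earlier producer (a sketch; acks and enable.idempotence are standard librdkafka settings, and the callback simply logs where each event landed):

from confluent_kafka import Producer

# Producer tuned for durability rather than raw throughput
producer = Producer({
    "bootstrap.servers": "kafka:9092",  # illustrative broker address
    "acks": "all",                      # wait for all in-sync replicas to confirm
    "enable.idempotence": True,         # retries cannot create duplicates
})

def on_delivery(err, msg):
    # Invoked from poll()/flush() once the broker responds
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()}[{msg.partition()}] at offset {msg.offset()}")

producer.produce(
    topic="orders",
    key=b"customer-101",
    value=b'{"order_id": 4, "amount": 12.50}',
    on_delivery=on_delivery,
)
producer.flush()  # wait for the acknowledgment before exiting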

Kafka Configuration

# docker-compose.yml for a 3-broker Kafka cluster with KRaft
version: '3.8'
services:
  kafka1:
    image: confluentinc/cp-kafka:7.6.0
    hostname: kafka1
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka1:9093,2@kafka2:9093,3@kafka3:9093
      KAFKA_LISTENERS: PLAINTEXT://kafka1:9092,CONTROLLER://kafka1:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka1:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT  # CONTROLLER is not in the default map
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_LOG_RETENTION_HOURS: 168       # 7 days retention
      KAFKA_LOG_SEGMENT_BYTES: 1073741824  # 1 GB segments
      KAFKA_NUM_PARTITIONS: 6              # Default partitions for new topics
      KAFKA_DEFAULT_REPLICATION_FACTOR: 3
      KAFKA_MIN_INSYNC_REPLICAS: 2         # Require 2 in-sync replicas for acks=all
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'

  kafka2:
    image: confluentinc/cp-kafka:7.6.0
    hostname: kafka2
    ports:
      - "9093:9092"
    environment:
      KAFKA_NODE_ID: 2
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka1:9093,2@kafka2:9093,3@kafka3:9093
      KAFKA_LISTENERS: PLAINTEXT://kafka2:9092,CONTROLLER://kafka2:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka2:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'

  kafka3:
    image: confluentinc/cp-kafka:7.6.0
    hostname: kafka3
    ports:
      - "9094:9092"
    environment:
      KAFKA_NODE_ID: 3
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka1:9093,2@kafka2:9093,3@kafka3:9093
      KAFKA_LISTENERS: PLAINTEXT://kafka3:9092,CONTROLLER://kafka3:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka3:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
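
A quick way to smoke-test the cluster (assuming Docker Compose v2; the Confluent image ships the Kafka CLI on its PATH without the .sh suffix):

# Start the three brokers in the background
docker compose up -d

# Run the topic tool inside the kafka1 container to confirm the cluster is up
docker compose exec kafka1 kafka-topics --list --bootstrap-server kafka1:9092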

Kafka Use Cases

Use Case              Description                                     Example
Event Streaming       Central event bus for microservices             Order placed, payment processed, shipment sent
Log Aggregation       Collect logs from distributed systems           Application logs to Elasticsearch
Change Data Capture   Stream database changes to downstream systems   Debezium captures PostgreSQL WAL changes
Stream Processing     Real-time transformations and analytics         Kafka Streams aggregating click events
Data Integration      Connect disparate systems                       Kafka Connect syncing databases

Key Takeaways

  • Kafka is a distributed event streaming platform that persists events in an immutable, ordered log
  • Topics are divided into partitions for parallelism; events with the same key go to the same partition
  • Replication across brokers ensures fault tolerance — a replication factor of 3 tolerates 2 broker failures
  • Consumer groups enable parallel processing; each partition is consumed by exactly one consumer in a group
  • Kafka serves as the backbone for event-driven architectures, CDC, log aggregation, and real-time analytics
  • KRaft mode is replacing ZooKeeper for simpler cluster management and faster recovery
