TechLead
Lesson 7 of 22
6 min read
Data Engineering

Kafka Streams and Stream Processing

Build real-time stream processing applications with Kafka Streams, including stateful operations, windowing, and joins

What is Kafka Streams?

Kafka Streams is a client library for building real-time stream processing applications that read from and write to Kafka topics. Unlike external processing frameworks like Spark or Flink, Kafka Streams runs as part of your application — no separate cluster is needed. It is lightweight, fault-tolerant, and scales by simply running more instances of your application.

Kafka Streams provides two APIs: the high-level DSL (Domain Specific Language) for common operations like filter, map, join, and aggregate, and the lower-level Processor API for custom processing logic. For Python developers, Faust and the confluent-kafka library provide similar stream processing capabilities. We will focus on Python-based stream processing patterns that mirror Kafka Streams concepts.

Kafka Streams vs Other Frameworks

  • Kafka Streams: Library (no separate cluster), exactly-once, Java/Kotlin native. Best for Kafka-centric applications.
  • Apache Flink: Full framework with managed state, event-time processing, complex windowing. Best for advanced stream processing at large scale.
  • Spark Structured Streaming: Micro-batch model, good for teams already using Spark. Best for near-real-time analytics.
  • Faust (Python): Python library inspired by Kafka Streams. Best for Python-native teams that need stream processing.

Stream Processing Concepts

Before building stream processors, you need to understand several core concepts that apply across all frameworks:

  • KStream: An unbounded sequence of events (an event stream). Each event is independent — the stream represents a log of everything that happened.
  • KTable: A changelog stream where each event represents an update to a key. Only the latest value for each key is retained — like a materialized view of a database table.
  • Stateless Operations: Operations that do not need to remember previous events — filter, map, flatMap. These are simple and easy to parallelize.
  • Stateful Operations: Operations that maintain state across events — count, sum, join, windowed aggregation. These require fault-tolerant state stores.
  • Windowing: Grouping events by time ranges for aggregation — tumbling windows, sliding windows, session windows.
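The KStream/KTable distinction above can be illustrated in plain Python (hypothetical events, not the Faust API): a stream keeps every event as an append-only log, while a table keeps only the latest value per key.

```python
# Hypothetical keyed events: (key, value) pairs arriving in order
events = [
    ("user-1", "logged_in"),
    ("user-2", "logged_in"),
    ("user-1", "logged_out"),
]

# KStream view: every event is retained -- an append-only log
kstream = list(events)

# KTable view: only the latest value per key is retained,
# like a materialized view keyed by user
ktable = {}
for key, value in events:
    ktable[key] = value

print(kstream)  # all three events
print(ktable)   # {'user-1': 'logged_out', 'user-2': 'logged_in'}
```

The same duality underlies stream-table joins later in this lesson: the table side is just the "latest value per key" reduction of a changelog stream.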

Stateless Stream Processing

Stateless operations transform each event independently without reference to other events. They are the simplest and most common stream processing patterns:

import faust

app = faust.App(
    'order-processor',
    broker='kafka://kafka:9092',
    value_serializer='json',
)

# Define event models
class Order(faust.Record):
    order_id: int
    customer_id: int
    amount: float
    product: str
    status: str

# Source topic
orders_topic = app.topic('orders', value_type=Order)

# Output topics
high_value_topic = app.topic('high-value-orders', value_type=Order)
enriched_topic = app.topic('enriched-orders', value_type=dict)

# Filter: Route high-value orders to a separate topic
@app.agent(orders_topic)
async def filter_high_value(orders):
    async for order in orders:
        if order.amount > 100.0:
            await high_value_topic.send(value=order)

# Map: Enrich orders with additional fields
@app.agent(orders_topic)
async def enrich_orders(orders):
    async for order in orders:
        enriched = {
            "order_id": order.order_id,
            "customer_id": order.customer_id,
            "amount": order.amount,
            "product": order.product,
            "status": order.status,
            "amount_tier": "high" if order.amount > 100 else "standard",
            "needs_review": order.amount > 500,
        }
        await enriched_topic.send(value=enriched)

Stateful Stream Processing

Stateful operations maintain state across events. This enables aggregations (counting events, summing values), joins (combining events from different streams), and pattern detection. Stateful processing is where stream processing becomes truly powerful — and more complex.

import faust
from datetime import timedelta

app = faust.App(
    'order-analytics',
    broker='kafka://kafka:9092',
    value_serializer='json',
)

class Order(faust.Record):
    order_id: int
    customer_id: int
    amount: float
    product: str

orders_topic = app.topic('orders', value_type=Order)

# Stateful: Running count and sum per customer using a Faust table
customer_stats = app.Table(
    'customer-stats',
    default=lambda: {"count": 0, "total": 0.0},
    partitions=6,
)

@app.agent(orders_topic)
async def track_customer_stats(orders):
    async for order in orders:
        customer_id = str(order.customer_id)

        # Read-modify-write, then reassign so the update is recorded
        # in the table's changelog topic
        stats = customer_stats[customer_id]
        stats["count"] += 1
        stats["total"] += order.amount
        customer_stats[customer_id] = stats

        # Alert if customer spending exceeds threshold
        if stats["total"] > 10000:
            print(f"High-spend alert: customer {customer_id} "
                  f"has spent {stats['total']:.2f} across {stats['count']} orders")

# Tumbling window: Revenue per product per hour
product_revenue = app.Table(
    'product-revenue',
    default=float,
).tumbling(timedelta(hours=1), expires=timedelta(days=1))

@app.agent(orders_topic)
async def track_product_revenue(orders):
    async for order in orders:
        product_revenue[order.product] += order.amount

# Expose current windowed stats via a web endpoint
@app.page('/stats/{product}/')
async def get_product_stats(web, request, product):
    # Outside a stream there is no current event, so read the window
    # containing the current system time with .now()
    current = product_revenue[product].now()
    return web.json({"product": product, "hourly_revenue": current})

Windowing Strategies

Windowing is how stream processors group unbounded data into finite chunks for aggregation. The choice of window type depends on your use case:

  • Tumbling: Fixed-size, non-overlapping windows (e.g., every 5 minutes). Use case: hourly revenue reports, periodic aggregations.
  • Hopping/Sliding: Fixed-size windows that overlap (e.g., a 10-minute window advancing every 5 minutes). Use case: moving averages, trend detection.
  • Session: Dynamic windows based on activity gaps (e.g., a 30-minute inactivity gap). Use case: user session analysis, clickstream.
  • Global: A single window encompassing all events. Use case: all-time counters, running totals.
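Window assignment is plain arithmetic over timestamps. A minimal sketch (plain Python with hypothetical helper names, independent of any framework) of how an event maps to tumbling and hopping windows:

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def tumbling_window(ts: datetime, size: timedelta) -> datetime:
    """Start of the single tumbling window that contains ts."""
    offset = (ts - EPOCH) % size
    return ts - offset

def hopping_windows(ts: datetime, size: timedelta, hop: timedelta) -> list:
    """Start times of every hopping window that contains ts."""
    starts = []
    start = tumbling_window(ts, hop)  # latest hop-aligned start <= ts
    while start > ts - size:          # window [start, start + size) covers ts
        starts.append(start)
        start -= hop
    return sorted(starts)

ts = datetime(2024, 1, 1, 12, 7)
print(tumbling_window(ts, timedelta(minutes=5)))
# -> 2024-01-01 12:05:00 (exactly one tumbling window)
print(hopping_windows(ts, timedelta(minutes=10), timedelta(minutes=5)))
# -> [12:00, 12:05] (overlapping windows both contain the event)
```

Note that a tumbling window is just the special case of a hopping window where the hop equals the window size, so each event lands in exactly one window.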

Stream-Table Joins

One of the most powerful stream processing patterns is joining a stream of events with a table of reference data: for example, enriching an order stream with customer details from a customer table.

# Stream-Table Join: Enrich orders with customer data
# The customer table is built from a compacted Kafka topic

class Customer(faust.Record):
    customer_id: int
    name: str
    segment: str
    country: str

customers_topic = app.topic('customers', value_type=Customer)
customers_table = app.Table('customers', value_type=Customer)

# Populate the customer table from the customers topic (compacted log)
@app.agent(customers_topic)
async def update_customers(customers):
    async for customer in customers:
        customers_table[str(customer.customer_id)] = customer

# Join: Enrich each order with customer data from the table
@app.agent(orders_topic)
async def enrich_with_customer(orders):
    async for order in orders:
        customer = customers_table.get(str(order.customer_id))
        enriched = {
            "order_id": order.order_id,
            "customer_id": order.customer_id,
            "amount": order.amount,
            "product": order.product,
            "customer_name": customer.name if customer else "Unknown",
            "customer_segment": customer.segment if customer else "Unknown",
            "country": customer.country if customer else "Unknown",
        }
        await enriched_topic.send(value=enriched)
        print(f"Enriched order {order.order_id} with customer {enriched['customer_name']}")

Exactly-Once Processing

Delivery Guarantees

  • At-Most-Once: Events may be lost but are never processed twice. Achieved by committing offsets before processing. Rarely acceptable for business data.
  • At-Least-Once: Events are never lost but may be processed more than once. Achieved by committing offsets after processing. Most common guarantee — requires idempotent processing.
  • Exactly-Once: Events are processed exactly once. Achieved through Kafka transactions that atomically commit offsets and output messages. Available in Kafka Streams and Flink natively.
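Since at-least-once is the most common guarantee, redeliveries must be made harmless through idempotent processing. A minimal sketch in plain Python (the in-memory `seen_ids` set is a hypothetical stand-in for a durable dedupe store such as a compacted table or database; a real consumer would commit offsets only after the loop completes):

```python
def process_batch(events, seen_ids, totals):
    """Apply each event at most once, even if the batch is redelivered."""
    for event in events:
        if event["order_id"] in seen_ids:
            continue  # duplicate redelivery after a crash -- safe to skip
        customer = event["customer_id"]
        totals[customer] = totals.get(customer, 0.0) + event["amount"]
        seen_ids.add(event["order_id"])
    # Offsets would be committed here, *after* processing (at-least-once)

seen_ids, totals = set(), {}
batch = [
    {"order_id": 1, "customer_id": "c1", "amount": 50.0},
    {"order_id": 2, "customer_id": "c1", "amount": 25.0},
]
process_batch(batch, seen_ids, totals)
process_batch(batch, seen_ids, totals)  # simulated redelivery
print(totals)  # {'c1': 75.0} -- the duplicate delivery changed nothing
```

Exactly-once semantics move this burden from your code into the framework: Kafka transactions commit the consumed offsets and the produced output as one atomic unit.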

Key Takeaways

  • Kafka Streams is a lightweight library for stream processing — no separate cluster needed
  • Stateless operations (filter, map) are simple; stateful operations (aggregation, joins) require state management
  • Windowing groups unbounded data into finite chunks — choose tumbling, sliding, or session windows based on your use case
  • Stream-table joins enrich events with reference data in real-time
  • Exactly-once processing is achievable through Kafka transactions but adds complexity and latency
  • Python alternatives like Faust provide similar concepts for Python-native stream processing
