
Batch vs Stream Processing

Understand the differences between batch and stream processing, when to use each, and how to combine them in modern data architectures

Two Paradigms of Data Processing

Data processing can be broadly categorized into two paradigms: batch processing and stream processing. Batch processing operates on large, bounded datasets — collecting data over a period of time and processing it all at once. Stream processing operates on data as it arrives, processing each event individually or in micro-batches with minimal latency. Every data engineer must understand both paradigms and know when each is appropriate.

The choice between batch and stream processing affects every layer of your architecture: ingestion, storage, transformation, and serving. It also impacts cost, complexity, and the freshness of data available to your users. Most production systems use a combination of both approaches.

Batch Processing

Batch processing collects data over a defined period (hourly, daily, weekly) and processes it as a single unit. This is the traditional approach to data processing and remains the backbone of most analytics platforms. Batch jobs typically run on a schedule — a nightly job that processes the previous day's data, an hourly aggregation of clickstream events, or a weekly report generation pipeline.

Batch Processing Characteristics

  • Bounded Data: Processes a finite, well-defined dataset (e.g., "all orders from yesterday")
  • High Throughput: Optimized for processing large volumes efficiently — can crunch terabytes in a single run
  • Higher Latency: Data is not available until the batch completes — typically minutes to hours of delay
  • Simpler Error Handling: Failed batches can be retried entirely; easier to reason about correctness
  • Cost Efficient: Compute resources are used only during batch windows; can use spot/preemptible instances
  • Mature Tooling: Well-established tools and patterns — Spark, SQL, dbt, Airflow
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, count, countDistinct, sum as spark_sum, avg
)

spark = (SparkSession.builder
    .appName("DailyOrderBatch")
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

# Read yesterday's orders from the data lake
yesterday = "2025-03-15"
orders = spark.read.parquet(f"s3://data-lake/raw/orders/date={yesterday}/")
products = spark.read.parquet("s3://data-lake/silver/products/")
customers = spark.read.parquet("s3://data-lake/silver/customers/")

# Batch transformation: clean, enrich, aggregate
clean_orders = (orders
    .dropDuplicates(["order_id"])
    .filter(col("status") != "cancelled")
    .filter(col("amount") > 0)
)

# Enrich with dimensions
enriched = (clean_orders
    .join(customers, "customer_id", "left")
    .join(products, "product_id", "left")
    .select(
        "order_id", "customer_id", "customer_name",
        "customer_segment", "product_id", "product_name",
        "category", "amount", "quantity", "order_date"
    )
)

# Aggregate for the daily summary table
daily_summary = enriched.groupBy("category", "customer_segment").agg(
    count("order_id").alias("total_orders"),
    spark_sum("amount").alias("total_revenue"),
    avg("amount").alias("avg_order_value"),
    count("customer_id").alias("unique_customers"),
)

# Write to the warehouse / gold layer
(daily_summary.write
    .mode("overwrite")
    .partitionBy("category")
    .parquet(f"s3://data-lake/gold/daily_summary/date={yesterday}/")
)

print(f"Batch complete: processed {clean_orders.count()} orders for {yesterday}")

Stream Processing

Stream processing handles data as it arrives, event by event or in micro-batches (sub-second to seconds). Instead of waiting for data to accumulate, stream processors react to each event in near real-time. This enables use cases like fraud detection, real-time dashboards, live recommendations, and instant alerting that are impractical with batch processing.

Stream Processing Characteristics

  • Unbounded Data: Processes a continuous, never-ending stream of events
  • Low Latency: Data is processed within milliseconds to seconds of arrival
  • Event-at-a-Time: Each event is processed individually (or in small micro-batches)
  • Windowing: Aggregations use time windows (tumbling, sliding, session) to group events
  • Complex State Management: Maintaining state across events requires careful handling of checkpoints and exactly-once semantics
  • Always-On: Stream processors run continuously, requiring robust monitoring and fault tolerance
from confluent_kafka import Consumer, Producer
import json
from datetime import datetime, timedelta
from collections import defaultdict

# Stream processor: Real-time fraud detection
class FraudDetector:
    def __init__(self):
        self.consumer = Consumer({
            'bootstrap.servers': 'kafka:9092',
            'group.id': 'fraud-detector',
            'auto.offset.reset': 'latest',
        })
        self.producer = Producer({'bootstrap.servers': 'kafka:9092'})
        self.consumer.subscribe(['transactions'])

        # In-memory state: track transactions per customer
        self.customer_transactions = defaultdict(list)
        self.VELOCITY_WINDOW = timedelta(minutes=5)
        self.VELOCITY_THRESHOLD = 5  # Max 5 txns in 5 minutes
        self.AMOUNT_THRESHOLD = 10000  # Flag large transactions

    def process_event(self, event: dict):
        """Process each transaction event in real-time."""
        customer_id = event["customer_id"]
        amount = event["amount"]
        timestamp = datetime.fromisoformat(event["timestamp"])

        # Clean old transactions from the window
        cutoff = timestamp - self.VELOCITY_WINDOW
        self.customer_transactions[customer_id] = [
            t for t in self.customer_transactions[customer_id]
            if t["timestamp"] > cutoff
        ]

        # Add current transaction
        self.customer_transactions[customer_id].append({
            "amount": amount,
            "timestamp": timestamp,
        })

        # Check fraud rules
        alerts = []
        window_txns = self.customer_transactions[customer_id]

        if len(window_txns) > self.VELOCITY_THRESHOLD:
            alerts.append(f"High velocity: {len(window_txns)} transactions in 5 min")

        if amount > self.AMOUNT_THRESHOLD:
            alerts.append(f"Large transaction: {amount}")

        if alerts:
            alert = {
                "customer_id": customer_id,
                "transaction_id": event["transaction_id"],
                "amount": amount,
                "alerts": alerts,
                "timestamp": timestamp.isoformat(),
            }
            self.producer.produce(
                "fraud-alerts",
                value=json.dumps(alert).encode()
            )
            self.producer.flush()

    def run(self):
        """Main event loop — runs continuously."""
        print("Starting fraud detection stream processor...")
        while True:
            msg = self.consumer.poll(timeout=1.0)
            if msg is None:
                continue
            if msg.error():
                print(f"Consumer error: {msg.error()}")
                continue

            event = json.loads(msg.value().decode())
            self.process_event(event)

detector = FraudDetector()
detector.run()

Batch vs Stream: Detailed Comparison

Dimension      | Batch Processing                            | Stream Processing
---------------|---------------------------------------------|--------------------------------------------------
Data           | Bounded, finite datasets                    | Unbounded, continuous events
Latency        | Minutes to hours                            | Milliseconds to seconds
Throughput     | Very high (optimized for volume)            | High (but per-event overhead)
Complexity     | Lower — simpler to debug and reason about   | Higher — state management, ordering, windowing
Error Handling | Retry entire batch                          | Dead-letter queues, checkpointing
Cost           | Pay for batch window only                   | Always-on compute costs
Use Cases      | Reports, ML training, data warehouse loads  | Fraud detection, real-time dashboards, alerting
Tools          | Spark, dbt, SQL, Airflow                    | Kafka Streams, Flink, Spark Structured Streaming

The Lambda Architecture

The Lambda Architecture combines batch and stream processing to serve both historical and real-time queries. It runs two parallel pipelines: a batch layer that processes complete historical data for accuracy, and a speed layer that processes real-time events for freshness. A serving layer merges results from both.

# Lambda Architecture Layout
#
# Source Data ──┬──> Batch Layer (Spark, daily jobs)
#               │      └──> Batch Views (warehouse tables)
#               │                    │
#               │              Serving Layer ──> Queries
#               │                    │
#               └──> Speed Layer (Kafka Streams, real-time)
#                      └──> Real-time Views (Redis, live tables)
#
# Pros: Accurate historical data + real-time freshness
# Cons: Two codebases to maintain, complexity of merging results

# Example: Batch layer (runs nightly)
batch_daily_revenue:
  schedule: "0 2 * * *"  # 2 AM daily
  query: |
    INSERT INTO analytics.daily_revenue
    SELECT date, SUM(amount), COUNT(*)
    FROM raw.orders
    WHERE date = CURRENT_DATE - 1
    GROUP BY date

# Speed layer (runs continuously)
stream_live_revenue:
  source: kafka://orders
  processor: aggregate_revenue
  sink: redis://live-metrics/revenue
  window: tumbling_1_minute
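
The serving layer is where the two views meet: completed days come from the batch table, and the current day's running total comes from the real-time view. A minimal sketch of that merge, assuming the analytics.daily_revenue table and the live-metrics/revenue Redis key from the config above — the cursor object and column names are illustrative:

from datetime import date
import redis  # assumed client for the real-time view

def revenue_by_day(warehouse_cursor, days: int = 7) -> dict:
    """Serving-layer merge: batch view for full days, speed layer for today."""
    # Batch view: accurate totals for completed days (column names assumed)
    warehouse_cursor.execute(
        "SELECT date, total_revenue FROM analytics.daily_revenue "
        "WHERE date >= CURRENT_DATE - %s ORDER BY date",
        (days,),
    )
    merged = {str(day): revenue for day, revenue in warehouse_cursor.fetchall()}

    # Speed view: today's running total maintained by the streaming job
    live = redis.Redis(host="redis", port=6379).get("live-metrics/revenue")
    if live is not None:
        merged[str(date.today())] = float(live.decode())

    return merged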

The Kappa Architecture

The Kappa Architecture simplifies Lambda by using only a stream processing layer. All data — both historical and real-time — is processed through the same streaming pipeline. Historical reprocessing is done by replaying events from the stream (e.g., Kafka with long retention). This eliminates the need to maintain two separate codebases.

  • Single Codebase: One processing pipeline handles both real-time and historical data
  • Event Replay: Reprocess historical data by replaying Kafka topics from the beginning (see the sketch after this list)
  • Simpler Operations: One system to deploy, monitor, and debug instead of two
  • Trade-off: Reprocessing can be slower and more expensive than batch for very large historical datasets
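
A minimal sketch of an event replay using the same confluent_kafka client as the fraud-detection example above: a fresh consumer group with auto.offset.reset set to "earliest" re-reads the retained history of the transactions topic through the existing processing logic. The replay group ID is hypothetical:

import json
from confluent_kafka import Consumer

# A new group.id has no committed offsets, so auto.offset.reset='earliest'
# makes the consumer start from the oldest retained event in the topic.
replay_consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'fraud-detector-replay-2025-03',  # hypothetical replay group
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False,
})
replay_consumer.subscribe(['transactions'])

while True:
    msg = replay_consumer.poll(timeout=1.0)
    if msg is None:
        continue  # caught up (or nothing retained yet); keep polling or break
    if msg.error():
        continue
    # Feed historical events through the same logic used for live traffic
    # (detector is the FraudDetector instance from the streaming example)
    detector.process_event(json.loads(msg.value().decode()))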

Micro-Batch: The Middle Ground

Micro-batching processes data in very small batches (every few seconds to minutes), providing near-real-time latency with batch-like simplicity. Spark Structured Streaming is the most popular micro-batch framework:

from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    from_json, col, window, count, sum as spark_sum
)
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("MicroBatch").getOrCreate()

# Define schema for incoming Kafka events
schema = (StructType()
    .add("order_id", StringType())
    .add("customer_id", StringType())
    .add("amount", DoubleType())
    .add("timestamp", TimestampType())
)

# Read from Kafka as a streaming DataFrame
orders_stream = (spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "orders")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Tumbling window aggregation: revenue per 5-minute window
windowed_revenue = (orders_stream
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window("timestamp", "5 minutes"))
    .agg(
        count("order_id").alias("order_count"),
        spark_sum("amount").alias("total_revenue"),
    )
)

# Write results to a sink (console, database, or another Kafka topic)
query = (windowed_revenue
    .writeStream
    .outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoint/revenue")
    .trigger(processingTime="30 seconds")  # Micro-batch every 30 seconds
    .start()
)

query.awaitTermination()

Choosing the Right Approach

Decision Framework

  • Use Batch when: Data freshness of hours is acceptable, you need to process large historical datasets, cost efficiency is a priority, or you are building data warehouse models with dbt
  • Use Streaming when: You need sub-second latency for fraud detection, real-time dashboards, instant notifications, or live operational systems
  • Use Micro-Batch when: You need near-real-time (seconds to minutes) but want simpler operations than true streaming — a good middle ground for most analytics use cases
  • Use Both when: You need real-time operational views AND accurate historical analytics — stream for freshness, batch for correctness

Key Takeaways

  • Batch processing handles bounded datasets with high throughput but higher latency — ideal for analytics and ML training
  • Stream processing handles unbounded events with low latency but higher complexity — ideal for real-time applications
  • Micro-batching provides a practical middle ground with near-real-time latency and batch-like simplicity
  • Lambda Architecture runs both batch and stream in parallel; Kappa simplifies to stream-only with replay
  • Most production systems use batch for the majority of workloads and streaming for time-sensitive use cases
  • Start with batch — only add streaming when you have a genuine real-time requirement
