Two Paradigms of Data Processing
Data processing can be broadly categorized into two paradigms: batch processing and stream processing. Batch processing operates on large, bounded datasets — collecting data over a period of time and processing it all at once. Stream processing operates on data as it arrives, processing each event individually or in micro-batches with minimal latency. Every data engineer must understand both paradigms and know when each is appropriate.
The choice between batch and stream processing affects every layer of your architecture: ingestion, storage, transformation, and serving. It also impacts cost, complexity, and the freshness of data available to your users. Most production systems use a combination of both approaches.
Batch Processing
Batch processing collects data over a defined period (hourly, daily, weekly) and processes it as a single unit. This is the traditional approach to data processing and remains the backbone of most analytics platforms. Batch jobs typically run on a schedule — a nightly job that processes the previous day's data, an hourly aggregation of clickstream events, or a weekly report generation pipeline.
Batch Processing Characteristics
- Bounded Data: Processes a finite, well-defined dataset (e.g., "all orders from yesterday")
- High Throughput: Optimized for processing large volumes efficiently — can crunch terabytes in a single run
- Higher Latency: Data is not available until the batch completes — typically minutes to hours of delay
- Simpler Error Handling: Failed batches can be retried entirely; easier to reason about correctness
- Cost Efficient: Compute resources are used only during batch windows; can use spot/preemptible instances
- Mature Tooling: Well-established tools and patterns — Spark, SQL, dbt, Airflow
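The example below sketches a typical nightly batch job in PySpark: read yesterday's raw orders, clean and enrich them, aggregate, and write a gold-layer summary.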
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, count, countDistinct, sum as spark_sum, avg,
)
spark = (SparkSession.builder
    .appName("DailyOrderBatch")
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)
# Read yesterday's orders from the data lake. The date is hardcoded for the
# example; a scheduler like Airflow would inject the run date here.
yesterday = "2025-03-15"
orders = spark.read.parquet(f"s3://data-lake/raw/orders/date={yesterday}/")
products = spark.read.parquet("s3://data-lake/silver/products/")
customers = spark.read.parquet("s3://data-lake/silver/customers/")
# Batch transformation: clean, enrich, aggregate
clean_orders = (orders
.dropDuplicates(["order_id"])
.filter(col("status") != "cancelled")
.filter(col("amount") > 0)
)
# Enrich with dimensions
enriched = (clean_orders
.join(customers, "customer_id", "left")
.join(products, "product_id", "left")
.select(
"order_id", "customer_id", "customer_name",
"customer_segment", "product_id", "product_name",
"category", "amount", "quantity", "order_date"
)
)
# Aggregate for the daily summary table
daily_summary = enriched.groupBy("category", "customer_segment").agg(
count("order_id").alias("total_orders"),
spark_sum("amount").alias("total_revenue"),
avg("amount").alias("avg_order_value"),
count("customer_id").alias("unique_customers"),
)
# Write to the warehouse / gold layer
(daily_summary.write
    .mode("overwrite")
    .partitionBy("category")
    .parquet(f"s3://data-lake/gold/daily_summary/date={yesterday}/")
)
print(f"Batch complete: processed {clean_orders.count()} orders for {yesterday}")
Stream Processing
Stream processing handles data as it arrives, event by event or in micro-batches (sub-second to seconds). Instead of waiting for data to accumulate, stream processors react to each event in near real-time. This enables use cases like fraud detection, real-time dashboards, live recommendations, and instant alerting that batch latency makes impractical.
Stream Processing Characteristics
- Unbounded Data: Processes a continuous, never-ending stream of events
- Low Latency: Data is processed within milliseconds to seconds of arrival
- Event-at-a-Time: Each event is processed individually (or in small micro-batches)
- Windowing: Aggregations use time windows (tumbling, sliding, session) to group events
- Complex State Management: Maintaining state across events requires careful handling of checkpoints and exactly-once semantics
- Always-On: Stream processors run continuously, requiring robust monitoring and fault tolerance
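The sketch below implements a simple stream processor with the confluent_kafka client: it consumes transaction events, keeps a five-minute sliding window of state per customer, and emits fraud alerts to a second topic.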
from confluent_kafka import Consumer, Producer
import json
from datetime import datetime, timedelta
from collections import defaultdict
# Stream processor: Real-time fraud detection
class FraudDetector:
def __init__(self):
self.consumer = Consumer({
'bootstrap.servers': 'kafka:9092',
'group.id': 'fraud-detector',
'auto.offset.reset': 'latest',
})
self.producer = Producer({'bootstrap.servers': 'kafka:9092'})
self.consumer.subscribe(['transactions'])
        # In-memory state: track recent transactions per customer.
        # Note: this state is lost on restart; production processors persist
        # it with checkpoints (see "Complex State Management" above).
        self.customer_transactions = defaultdict(list)
self.VELOCITY_WINDOW = timedelta(minutes=5)
self.VELOCITY_THRESHOLD = 5 # Max 5 txns in 5 minutes
self.AMOUNT_THRESHOLD = 10000 # Flag large transactions
def process_event(self, event: dict):
"""Process each transaction event in real-time."""
customer_id = event["customer_id"]
amount = event["amount"]
timestamp = datetime.fromisoformat(event["timestamp"])
# Clean old transactions from the window
cutoff = timestamp - self.VELOCITY_WINDOW
self.customer_transactions[customer_id] = [
t for t in self.customer_transactions[customer_id]
if t["timestamp"] > cutoff
]
# Add current transaction
self.customer_transactions[customer_id].append({
"amount": amount,
"timestamp": timestamp,
})
# Check fraud rules
alerts = []
window_txns = self.customer_transactions[customer_id]
if len(window_txns) > self.VELOCITY_THRESHOLD:
alerts.append(f"High velocity: {len(window_txns)} transactions in 5 min")
if amount > self.AMOUNT_THRESHOLD:
alerts.append(f"Large transaction: {amount}")
if alerts:
alert = {
"customer_id": customer_id,
"transaction_id": event["transaction_id"],
"amount": amount,
"alerts": alerts,
"timestamp": timestamp.isoformat(),
}
self.producer.produce(
"fraud-alerts",
value=json.dumps(alert).encode()
)
            # flush() blocks until delivery; fine for a demo, though batching
            # deliveries with poll(0) is more typical in production
            self.producer.flush()
def run(self):
"""Main event loop — runs continuously."""
print("Starting fraud detection stream processor...")
while True:
msg = self.consumer.poll(timeout=1.0)
if msg is None:
continue
if msg.error():
print(f"Consumer error: {msg.error()}")
continue
event = json.loads(msg.value().decode())
self.process_event(event)
if __name__ == "__main__":
    detector = FraudDetector()
    detector.run()
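To exercise the detector end to end, publish a burst of test events to the transactions topic. A minimal sketch, assuming a local broker at kafka:9092 and the topic and field names used above (the customer and transaction IDs are made up):
from confluent_kafka import Producer
import json
from datetime import datetime, timezone

producer = Producer({'bootstrap.servers': 'kafka:9092'})
# Six rapid transactions for one customer: the sixth should trip the
# velocity rule (more than 5 transactions inside the 5-minute window).
for i in range(6):
    event = {
        "transaction_id": f"txn-{i}",
        "customer_id": "cust-42",
        "amount": 250.0,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    producer.produce("transactions", value=json.dumps(event).encode())
producer.flush()
Alerts for the later events should then appear on the fraud-alerts topic almost immediately, illustrating the event-at-a-time latency described above.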
Batch vs Stream: Detailed Comparison
| Dimension | Batch Processing | Stream Processing |
|---|---|---|
| Data | Bounded, finite datasets | Unbounded, continuous events |
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high (optimized for volume) | High (but per-event overhead) |
| Complexity | Lower — simpler to debug and reason about | Higher — state management, ordering, windowing |
| Error Handling | Retry entire batch | Dead-letter queues, checkpointing |
| Cost | Pay for batch window only | Always-on compute costs |
| Use Cases | Reports, ML training, data warehouse loads | Fraud detection, real-time dashboards, alerting |
| Tools | Spark, dbt, SQL, Airflow | Kafka Streams, Flink, Spark Structured Streaming |
The Lambda Architecture
The Lambda Architecture combines batch and stream processing to serve both historical and real-time queries. It runs two parallel pipelines: a batch layer that processes complete historical data for accuracy, and a speed layer that processes real-time events for freshness. A serving layer merges results from both.
# Lambda Architecture Layout
#
# Source Data ──┬──> Batch Layer (Spark, daily jobs)
#               │        └──> Batch Views (warehouse tables)
#               │                     │
#               │              Serving Layer ──> Queries
#               │                     │
#               └──> Speed Layer (Kafka Streams, real-time)
#                        └──> Real-time Views (Redis, live tables)
#
# Pros: Accurate historical data + real-time freshness
# Cons: Two codebases to maintain, complexity of merging results
# Example: Batch layer (runs nightly)
batch_daily_revenue:
schedule: "0 2 * * *" # 2 AM daily
query: |
INSERT INTO analytics.daily_revenue
SELECT date, SUM(amount), COUNT(*)
FROM raw.orders
WHERE date = CURRENT_DATE - 1
GROUP BY date
# Speed layer (runs continuously)
stream_live_revenue:
source: kafka://orders
processor: aggregate_revenue
sink: redis://live-metrics/revenue
window: tumbling_1_minute
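The serving layer is where the two pipelines meet: queries for closed days hit the accurate batch views, while queries for the current day fall back to the speed layer. A minimal sketch of that merge logic, using in-memory dicts to stand in for the warehouse table and the Redis live view (the dates and figures are made up):
# Batch view: accurate, but only covers fully processed days.
batch_view = {"2025-03-14": 120000.0, "2025-03-15": 98500.0}
# Speed view: approximate, but covers the current day in near real-time.
speed_view = {"2025-03-16": 41230.0}

def revenue_for(date_str: str) -> float:
    """Prefer the batch view; fall back to the live view for open days."""
    if date_str in batch_view:
        return batch_view[date_str]
    return speed_view.get(date_str, 0.0)

print(revenue_for("2025-03-15"))  # served from the batch layer
print(revenue_for("2025-03-16"))  # served from the speed layer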
The Kappa Architecture
The Kappa Architecture simplifies Lambda by using only a stream processing layer. All data — both historical and real-time — is processed through the same streaming pipeline. Historical reprocessing is done by replaying events from the stream (e.g., Kafka with long retention). This eliminates the need to maintain two separate codebases.
- Single Codebase: One processing pipeline handles both real-time and historical data
- Event Replay: Reprocess historical data by replaying Kafka topics from the beginning (see the sketch after this list)
- Simpler Operations: One system to deploy, monitor, and debug instead of two
- Trade-off: Reprocessing can be slower and more expensive than batch for very large historical datasets
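In practice, replaying a Kafka topic can be as simple as starting the same pipeline code under a fresh consumer group with auto.offset.reset set to earliest, so reading begins at the oldest retained event. A minimal sketch, reusing the broker and topic from the fraud example (the group id is an assumption):
from confluent_kafka import Consumer

replay_consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    # A fresh group id has no committed offsets...
    'group.id': 'fraud-detector-replay-v2',
    # ...so this setting takes effect and replay starts from the beginning.
    'auto.offset.reset': 'earliest',
})
replay_consumer.subscribe(['transactions'])
# From here the event loop is identical to the live pipeline's run() method;
# replay reaches only as far back as the topic's retention allows.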
Micro-Batch: The Middle Ground
Micro-batching processes data in very small batches (every few seconds to minutes), providing near-real-time latency with batch-like simplicity. Spark Structured Streaming is the most popular micro-batch framework:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
from_json, col, window, count, sum as spark_sum
)
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType
spark = SparkSession.builder.appName("MicroBatch").getOrCreate()
# Define schema for incoming Kafka events
schema = (StructType()
    .add("order_id", StringType())
    .add("customer_id", StringType())
    .add("amount", DoubleType())
    .add("timestamp", TimestampType())
)
# Read from Kafka as a streaming DataFrame
orders_stream = (spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka:9092")
.option("subscribe", "orders")
.load()
.select(from_json(col("value").cast("string"), schema).alias("data"))
.select("data.*")
)
# Tumbling window aggregation: revenue per 5-minute window
windowed_revenue = (orders_stream
.withWatermark("timestamp", "10 minutes")
.groupBy(window("timestamp", "5 minutes"))
.agg(
count("order_id").alias("order_count"),
spark_sum("amount").alias("total_revenue"),
)
)
# Write results to a sink (console, database, or another Kafka topic)
query = (windowed_revenue
.writeStream
.outputMode("update")
.format("console")
.option("checkpointLocation", "/tmp/checkpoint/revenue")
.trigger(processingTime="30 seconds") # Micro-batch every 30 seconds
.start()
)
query.awaitTermination()
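The trigger interval is the main tuning knob here: shorter intervals push latency toward true streaming, while longer intervals amortize per-batch overhead and behave more like classic batch.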
Choosing the Right Approach
Decision Framework
- Use Batch when: Data freshness of hours is acceptable, you need to process large historical datasets, cost efficiency is a priority, or you are building data warehouse models with dbt
- Use Streaming when: You need sub-second latency for fraud detection, real-time dashboards, instant notifications, or live operational systems
- Use Micro-Batch when: You need near-real-time (seconds to minutes) but want simpler operations than true streaming — a good middle ground for most analytics use cases
- Use Both when: You need real-time operational views AND accurate historical analytics — stream for freshness, batch for correctness
Key Takeaways
- Batch processing handles bounded datasets with high throughput but higher latency — ideal for analytics and ML training
- Stream processing handles unbounded events with low latency but higher complexity — ideal for real-time applications
- Micro-batching provides a practical middle ground with near-real-time latency and batch-like simplicity
- Lambda Architecture runs both batch and stream in parallel; Kappa simplifies to stream-only with replay
- Most production systems use batch for the majority of workloads and streaming for time-sensitive use cases
- Start with batch — only add streaming when you have a genuine real-time requirement