Two Paradigms of Data Processing
Data processing can be broadly categorized into two paradigms: batch processing and stream processing. Batch processing operates on large, bounded datasets — collecting data over a period of time and processing it all at once. Stream processing operates on data as it arrives, processing each event individually or in micro-batches with minimal latency. Every data engineer must understand both paradigms and know when each is appropriate.
The choice between batch and stream processing affects every layer of your architecture: ingestion, storage, transformation, and serving. It also impacts cost, complexity, and the freshness of data available to your users. Most production systems use a combination of both approaches.
Batch Processing
Batch processing collects data over a defined period (hourly, daily, weekly) and processes it as a single unit. This is the traditional approach to data processing and remains the backbone of most analytics platforms. Batch jobs typically run on a schedule — a nightly job that processes the previous day's data, an hourly aggregation of clickstream events, or a weekly report generation pipeline.
Batch Processing Characteristics
- Bounded Data: Processes a finite, well-defined dataset (e.g., "all orders from yesterday")
- High Throughput: Optimized for processing large volumes efficiently — can crunch terabytes in a single run
- Higher Latency: Data is not available until the batch completes — typically minutes to hours of delay
- Simpler Error Handling: Failed batches can be retried entirely; easier to reason about correctness
- Cost Efficient: Compute resources are used only during batch windows; can use spot/preemptible instances
- Mature Tooling: Well-established tools and patterns — Spark, SQL, dbt, Airflow
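The example below sketches a typical nightly batch job in PySpark: read yesterday's raw orders, clean and enrich them, aggregate, and write a gold-layer summary.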
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, count, countDistinct, sum as spark_sum, avg,
)
spark = (SparkSession.builder
    .appName("DailyOrderBatch")
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)
# Read yesterday's orders from the data lake. The date is hardcoded for the
# example; a scheduler like Airflow would inject the run date here.
yesterday = "2025-03-15"
orders = spark.read.parquet(f"s3://data-lake/raw/orders/date={yesterday}/")
products = spark.read.parquet("s3://data-lake/silver/products/")
customers = spark.read.parquet("s3://data-lake/silver/customers/")
# Batch transformation: clean, enrich, aggregate
clean_orders = (orders
.dropDuplicates(["order_id"])
.filter(col("status") != "cancelled")
.filter(col("amount") > 0)
)
# Enrich with dimensions
enriched = (clean_orders
.join(customers, "customer_id", "left")
.join(products, "product_id", "left")
.select(
"order_id", "customer_id", "customer_name",
"customer_segment", "product_id", "product_name",
"category", "amount", "quantity", "order_date"
)
)
# Aggregate for the daily summary table
daily_summary = enriched.groupBy("category", "customer_segment").agg(
count("order_id").alias("total_orders"),
spark_sum("amount").alias("total_revenue"),
avg("amount").alias("avg_order_value"),
count("customer_id").alias("unique_customers"),
)
# Write to the warehouse / gold layer
(daily_summary.write
    .mode("overwrite")
    .partitionBy("category")
    .parquet(f"s3://data-lake/gold/daily_summary/date={yesterday}/")
)
print(f"Batch complete: processed {clean_orders.count()} orders for {yesterday}")
Stream Processing
Stream processing handles data as it arrives, event by event or in micro-batches (sub-second to seconds). Instead of waiting for data to accumulate, stream processors react to each event in near real-time. This enables use cases like fraud detection, real-time dashboards, live recommendations, and instant alerting that batch latency makes impractical.
Stream Processing Characteristics
- Unbounded Data: Processes a continuous, never-ending stream of events
- Low Latency: Data is processed within milliseconds to seconds of arrival
- Event-at-a-Time: Each event is processed individually (or in small micro-batches)
- Windowing: Aggregations use time windows (tumbling, sliding, session) to group events
- Complex State Management: Maintaining state across events requires careful handling of checkpoints and exactly-once semantics
- Always-On: Stream processors run continuously, requiring robust monitoring and fault tolerance
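The sketch below implements a simple stream processor with the confluent_kafka client: it consumes transaction events, keeps a five-minute sliding window of state per customer, and emits fraud alerts to a second topic.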
from confluent_kafka import Consumer, Producer
import json
from datetime import datetime, timedelta
from collections import defaultdict
# Stream processor: Real-time fraud detection
class FraudDetector:
def __init__(self):
self.consumer = Consumer({
'bootstrap.servers': 'kafka:9092',
'group.id': 'fraud-detector',
'auto.offset.reset': 'latest',
})
self.producer = Producer({'bootstrap.servers': 'kafka:9092'})
self.consumer.subscribe(['transactions'])
        # In-memory state: track recent transactions per customer.
        # Note: this state is lost on restart; production processors persist
        # it with checkpoints (see "Complex State Management" above).
        self.customer_transactions = defaultdict(list)
self.VELOCITY_WINDOW = timedelta(minutes=5)
self.VELOCITY_THRESHOLD = 5 # Max 5 txns in 5 minutes
self.AMOUNT_THRESHOLD = 10000 # Flag large transactions
def process_event(self, event: dict):
"""Process each transaction event in real-time."""
customer_id = event["customer_id"]
amount = event["amount"]
timestamp = datetime.fromisoformat(event["timestamp"])
# Clean old transactions from the window
cutoff = timestamp - self.VELOCITY_WINDOW
self.customer_transactions[customer_id] = [
t for t in self.customer_transactions[customer_id]
if t["timestamp"] > cutoff
]
# Add current transaction
self.customer_transactions[customer_id].append({
"amount": amount,
"timestamp": timestamp,
})
# Check fraud rules
alerts = []
window_txns = self.customer_transactions[customer_id]
if len(window_txns) > self.VELOCITY_THRESHOLD:
alerts.append(f"High velocity: {len(window_txns)} transactions in 5 min")
if amount > self.AMOUNT_THRESHOLD:
alerts.append(f"Large transaction: {amount}")
if alerts:
alert = {
"customer_id": customer_id,
"transaction_id": event["transaction_id"],
"amount": amount,
"alerts": alerts,
"timestamp": timestamp.isoformat(),
}
self.producer.produce(
"fraud-alerts",
value=json.dumps(alert).encode()
)
            # flush() blocks until delivery; fine for a demo, though batching
            # deliveries with poll(0) is more typical in production
            self.producer.flush()
def run(self):
"""Main event loop — runs continuously."""
print("Starting fraud detection stream processor...")
while True:
msg = self.consumer.poll(timeout=1.0)
if msg is None:
continue
if msg.error():
print(f"Consumer error: {msg.error()}")
continue
event = json.loads(msg.value().decode())
self.process_event(event)
if __name__ == "__main__":
    detector = FraudDetector()
    detector.run()
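To exercise the detector end to end, publish a burst of test events to the transactions topic. A minimal sketch, assuming a local broker at kafka:9092 and the topic and field names used above (the customer and transaction IDs are made up):
from confluent_kafka import Producer
import json
from datetime import datetime, timezone

producer = Producer({'bootstrap.servers': 'kafka:9092'})
# Six rapid transactions for one customer: the sixth should trip the
# velocity rule (more than 5 transactions inside the 5-minute window).
for i in range(6):
    event = {
        "transaction_id": f"txn-{i}",
        "customer_id": "cust-42",
        "amount": 250.0,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    producer.produce("transactions", value=json.dumps(event).encode())
producer.flush()
Alerts for the later events should then appear on the fraud-alerts topic almost immediately, illustrating the event-at-a-time latency described above.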
Batch vs Stream: Detailed Comparison
| Dimension | Batch Processing | Stream Processing |
|---|---|---|
| Data | Bounded, finite datasets | Unbounded, continuous events |
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high (optimized for volume) | High (but per-event overhead) |
| Complexity | Lower — simpler to debug and reason about | Higher — state management, ordering, windowing |
| Error Handling | Retry entire batch | Dead-letter queues, checkpointing |
| Cost | Pay for batch window only | Always-on compute costs |
| Use Cases | Reports, ML training, data warehouse loads | Fraud detection, real-time dashboards, alerting |
| Tools | Spark, dbt, SQL, Airflow | Kafka Streams, Flink, Spark Structured Streaming |
The Lambda Architecture
The Lambda Architecture combines batch and stream processing to serve both historical and real-time queries. It runs two parallel pipelines: a batch layer that processes complete historical data for accuracy, and a speed layer that processes real-time events for freshness. A serving layer merges results from both.
# Lambda Architecture Layout
#
# Source Data ──┬──> Batch Layer (Spark, daily jobs)
#               │        └──> Batch Views (warehouse tables)
#               │                     │
#               │              Serving Layer ──> Queries
#               │                     │
#               └──> Speed Layer (Kafka Streams, real-time)
#                        └──> Real-time Views (Redis, live tables)
#
# Pros: Accurate historical data + real-time freshness
# Cons: Two codebases to maintain, complexity of merging results
# Example: Batch layer (runs nightly)
batch_daily_revenue:
schedule: "0 2 * * *" # 2 AM daily
query: |
INSERT INTO analytics.daily_revenue
SELECT date, SUM(amount), COUNT(*)
FROM raw.orders
WHERE date = CURRENT_DATE - 1
GROUP BY date
# Speed layer (runs continuously)
stream_live_revenue:
source: kafka://orders
processor: aggregate_revenue
sink: redis://live-metrics/revenue
window: tumbling_1_minute
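The serving layer is where the two pipelines meet: queries for closed days hit the accurate batch views, while queries for the current day fall back to the speed layer. A minimal sketch of that merge logic, using in-memory dicts to stand in for the warehouse table and the Redis live view (the dates and figures are made up):
# Batch view: accurate, but only covers fully processed days.
batch_view = {"2025-03-14": 120000.0, "2025-03-15": 98500.0}
# Speed view: approximate, but covers the current day in near real-time.
speed_view = {"2025-03-16": 41230.0}

def revenue_for(date_str: str) -> float:
    """Prefer the batch view; fall back to the live view for open days."""
    if date_str in batch_view:
        return batch_view[date_str]
    return speed_view.get(date_str, 0.0)

print(revenue_for("2025-03-15"))  # served from the batch layer
print(revenue_for("2025-03-16"))  # served from the speed layer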
The Kappa Architecture
The Kappa Architecture simplifies Lambda by using only a stream processing layer. All data — both historical and real-time — is processed through the same streaming pipeline. Historical reprocessing is done by replaying events from the stream (e.g., Kafka with long retention). This eliminates the need to maintain two separate codebases.
- Single Codebase: One processing pipeline handles both real-time and historical data
- Event Replay: Reprocess historical data by replaying Kafka topics from the beginning (see the sketch after this list)
- Simpler Operations: One system to deploy, monitor, and debug instead of two
- Trade-off: Reprocessing can be slower and more expensive than batch for very large historical datasets
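In practice, replaying a Kafka topic can be as simple as starting the same pipeline code under a fresh consumer group with auto.offset.reset set to earliest, so reading begins at the oldest retained event. A minimal sketch, reusing the broker and topic from the fraud example (the group id is an assumption):
from confluent_kafka import Consumer

replay_consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    # A fresh group id has no committed offsets...
    'group.id': 'fraud-detector-replay-v2',
    # ...so this setting takes effect and replay starts from the beginning.
    'auto.offset.reset': 'earliest',
})
replay_consumer.subscribe(['transactions'])
# From here the event loop is identical to the live pipeline's run() method;
# replay reaches only as far back as the topic's retention allows.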
Micro-Batch: The Middle Ground
Micro-batching processes data in very small batches (every few seconds to minutes), providing near-real-time latency with batch-like simplicity. Spark Structured Streaming is the most popular micro-batch framework:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
from_json, col, window, count, sum as spark_sum
)
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType
spark = SparkSession.builder.appName("MicroBatch").getOrCreate()
# Define schema for incoming Kafka events
schema = (StructType()
    .add("order_id", StringType())
    .add("customer_id", StringType())
    .add("amount", DoubleType())
    .add("timestamp", TimestampType())
)
# Read from Kafka as a streaming DataFrame
orders_stream = (spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka:9092")
.option("subscribe", "orders")
.load()
.select(from_json(col("value").cast("string"), schema).alias("data"))
.select("data.*")
)
# Tumbling window aggregation: revenue per 5-minute window
windowed_revenue = (orders_stream
.withWatermark("timestamp", "10 minutes")
.groupBy(window("timestamp", "5 minutes"))
.agg(
count("order_id").alias("order_count"),
spark_sum("amount").alias("total_revenue"),
)
)
# Write results to a sink (console, database, or another Kafka topic)
query = (windowed_revenue
.writeStream
.outputMode("update")
.format("console")
.option("checkpointLocation", "/tmp/checkpoint/revenue")
.trigger(processingTime="30 seconds") # Micro-batch every 30 seconds
.start()
)
query.awaitTermination()
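The trigger interval is the main tuning knob here: shorter intervals push latency toward true streaming, while longer intervals amortize per-batch overhead and behave more like classic batch.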
Choosing the Right Approach
Decision Framework
- Use Batch when: Data freshness of hours is acceptable, you need to process large historical datasets, cost efficiency is a priority, or you are building data warehouse models with dbt
- Use Streaming when: You need sub-second latency for fraud detection, real-time dashboards, instant notifications, or live operational systems
- Use Micro-Batch when: You need near-real-time (seconds to minutes) but want simpler operations than true streaming — a good middle ground for most analytics use cases
- Use Both when: You need real-time operational views AND accurate historical analytics — stream for freshness, batch for correctness
Key Takeaways
- Batch processing handles bounded datasets with high throughput but higher latency — ideal for analytics and ML training
- Stream processing handles unbounded events with low latency but higher complexity — ideal for real-time applications
- Micro-batching provides a practical middle ground with near-real-time latency and batch-like simplicity
- Lambda Architecture runs both batch and stream in parallel; Kappa simplifies to stream-only with replay
- Most production systems use batch for the majority of workloads and streaming for time-sensitive use cases
- Start with batch — only add streaming when you have a genuine real-time requirement