What is a Lakehouse?
The lakehouse is a modern data architecture that combines the best features of data lakes and data warehouses into a single platform. It stores data in open formats (typically Parquet or ORC) on cheap object storage (like S3) — just like a data lake — but adds a metadata and transaction layer that enables warehouse-like features: ACID transactions, schema enforcement, time travel, and fast SQL analytics. The lakehouse eliminates the need to maintain separate lake and warehouse systems.
The lakehouse architecture emerged from the realization that maintaining both a data lake and a data warehouse is expensive, complex, and creates data silos. Organizations were ETL-ing data from their lake into a warehouse for analytics, duplicating storage costs and introducing latency. The lakehouse collapses these two systems into one, storing data once in an open format while providing the governance, performance, and reliability that analysts expect from a warehouse.
Lakehouse Key Features
- ACID Transactions: Atomic, consistent, isolated, durable operations on data lake files — no more partial writes or corrupted reads
- Schema Enforcement & Evolution: Enforce schemas on write while still allowing columns to be added, renamed, or safely type-promoted (e.g., int to long) as the data evolves
- Time Travel: Query data as it existed at any point in history — essential for debugging, auditing, and reproducible ML
- Unified Batch and Streaming: Same tables support both batch writes and streaming inserts/reads
- Open Formats: Data stored in Parquet/ORC with open metadata — no vendor lock-in. Any engine can read the data (see the sketch after this list).
- Separation of Compute and Storage: Storage on S3/GCS/ADLS; compute from Spark, Trino, Flink, or native engines
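To make the open-format and compute/storage-separation points concrete, here is a minimal sketch of a second engine reading a lakehouse table's Parquet data files directly. It assumes DuckDB is installed and the files are reachable at a hypothetical local path; note that reading raw Parquet bypasses the table format's transaction log, which is exactly why production readers go through a Delta- or Iceberg-aware connector that resolves the log first.
# A minimal sketch: assumes DuckDB is installed and the table's Parquet data
# files are available at a hypothetical path /data/lakehouse/orders/.
# This bypasses the transaction log, so it is for illustration only.
import duckdb

duckdb.sql("""
    SELECT COUNT(*) AS orders, SUM(amount) AS revenue
    FROM read_parquet('/data/lakehouse/orders/*/*.parquet')
""").show()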
Delta Lake
Delta Lake is an open-source storage layer developed by Databricks that adds ACID transactions, scalable metadata handling, and data versioning to data lake storage. It stores data in Parquet files and maintains a transaction log (_delta_log) that tracks every change, enabling time travel, rollback, and concurrent reads/writes.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
# Configure Spark with Delta Lake
builder = (
    SparkSession.builder
    .appName("LakehouseDemo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
# Create a Delta table
orders_df = spark.read.parquet("s3://raw/orders/")
(orders_df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .save("s3://lakehouse/orders/"))
# Read a Delta table
orders = spark.read.format("delta").load("s3://lakehouse/orders/")
orders.show(5)
# ACID: Concurrent writes are safe
# Multiple Spark jobs can write to the same table without corruption
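What makes concurrent writes safe is the transaction log: every committed write appends a numbered JSON commit file under _delta_log, and readers and writers coordinate through it. A minimal sketch for peeking at that log with the session above; this is illustrative only, since DeltaTable.history() (shown in the Time Travel section) is the supported interface:
# Each commit is a JSON-lines file: 00000000000000000000.json, 00000000000000000001.json, ...
log_entries = spark.read.json("s3://lakehouse/orders/_delta_log/*.json")
log_entries.select("commitInfo.operation", "commitInfo.timestamp") \
    .where("commitInfo IS NOT NULL") \
    .show(truncate=False)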
# UPSERT (MERGE): Efficiently update existing rows and insert new ones
from delta.tables import DeltaTable
delta_orders = DeltaTable.forPath(spark, "s3://lakehouse/orders/")
new_data = spark.read.parquet("s3://staging/new_orders/")
delta_orders.alias("target").merge(
new_data.alias("source"),
"target.order_id = source.order_id"
).whenMatchedUpdate(set={
"status": "source.status",
"updated_at": "source.updated_at",
}).whenNotMatchedInsertAll().execute()
print("MERGE complete: upserted orders")
Time Travel
# Time Travel: Query data as it existed at any point
# Read the table as of a specific version
orders_v5 = (spark.read.format("delta")
    .option("versionAsOf", 5)
    .load("s3://lakehouse/orders/"))
# Read the table as of a specific timestamp
orders_yesterday = (spark.read.format("delta")
    .option("timestampAsOf", "2025-03-14T12:00:00")
    .load("s3://lakehouse/orders/"))
# View table history
delta_orders = DeltaTable.forPath(spark, "s3://lakehouse/orders/")
delta_orders.history().show(truncate=False)
# Output:
# | version | timestamp | operation | operationParameters |
# | 6 | 2025-03-15 10:00:00 | MERGE | {predicate: order_id...} |
# | 5 | 2025-03-14 10:00:00 | WRITE | {mode: Overwrite} |
# | 4 | 2025-03-13 10:00:00 | WRITE | {mode: Overwrite} |
# Rollback to a previous version
spark.read.format("delta") .option("versionAsOf", 4) .load("s3://lakehouse/orders/") .write.format("delta") .mode("overwrite") .save("s3://lakehouse/orders/")
Apache Iceberg
Apache Iceberg is an open table format designed for huge analytic datasets. Unlike Delta Lake (which originated from Databricks), Iceberg was designed to be engine-agnostic from the start — it works equally well with Spark, Trino, Flink, Dremio, Snowflake, and many other engines. Iceberg provides ACID transactions, schema evolution, hidden partitioning, and time travel.
Iceberg Unique Features
- Hidden Partitioning: Users write queries without knowing the partition scheme. Iceberg automatically prunes partitions based on query filters. Partition schemes can evolve without rewriting data.
- Engine Agnostic: First-class support in Spark, Trino, Flink, Snowflake, BigQuery, Dremio, and Athena — no vendor lock-in.
- Partition Evolution: Change your partitioning scheme (e.g., from daily to hourly) without rewriting existing data. Old and new partitions coexist seamlessly.
- Row-Level Operations: Efficient UPDATE and DELETE operations on specific rows using merge-on-read or copy-on-write strategies (see the SQL sketch at the end of the walkthrough below).
# Apache Iceberg with PySpark
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("IcebergDemo")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "hadoop")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3://lakehouse/iceberg/")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)
# Create an Iceberg table with hidden partitioning
spark.sql("""
CREATE TABLE lakehouse.analytics.orders (
order_id BIGINT,
customer_id BIGINT,
product_id BIGINT,
amount DECIMAL(12,2),
order_date DATE,
status STRING,
created_at TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(order_date))
""")
# Insert data — partitioning is automatic and hidden from users
orders_df = spark.read.parquet("s3://raw/orders/")
orders_df.writeTo("lakehouse.analytics.orders").append()
# Query with automatic partition pruning
# Users do not need to know the partition scheme
result = spark.sql("""
SELECT status, COUNT(*), SUM(amount)
FROM lakehouse.analytics.orders
WHERE order_date >= '2025-03-01'
AND order_date < '2025-04-01'
GROUP BY status
""")
result.show()
# Partition evolution: switch from daily to hourly partitioning
# No data rewrite needed!
spark.sql("""
ALTER TABLE lakehouse.analytics.orders
ADD PARTITION FIELD hours(created_at)
""")
# Time travel with Iceberg
history = spark.sql("SELECT * FROM lakehouse.analytics.orders.history")
history.show()
# Query a specific snapshot
spark.sql("""
SELECT * FROM lakehouse.analytics.orders
TIMESTAMP AS OF '2025-03-14 12:00:00'
""")
Delta Lake vs Apache Iceberg
| Feature | Delta Lake | Apache Iceberg |
|---|---|---|
| Origin | Databricks (now a Linux Foundation project) | Netflix (now an Apache Software Foundation project) |
| Engine Support | Best with Spark/Databricks; growing elsewhere | Broad: Spark, Trino, Flink, Snowflake, BigQuery |
| Partitioning | Explicit (queries must filter on the partition columns to get pruning) | Hidden (automatic pruning from filters on the source columns) |
| Partition Evolution | Requires data rewrite | In-place evolution without rewrite |
| Time Travel | Yes (version and timestamp) | Yes (snapshot-based) |
| ACID Transactions | Yes | Yes |
| Schema Evolution | Add/rename/reorder columns | Full evolution including type promotion |
| Community | Strong Databricks ecosystem | Growing multi-vendor ecosystem |
The Medallion Architecture in a Lakehouse
# Lakehouse with Medallion Architecture using Delta Lake
# BRONZE: Raw ingestion — append-only, no transformations
(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "orders")
    .load()
    .selectExpr("CAST(value AS STRING) AS json_data", "timestamp AS kafka_timestamp")
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://lakehouse/checkpoints/bronze_orders")
    .toTable("bronze.raw_orders"))
# SILVER: Cleaned, deduplicated, typed data
from delta.tables import DeltaTable
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import (StructType, StructField, LongType, DecimalType,
                               DateType, StringType)

# Assumed schema for the raw order JSON (field names taken from the examples in this section)
order_schema = StructType([
    StructField("order_id", LongType()),
    StructField("customer_id", LongType()),
    StructField("amount", DecimalType(12, 2)),
    StructField("order_date", DateType()),
    StructField("country", StringType()),
    StructField("status", StringType()),
])

silver_df = (spark.read.format("delta").table("bronze.raw_orders")
    .select(from_json(col("json_data"), order_schema).alias("data"))
    .select("data.*")
    .dropDuplicates(["order_id"])
    .filter(col("amount") > 0))
# MERGE into silver (upsert to handle late-arriving updates)
silver_table = DeltaTable.forName(spark, "silver.orders")
silver_table.alias("t").merge(
silver_df.alias("s"), "t.order_id = s.order_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
# GOLD: Business-level aggregates
gold_df = spark.sql("""
SELECT
DATE_TRUNC('month', order_date) AS month,
country,
COUNT(DISTINCT order_id) AS orders,
SUM(amount) AS revenue,
COUNT(DISTINCT customer_id) AS customers
FROM silver.orders
GROUP BY DATE_TRUNC('month', order_date), country
""")
(gold_df.write.format("delta")
    .mode("overwrite")
    .saveAsTable("gold.monthly_revenue"))
When to Choose a Lakehouse
Decision Guide
- Choose Lakehouse when: You need both analytics and ML on the same data, want to avoid duplicating data between lake and warehouse, need open formats for flexibility, or are building a new data platform from scratch
- Choose Cloud Warehouse (Snowflake/BigQuery) when: Your primary use case is BI/analytics, your team is SQL-first, you want a fully managed service, and you do not need ML/data science workloads on the same data
- Choose Both when: Use a lakehouse as the primary storage layer and sync curated datasets to a warehouse for BI. This is common at large organizations.
Key Takeaways
- The lakehouse combines data lake storage (cheap, open formats) with warehouse features (ACID, schema, SQL)
- Delta Lake and Apache Iceberg are the two leading table formats that enable the lakehouse pattern
- Time travel lets you query historical versions of your data — essential for debugging, auditing, and reproducible ML
- The medallion architecture (bronze/silver/gold) organizes lakehouse data by quality level
- Iceberg offers broader engine support and partition evolution; Delta Lake is dominant in the Databricks ecosystem
- The lakehouse is becoming the standard architecture for modern data platforms, unifying analytics and ML workloads