What is a Lakehouse?
The lakehouse is a modern data architecture that combines the best features of data lakes and data warehouses into a single platform. It stores data in open formats (typically Parquet or ORC) on cheap object storage (like S3) — just like a data lake — but adds a metadata and transaction layer that enables warehouse-like features: ACID transactions, schema enforcement, time travel, and fast SQL analytics. The lakehouse eliminates the need to maintain separate lake and warehouse systems.
The lakehouse architecture emerged from the realization that maintaining both a data lake and a data warehouse is expensive, complex, and creates data silos. Organizations were ETL-ing data from their lake into a warehouse for analytics, duplicating storage costs and introducing latency. The lakehouse collapses these two systems into one, storing data once in an open format while providing the governance, performance, and reliability that analysts expect from a warehouse.
Lakehouse Key Features
- ACID Transactions: Atomic, consistent, isolated, durable operations on data lake files — no more partial writes or corrupted reads
- Schema Enforcement & Evolution: Enforce schemas on write while still allowing columns to be added, renamed, or safely type-promoted (e.g., int to long) as the data evolves
- Time Travel: Query data as it existed at any point in history — essential for debugging, auditing, and reproducible ML
- Unified Batch and Streaming: Same tables support both batch writes and streaming inserts/reads
- Open Formats: Data stored in Parquet/ORC with open metadata — no vendor lock-in. Any engine can read the data (see the sketch after this list).
- Separation of Compute and Storage: Storage on S3/GCS/ADLS; compute from Spark, Trino, Flink, or native engines
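To make the open-format and compute/storage-separation points concrete, here is a minimal sketch of a second engine reading a lakehouse table's Parquet data files directly. It assumes DuckDB is installed and the files are reachable at a hypothetical local path; note that reading raw Parquet bypasses the table format's transaction log, which is exactly why production readers go through a Delta- or Iceberg-aware connector that resolves the log first.
# A minimal sketch: assumes DuckDB is installed and the table's Parquet data
# files are available at a hypothetical path /data/lakehouse/orders/.
# This bypasses the transaction log, so it is for illustration only.
import duckdb

duckdb.sql("""
    SELECT COUNT(*) AS orders, SUM(amount) AS revenue
    FROM read_parquet('/data/lakehouse/orders/*/*.parquet')
""").show()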
Delta Lake
Delta Lake is an open-source storage layer developed by Databricks that adds ACID transactions, scalable metadata handling, and data versioning to data lake storage. It stores data in Parquet files and maintains a transaction log (_delta_log) that tracks every change, enabling time travel, rollback, and concurrent reads/writes.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
# Configure Spark with Delta Lake
builder = (
    SparkSession.builder
    .appName("LakehouseDemo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
# Create a Delta table
orders_df = spark.read.parquet("s3://raw/orders/")
(orders_df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .save("s3://lakehouse/orders/"))
# Read a Delta table
orders = spark.read.format("delta").load("s3://lakehouse/orders/")
orders.show(5)
# ACID: Concurrent writes are safe
# Multiple Spark jobs can write to the same table without corruption
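What makes concurrent writes safe is the transaction log: every committed write appends a numbered JSON commit file under _delta_log, and readers and writers coordinate through it. A minimal sketch for peeking at that log with the session above; this is illustrative only, since DeltaTable.history() (shown in the Time Travel section) is the supported interface:
# Each commit is a JSON-lines file: 00000000000000000000.json, 00000000000000000001.json, ...
log_entries = spark.read.json("s3://lakehouse/orders/_delta_log/*.json")
log_entries.select("commitInfo.operation", "commitInfo.timestamp") \
    .where("commitInfo IS NOT NULL") \
    .show(truncate=False)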
# UPSERT (MERGE): Efficiently update existing rows and insert new ones
from delta.tables import DeltaTable
delta_orders = DeltaTable.forPath(spark, "s3://lakehouse/orders/")
new_data = spark.read.parquet("s3://staging/new_orders/")
delta_orders.alias("target").merge(
new_data.alias("source"),
"target.order_id = source.order_id"
).whenMatchedUpdate(set={
"status": "source.status",
"updated_at": "source.updated_at",
}).whenNotMatchedInsertAll().execute()
print("MERGE complete: upserted orders")
Time Travel
# Time Travel: Query data as it existed at any point
# Read the table as of a specific version
orders_v5 = (spark.read.format("delta")
    .option("versionAsOf", 5)
    .load("s3://lakehouse/orders/"))
# Read the table as of a specific timestamp
orders_yesterday = (spark.read.format("delta")
    .option("timestampAsOf", "2025-03-14T12:00:00")
    .load("s3://lakehouse/orders/"))
# View table history
delta_orders = DeltaTable.forPath(spark, "s3://lakehouse/orders/")
delta_orders.history().show(truncate=False)
# Output:
# | version | timestamp | operation | operationParameters |
# | 6 | 2025-03-15 10:00:00 | MERGE | {predicate: order_id...} |
# | 5 | 2025-03-14 10:00:00 | WRITE | {mode: Overwrite} |
# | 4 | 2025-03-13 10:00:00 | WRITE | {mode: Overwrite} |
# Rollback to a previous version
spark.read.format("delta") .option("versionAsOf", 4) .load("s3://lakehouse/orders/") .write.format("delta") .mode("overwrite") .save("s3://lakehouse/orders/")
Apache Iceberg
Apache Iceberg is an open table format designed for huge analytic datasets. Unlike Delta Lake (which originated from Databricks), Iceberg was designed to be engine-agnostic from the start — it works equally well with Spark, Trino, Flink, Dremio, Snowflake, and many other engines. Iceberg provides ACID transactions, schema evolution, hidden partitioning, and time travel.
Iceberg Unique Features
- Hidden Partitioning: Users write queries without knowing the partition scheme. Iceberg automatically prunes partitions based on query filters. Partition schemes can evolve without rewriting data.
- Engine Agnostic: First-class support in Spark, Trino, Flink, Snowflake, BigQuery, Dremio, and Athena — no vendor lock-in.
- Partition Evolution: Change your partitioning scheme (e.g., from daily to hourly) without rewriting existing data. Old and new partitions coexist seamlessly.
- Row-Level Operations: Efficient UPDATE and DELETE operations on specific rows using merge-on-read or copy-on-write strategies (see the SQL sketch at the end of the walkthrough below).
# Apache Iceberg with PySpark
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("IcebergDemo")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "hadoop")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3://lakehouse/iceberg/")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)
# Create an Iceberg table with hidden partitioning
spark.sql("""
CREATE TABLE lakehouse.analytics.orders (
order_id BIGINT,
customer_id BIGINT,
product_id BIGINT,
amount DECIMAL(12,2),
order_date DATE,
status STRING,
created_at TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(order_date))
""")
# Insert data — partitioning is automatic and hidden from users
orders_df = spark.read.parquet("s3://raw/orders/")
orders_df.writeTo("lakehouse.analytics.orders").append()
# Query with automatic partition pruning
# Users do not need to know the partition scheme
result = spark.sql("""
SELECT status, COUNT(*), SUM(amount)
FROM lakehouse.analytics.orders
WHERE order_date >= '2025-03-01'
AND order_date < '2025-04-01'
GROUP BY status
""")
result.show()
# Partition evolution: switch from daily to hourly partitioning
# No data rewrite needed!
spark.sql("""
ALTER TABLE lakehouse.analytics.orders
ADD PARTITION FIELD hours(created_at)
""")
# Time travel with Iceberg
history = spark.sql("SELECT * FROM lakehouse.analytics.orders.history")
history.show()
# Query a specific snapshot
spark.sql("""
SELECT * FROM lakehouse.analytics.orders
TIMESTAMP AS OF '2025-03-14 12:00:00'
""")
Delta Lake vs Apache Iceberg
| Feature | Delta Lake | Apache Iceberg |
|---|---|---|
| Origin | Databricks (now a Linux Foundation project) | Netflix (now an Apache Software Foundation project) |
| Engine Support | Best with Spark/Databricks; growing elsewhere | Broad: Spark, Trino, Flink, Snowflake, BigQuery |
| Partitioning | Explicit (queries must filter on the partition columns to get pruning) | Hidden (automatic pruning from filters on the source columns) |
| Partition Evolution | Requires data rewrite | In-place evolution without rewrite |
| Time Travel | Yes (version and timestamp) | Yes (snapshot-based) |
| ACID Transactions | Yes | Yes |
| Schema Evolution | Add/rename/reorder columns | Full evolution including type promotion |
| Community | Strong Databricks ecosystem | Growing multi-vendor ecosystem |
The Medallion Architecture in a Lakehouse
# Lakehouse with Medallion Architecture using Delta Lake
# BRONZE: Raw ingestion — append-only, no transformations
(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "orders")
    .load()
    .selectExpr("CAST(value AS STRING) AS json_data", "timestamp AS kafka_timestamp")
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://lakehouse/checkpoints/bronze_orders")
    .toTable("bronze.raw_orders"))
# SILVER: Cleaned, deduplicated, typed data
from delta.tables import DeltaTable
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import (StructType, StructField, LongType, DecimalType,
                               DateType, StringType)

# Assumed schema for the raw order JSON (field names taken from the examples in this section)
order_schema = StructType([
    StructField("order_id", LongType()),
    StructField("customer_id", LongType()),
    StructField("amount", DecimalType(12, 2)),
    StructField("order_date", DateType()),
    StructField("country", StringType()),
    StructField("status", StringType()),
])

silver_df = (spark.read.format("delta").table("bronze.raw_orders")
    .select(from_json(col("json_data"), order_schema).alias("data"))
    .select("data.*")
    .dropDuplicates(["order_id"])
    .filter(col("amount") > 0))
# MERGE into silver (upsert to handle late-arriving updates)
silver_table = DeltaTable.forName(spark, "silver.orders")
silver_table.alias("t").merge(
silver_df.alias("s"), "t.order_id = s.order_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
# GOLD: Business-level aggregates
gold_df = spark.sql("""
SELECT
DATE_TRUNC('month', order_date) AS month,
country,
COUNT(DISTINCT order_id) AS orders,
SUM(amount) AS revenue,
COUNT(DISTINCT customer_id) AS customers
FROM silver.orders
GROUP BY DATE_TRUNC('month', order_date), country
""")
(gold_df.write.format("delta")
    .mode("overwrite")
    .saveAsTable("gold.monthly_revenue"))
When to Choose a Lakehouse
Decision Guide
- Choose Lakehouse when: You need both analytics and ML on the same data, want to avoid duplicating data between lake and warehouse, need open formats for flexibility, or are building a new data platform from scratch
- Choose Cloud Warehouse (Snowflake/BigQuery) when: Your primary use case is BI/analytics, your team is SQL-first, you want a fully managed service, and you do not need ML/data science workloads on the same data
- Choose Both when: Use a lakehouse as the primary storage layer and sync curated datasets to a warehouse for BI. This is common at large organizations.
Key Takeaways
- The lakehouse combines data lake storage (cheap, open formats) with warehouse features (ACID, schema, SQL)
- Delta Lake and Apache Iceberg are the two leading table formats that enable the lakehouse pattern
- Time travel lets you query historical versions of your data — essential for debugging, auditing, and reproducible ML
- The medallion architecture (bronze/silver/gold) organizes lakehouse data by quality level
- Iceberg offers broader engine support and partition evolution; Delta Lake is dominant in the Databricks ecosystem
- The lakehouse is becoming the standard architecture for modern data platforms, unifying analytics and ML workloads