Storage Architecture Fundamentals
Choosing the right storage architecture is one of the most consequential decisions in data engineering. The two primary paradigms — data warehouses and data lakes — serve different purposes and have different strengths. Understanding their characteristics, trade-offs, and ideal use cases is essential for designing effective data platforms. A wrong choice can lead to years of technical debt, excessive costs, and frustrated stakeholders.
The key distinction comes down to structure and purpose: warehouses are optimized for structured, curated data and fast analytical queries, while lakes are designed for storing massive volumes of raw data in any format at low cost. Modern architectures increasingly blend both approaches through the lakehouse pattern.
Data Warehouses
A data warehouse is a centralized repository of structured, curated data optimized for analytical queries. Data is organized into schemas with defined types and relationships. Warehouses use columnar storage formats and massively parallel processing (MPP) engines to execute complex SQL queries across billions of rows in seconds.
Warehouse Characteristics
- Schema-on-Write: Data must conform to a defined schema before it is written — types, constraints, and relationships are enforced at load time
- Columnar Storage: Data is stored by column rather than by row, enabling efficient compression and fast analytical queries that scan specific columns (see the short sketch after this list)
- SQL Interface: Primary interaction is through SQL, making warehouses accessible to analysts, engineers, and BI tools alike
- ACID Compliance: Transactions are atomic, consistent, isolated, and durable — ensuring data reliability
- Separation of Compute and Storage: Modern cloud warehouses (Snowflake, BigQuery) let you scale compute and storage independently
- Managed Services: Cloud warehouses handle infrastructure, scaling, backups, and security automatically
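To make the columnar-storage point concrete, here is a small, self-contained sketch. Warehouses use their own internal columnar formats, but open columnar formats such as Parquet follow the same principle, so the example uses pyarrow; the file and column names are made up for illustration. Only the columns a query asks for are decoded from disk, which is why analytical scans over a few columns of a wide table are fast.
import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny table to a local Parquet file (columnar layout on disk)
pq.write_table(
    pa.table({
        "order_id": [1, 2, 3],
        "status": ["shipped", "pending", "shipped"],
        "total_amount": [29.99, 49.99, 19.99],
    }),
    "orders_demo.parquet",
)

# Read back only the columns the query needs; the rest are never decoded
subset = pq.read_table("orders_demo.parquet", columns=["status", "total_amount"])
print(subset.to_pydict())
With those characteristics in mind, defining a fact table in a warehouse looks like this: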
-- Creating a structured table in a data warehouse
CREATE TABLE analytics.fact_orders (
    order_id BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,
    product_id BIGINT NOT NULL,
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    total_amount DECIMAL(12,2) NOT NULL,
    discount_pct DECIMAL(5,2) DEFAULT 0,
    order_date DATE NOT NULL,
    ship_date DATE,
    status VARCHAR(20) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    -- On a range-partitioned table the primary key must include the partition column
    PRIMARY KEY (order_id, order_date)
)
PARTITION BY RANGE (order_date);
-- Partitioning by date for efficient time-range queries
-- Create partitions
CREATE TABLE analytics.fact_orders_2025_q1
    PARTITION OF analytics.fact_orders
    FOR VALUES FROM ('2025-01-01') TO ('2025-04-01');
-- Warehouse query: fast analytical query scanning only relevant columns
SELECT
    DATE_TRUNC('month', order_date) AS month,
    COUNT(DISTINCT customer_id) AS unique_customers,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS revenue,
    AVG(total_amount) AS avg_order_value
FROM analytics.fact_orders
WHERE order_date BETWEEN '2025-01-01' AND '2025-12-31'
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;
Data Lakes
A data lake is a centralized repository that stores raw data in its native format — structured, semi-structured, or unstructured — at virtually any scale. Unlike warehouses, data lakes do not require you to define a schema before writing data. They use cheap object storage (like Amazon S3, Azure Data Lake Storage, or Google Cloud Storage) and apply schema when reading the data (schema-on-read).
Data Lake Characteristics
- Schema-on-Read: Data is stored raw; schema is applied when you query it, allowing maximum flexibility for different consumers
- Any Format: Store CSV, JSON, Parquet, Avro, images, video, logs — any data type in its native format
- Extremely Cheap Storage: Object storage costs pennies per GB per month, enabling petabyte-scale storage
- Separate Compute: Processing engines (Spark, Presto, Athena) run independently from storage
- ML-Friendly: Data scientists can access raw data directly for feature engineering and model training
- Risk of Data Swamp: Without governance, lakes become disorganized dumping grounds with undocumented, stale data
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import date

# Writing data to a data lake in Parquet format (columnar, compressed)
orders_table = pa.table({
    "order_id": [1001, 1002, 1003, 1004, 1005],
    "customer_id": [501, 502, 501, 503, 504],
    "product": ["Widget A", "Widget B", "Widget C", "Widget A", "Widget D"],
    "amount": [29.99, 49.99, 19.99, 29.99, 99.99],
    # DATE values so partition directories come out as order_date=YYYY-MM-DD
    "order_date": [
        date(2025, 3, 1), date(2025, 3, 1),
        date(2025, 3, 2), date(2025, 3, 2),
        date(2025, 3, 3),
    ],
})

# Partition by date for efficient querying
pq.write_to_dataset(
    orders_table,
    root_path="s3://data-lake/raw/orders/",
    partition_cols=["order_date"],
    compression="snappy",
)
# Data lake directory structure:
# s3://data-lake/raw/orders/order_date=2025-03-01/part-0.parquet
# s3://data-lake/raw/orders/order_date=2025-03-02/part-0.parquet
# s3://data-lake/raw/orders/order_date=2025-03-03/part-0.parquet
# Reading data with schema-on-read using Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LakeQuery").getOrCreate()
# Spark infers schema from Parquet metadata
orders_df = spark.read.parquet("s3://data-lake/raw/orders/")
# Partition pruning: only reads partitions matching the filter
recent_orders = orders_df.filter(
    orders_df.order_date >= "2025-03-02"
).groupBy("product").sum("amount")
recent_orders.show()
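Schema inference is convenient, but schema-on-read also means each consumer can impose its own schema at query time. As a rough sketch (the JSON path and field names are illustrative, not part of the dataset above), an explicit schema can be applied while reading raw JSON, so types are enforced on read rather than on load:
from pyspark.sql.types import (
    StructType, StructField, LongType, StringType, DoubleType, DateType
)

# Explicit schema applied at read time: schema-on-read in practice
orders_schema = StructType([
    StructField("order_id", LongType()),
    StructField("customer_id", LongType()),
    StructField("product", StringType()),
    StructField("amount", DoubleType()),
    StructField("order_date", DateType()),
])

# Skip inference and enforce types while reading raw JSON (illustrative path)
raw_orders = spark.read.schema(orders_schema).json("s3://data-lake/raw/orders_json/")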
Warehouse vs Lake: Detailed Comparison
| Dimension | Data Warehouse | Data Lake |
|---|---|---|
| Schema | Schema-on-write (defined upfront) | Schema-on-read (applied at query time) |
| Data Types | Structured (tables, columns) | Any (structured, semi-structured, unstructured) |
| Storage Format | Proprietary columnar format | Open formats (Parquet, ORC, Avro, JSON) |
| Query Performance | Excellent — optimized for SQL analytics | Variable — depends on format and engine |
| Cost | Higher per-GB (compute + storage) | Very low storage; pay per query |
| Users | Analysts, BI tools, dbt | Data scientists, ML engineers, Spark |
| ACID Transactions | Full support | Limited (unless using Delta/Iceberg) |
| Governance | Built-in access controls and auditing | Requires external catalog and policies |
| Examples | Snowflake, BigQuery, Redshift | S3 + Spark, ADLS + Databricks |
The Data Swamp Problem
Without proper governance, data lakes degrade into data swamps — disorganized repositories where data is dumped but never curated, documented, or maintained. Common symptoms include:
- No metadata or documentation about what datasets contain or when they were last updated
- Duplicate datasets with different names and no clear owner
- Stale data that has not been refreshed in months but is still being used in reports
- No access controls — everyone can read and write anything
- Mixed quality — some datasets are curated, others are raw dumps with no validation
Preventing data swamps requires a data catalog (DataHub, Amundsen), clear ownership, naming conventions, retention policies, and automated quality checks. The medallion architecture (bronze/silver/gold) provides a framework for organizing lake data by quality level.
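Catalog tools like DataHub and Amundsen carry most of this weight in practice, but even a lightweight homemade check makes staleness and emptiness visible. The sketch below is a minimal illustration rather than a standard pattern: it assumes an existing Spark session, and the function name, path, and threshold are made up. A real setup would push the resulting metadata into the catalog.
from datetime import datetime, timezone

def dataset_health_check(spark, dataset_path, min_rows=1):
    """Minimal freshness/quality check for a lake dataset (illustrative)."""
    df = spark.read.parquet(dataset_path)
    row_count = df.count()
    return {
        "dataset_path": dataset_path,
        "row_count": row_count,
        "columns": df.columns,
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "passed": row_count >= min_rows,
    }

# Example: fail loudly if a curated dataset is empty before dashboards read it
report = dataset_health_check(spark, "s3://data-lake/silver/orders/")
if not report["passed"]:
    raise ValueError(f"Quality check failed for {report['dataset_path']}")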
The Medallion Architecture
The medallion architecture organizes data lake content into three layers based on quality and refinement level. This prevents the data swamp problem by giving each dataset a clear place in the pipeline:
# Medallion Architecture: Bronze -> Silver -> Gold
# BRONZE LAYER: Raw data as-is from sources
# s3://data-lake/bronze/orders/2025/03/01/raw_orders.json
# - No transformations applied
# - Append-only, immutable
# - Full history preserved
# SILVER LAYER: Cleaned, deduplicated, typed data
# s3://data-lake/silver/orders/order_date=2025-03-01/part-0.parquet
# - Schema enforced (Parquet with defined types)
# - Duplicates removed
# - Invalid records quarantined
# - Standardized column names
# GOLD LAYER: Business-level aggregates ready for analytics
# s3://data-lake/gold/monthly_revenue/month=2025-03/part-0.parquet
# - Pre-aggregated for common queries
# - Optimized for dashboard consumption
# - Conforms to business definitions
# Processing pipeline
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, sum as spark_sum
spark = SparkSession.builder.appName("Medallion").getOrCreate()
# Bronze to Silver
bronze_df = spark.read.json("s3://data-lake/bronze/orders/2025/03/")
silver_df = (bronze_df
    .dropDuplicates(["order_id"])
    .filter(col("order_id").isNotNull())
    .withColumn("amount", col("amount").cast("decimal(10,2)"))
    .withColumn("order_date", col("order_date").cast("date"))
)
silver_df.write.partitionBy("order_date").parquet(
    "s3://data-lake/silver/orders/", mode="overwrite"
)
# Silver to Gold
gold_df = (silver_df
    .groupBy("order_date")
    .agg(
        count("order_id").alias("total_orders"),
        spark_sum("amount").alias("total_revenue"),
    )
)
gold_df.write.parquet("s3://data-lake/gold/daily_revenue/", mode="overwrite")
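The silver-layer notes above mention quarantining invalid records, while the pipeline sketch simply filters them out. One way to add that, sketched below with an illustrative validity rule and quarantine path, is to split the bronze data on the rule and write the failing rows to their own location so they remain inspectable and reprocessable:
# Split bronze records on a validity rule instead of silently dropping them
is_valid = col("order_id").isNotNull() & col("amount").isNotNull()

valid_df = bronze_df.filter(is_valid)
invalid_df = bronze_df.filter(~is_valid)

# Quarantined records stay queryable for debugging and later reprocessing
invalid_df.write.mode("overwrite").json("s3://data-lake/quarantine/orders/2025/03/")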
Choosing the Right Architecture
Decision Guide
- Choose a Warehouse when: Your primary use case is BI/analytics with SQL, you need fast query performance, your data is mostly structured, and governance is a priority
- Choose a Data Lake when: You need to store massive volumes of raw data cheaply, support ML workloads, handle diverse data formats, or need an archive of all historical data
- Choose a Lakehouse when: You want the best of both worlds — cheap lake storage with warehouse-like query performance and ACID transactions (Delta Lake, Apache Iceberg); see the sketch after this list
- Use Both when: You need raw storage and ML support alongside curated BI. Many organizations maintain a lake for raw data and ML and a warehouse for curated analytics, with data flowing from lake to warehouse through transformation pipelines.
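As a rough illustration of the lakehouse idea, the sketch below layers Delta Lake over object storage to get ACID upserts and time travel on otherwise plain lake files. It assumes a Spark session already configured with the delta-spark package; the table path, the merge condition, and the updates_df DataFrame of late-arriving changes are illustrative.
from delta.tables import DeltaTable

# Write a Delta table on lake storage: files stay in the lake, but writes are ACID
silver_df.write.format("delta").mode("overwrite").save("s3://data-lake/lakehouse/orders/")

# Upsert late-arriving changes (updates_df is an assumed DataFrame of corrections)
orders_delta = DeltaTable.forPath(spark, "s3://data-lake/lakehouse/orders/")
(orders_delta.alias("t")
    .merge(updates_df.alias("u"), "t.order_id = u.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as it looked at an earlier version
v0 = spark.read.format("delta").option("versionAsOf", 0).load("s3://data-lake/lakehouse/orders/")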
Key Takeaways
- Data warehouses excel at structured, curated data with fast SQL analytics — ideal for BI and reporting
- Data lakes store raw data of any format at low cost — ideal for ML and data exploration
- Without governance, data lakes become data swamps with undocumented, unreliable data
- The medallion architecture (bronze/silver/gold) organizes lake data by quality level
- The lakehouse pattern combines lake storage with warehouse features using Delta Lake or Apache Iceberg
- Most modern organizations use a combination of lake and warehouse, with the lakehouse bridging the gap