TechLead
Lesson 2 of 22
Data Engineering

ETL vs ELT: Data Integration Patterns

Compare Extract-Transform-Load and Extract-Load-Transform approaches, their trade-offs, and when to use each

Understanding Data Integration

Data integration is the process of combining data from multiple sources into a unified view for analysis. The two dominant paradigms are ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform). Understanding the differences between these approaches is fundamental to designing effective data pipelines, and the choice between them shapes your entire data architecture.

Historically, ETL was the only option because data warehouses had limited compute power and storage was expensive. You had to transform data before loading it because warehouses could not handle raw data at scale. The rise of cloud data warehouses like Snowflake, BigQuery, and Redshift — with virtually unlimited compute and cheap storage — has made ELT the preferred approach for most modern data teams.

ETL: Extract-Transform-Load

In the ETL pattern, data is extracted from source systems, transformed in a staging area (often a dedicated ETL server), and then loaded into the target warehouse in its final form. The transformation happens outside the warehouse, typically using a processing engine like Spark, Python scripts, or a dedicated ETL tool like Informatica or Talend.

ETL Characteristics

  • Transform Before Load: Data is cleaned, validated, and structured before it reaches the warehouse
  • External Processing: Transformations run on a separate compute layer (Spark cluster, ETL server)
  • Data Arrives Clean: The warehouse only contains curated, business-ready data
  • Higher Upfront Cost: Requires maintaining transformation infrastructure and complex mapping logic
  • Less Flexibility: Changing transformations requires modifying pipeline code and reprocessing

Here is a small ETL pipeline using pandas that illustrates the pattern:
import pandas as pd
from sqlalchemy import create_engine

# EXTRACT: Pull data from source database
source_engine = create_engine("postgresql://source_db:5432/app")
raw_orders = pd.read_sql("""
    SELECT order_id, customer_id, product_id, quantity,
           unit_price, order_date, status
    FROM orders
    WHERE order_date >= '2025-01-01'
""", source_engine)

raw_customers = pd.read_sql("""
    SELECT customer_id, name, email, country, created_at
    FROM customers
""", source_engine)

# TRANSFORM: Clean, validate, and join data outside the warehouse
# Remove duplicates
raw_orders = raw_orders.drop_duplicates(subset=["order_id"])

# Calculate derived fields
raw_orders["total_amount"] = raw_orders["quantity"] * raw_orders["unit_price"]
raw_orders["order_date"] = pd.to_datetime(raw_orders["order_date"])
# Use a string month key: pandas Period columns do not load cleanly via to_sql
raw_orders["order_month"] = raw_orders["order_date"].dt.to_period("M").astype(str)

# Join orders with customers
enriched_orders = raw_orders.merge(
    raw_customers[["customer_id", "country"]],
    on="customer_id",
    how="left"
)

# Filter out cancelled orders
enriched_orders = enriched_orders[enriched_orders["status"] != "cancelled"]

# Aggregate to monthly summary
monthly_summary = enriched_orders.groupby(["order_month", "country"]).agg(
    total_orders=("order_id", "count"),
    total_revenue=("total_amount", "sum"),
    unique_customers=("customer_id", "nunique"),
).reset_index()

# LOAD: Write transformed data to the warehouse
warehouse_engine = create_engine("snowflake://warehouse/analytics")
monthly_summary.to_sql(
    "monthly_revenue_by_country",
    warehouse_engine,
    schema="analytics",
    if_exists="replace",
    index=False
)
print(f"Loaded {len(monthly_summary)} rows into analytics.monthly_revenue_by_country")

ELT: Extract-Load-Transform

In the ELT pattern, raw data is extracted from sources and loaded directly into the warehouse (or data lake) without transformation. Transformations are then performed inside the warehouse using SQL. This approach leverages the massive compute power of modern cloud warehouses. The raw data is preserved, and transformations can be changed or added at any time without re-extracting from source systems.

ELT Characteristics

  • Load Then Transform: Raw data lands in the warehouse first, transformations happen in-place using SQL
  • Warehouse-Native Processing: Leverage BigQuery, Snowflake, or Redshift compute for transformations
  • Raw Data Preserved: The original data is always available for new transformations or debugging
  • Faster Time to Value: Data arrives in the warehouse quickly; transformations evolve independently
  • dbt-Powered: dbt has become the standard tool for managing SQL transformations in ELT architectures

Here is the same kind of pipeline expressed as ELT, with the transformations running as SQL inside the warehouse:
-- ELT: Raw data is already loaded into the warehouse by an ingestion tool
-- (Fivetran, Airbyte, etc.) into a raw schema

-- Step 1: Staging model — clean and rename raw data
CREATE OR REPLACE VIEW staging.stg_orders AS
SELECT
    id AS order_id,
    customer_id,
    product_id,
    quantity,
    unit_price,
    quantity * unit_price AS total_amount,
    CAST(created_at AS DATE) AS order_date,
    DATE_TRUNC('month', created_at) AS order_month,
    status,
    _fivetran_synced AS loaded_at
FROM raw.app_public.orders
WHERE status != 'cancelled'
  AND id IS NOT NULL;

-- Step 2: Create enriched model by joining in the warehouse
CREATE OR REPLACE TABLE analytics.orders_enriched AS
SELECT
    o.order_id,
    o.customer_id,
    c.name AS customer_name,
    c.country,
    o.product_id,
    p.product_name,
    p.category,
    o.quantity,
    o.unit_price,
    o.total_amount,
    o.order_date,
    o.order_month
FROM staging.stg_orders o
LEFT JOIN staging.stg_customers c ON o.customer_id = c.customer_id
LEFT JOIN staging.stg_products p ON o.product_id = p.product_id;

-- Step 3: Aggregate for reporting
CREATE OR REPLACE TABLE analytics.monthly_revenue AS
SELECT
    order_month,
    country,
    COUNT(DISTINCT order_id) AS total_orders,
    SUM(total_amount) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers,
    SUM(total_amount) / COUNT(DISTINCT order_id) AS avg_order_value
FROM analytics.orders_enriched
GROUP BY order_month, country
ORDER BY order_month DESC, total_revenue DESC;
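The load-then-transform flow can also be demonstrated end to end in a few lines, with SQLite standing in for the warehouse. This is a toy sketch of the ELT idea, not a production setup: raw data lands untouched, and the transformation is a SQL view created in-place afterward.

```python
import sqlite3

# EXTRACT: raw records from a source system (hardcoded here for illustration)
raw_orders = [
    (1, 101, 2, 50.0, "2025-01-15", "completed"),
    (2, 102, 1, 30.0, "2025-01-20", "cancelled"),
    (3, 101, 3, 20.0, "2025-02-03", "completed"),
]

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse

# LOAD: land the raw data first, with no cleaning or filtering
conn.execute("""
    CREATE TABLE raw_orders (
        id INTEGER, customer_id INTEGER, quantity INTEGER,
        unit_price REAL, created_at TEXT, status TEXT
    )
""")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?, ?, ?)", raw_orders)

# TRANSFORM: in-place, in SQL; raw_orders stays untouched for future models
conn.execute("""
    CREATE VIEW monthly_revenue AS
    SELECT
        substr(created_at, 1, 7)   AS order_month,
        COUNT(*)                   AS total_orders,
        SUM(quantity * unit_price) AS total_revenue
    FROM raw_orders
    WHERE status != 'cancelled'
    GROUP BY order_month
""")

for row in conn.execute("SELECT * FROM monthly_revenue ORDER BY order_month"):
    print(row)
```

Because the view is just SQL over preserved raw data, changing the transformation never requires re-extracting from the source, which is exactly the flexibility argument for ELT.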

ETL vs ELT: Detailed Comparison

| Dimension | ETL | ELT |
|---|---|---|
| Transform Location | External processing engine | Inside the warehouse |
| Primary Language | Python, Java, Scala | SQL |
| Raw Data Preserved | Usually not in warehouse | Yes, always available |
| Flexibility | Requires pipeline changes to add new transforms | Add new SQL models anytime |
| Complexity | Higher: manage separate compute | Lower: warehouse handles compute |
| Cost Model | Pay for ETL servers + warehouse | Pay for warehouse compute + storage |
| Latency | Higher: transform adds time before load | Lower to load; transforms run after |
| Data Volume | Better for very large volumes (Spark) | Great up to petabyte scale in cloud DW |
| Typical Tools | Spark, Informatica, Talend, Airflow | Fivetran + dbt + Snowflake/BigQuery |

When to Use ETL

ETL still makes sense in several scenarios:

  • Sensitive Data: When PII must be masked or encrypted before it reaches the warehouse — you cannot risk raw PII landing in shared storage
  • Complex Transformations: When transformations require procedural logic, ML models, or external API calls that cannot be expressed in SQL
  • Very Large Scale: When processing petabytes of data where Spark's distributed compute is more cost-effective than warehouse compute
  • Legacy Systems: When loading into on-premise data warehouses with limited compute capacity
  • Real-Time Requirements: Stream processing pipelines (Kafka Streams, Flink) inherently follow ETL — transform before writing to the sink
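To illustrate the "complex transformations" point: branching, procedural scoring logic is awkward to express in SQL but trivial in a pre-load transform step. The rules and field names below are hypothetical, purely for illustration:

```python
def risk_score(order: dict) -> float:
    """Procedural scoring logic that is clumsy in SQL but natural in Python.
    (Hypothetical rules, for illustration only.)"""
    score = 0.0
    if order["amount"] > 1000:
        score += 0.4  # unusually large order
    if order["country"] != order["card_country"]:
        score += 0.5  # shipping and payment countries disagree
    if len(order["items"]) > 5 and len(set(order["items"])) == 1:
        score += 0.3  # many copies of a single item
    return min(score, 1.0)

def transform(orders: list[dict]) -> list[dict]:
    """The T of an ETL pipeline: enrich each record before loading."""
    return [{**order, "risk_score": risk_score(order)} for order in orders]

orders = [
    {"amount": 1200, "country": "US", "card_country": "US", "items": ["a", "b"]},
    {"amount": 50, "country": "DE", "card_country": "FR", "items": ["x"] * 6},
]
enriched = transform(orders)
```

In an ELT world the same logic would need to be ported to SQL (or a warehouse UDF); when the logic instead calls an ML model or an external API, transforming before load is often the only practical option.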

When to Use ELT

ELT is the preferred choice for most modern data teams:

  • Cloud Data Warehouses: When using Snowflake, BigQuery, or Redshift with elastic compute
  • Analyst Empowerment: When you want analysts and analytics engineers to own transformations using SQL and dbt
  • Rapid Iteration: When transformation logic changes frequently and you need fast feedback loops
  • Data Exploration: When you want raw data available for ad-hoc analysis and new use cases
  • Managed Ingestion: When using tools like Fivetran or Airbyte that handle extraction and loading automatically

Hybrid Approaches

In practice, most organizations use a hybrid approach. Raw ingestion follows the ELT pattern (load first, transform in the warehouse), but certain pipelines use ETL when the situation demands it. For example:

# Hybrid approach: ETL for sensitive data, ELT for the rest

# ETL: Mask PII before loading into the warehouse
import hashlib

def mask_pii_and_load(records: list[dict]):
    """ETL pipeline that masks PII before warehouse load."""
    masked = []
    for record in records:
        masked.append({
            "user_id": record["user_id"],
            "email_hash": hashlib.sha256(
                record["email"].encode()
            ).hexdigest(),
            "country": record["country"],
            "signup_date": record["signup_date"],
            # Original email is never stored in the warehouse
        })

    # load_to_warehouse is a placeholder for your warehouse loading helper
    load_to_warehouse(masked, "raw.users_masked")

# ELT: Everything else loads raw, transforms in SQL via dbt
# Fivetran syncs orders, products, events into raw schema
# dbt transforms raw -> staging -> analytics
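The hashing step works because SHA-256 is deterministic: the same email always produces the same hash, so masked records stay joinable across tables even though the raw value never reaches the warehouse. A minimal sketch (note the normalization, which the example above omits; without it, "Alice@Example.com" and "alice@example.com" would hash differently):

```python
import hashlib

def mask_email(email: str) -> str:
    """Deterministic one-way hash: joinable across tables, not reversible."""
    # Normalize before hashing so equivalent addresses map to the same hash
    return hashlib.sha256(email.strip().lower().encode()).hexdigest()

h1 = mask_email("Alice@Example.com")
h2 = mask_email("alice@example.com ")
print(h1 == h2)  # prints True: equivalent addresses hash identically
```

One caveat: a plain SHA-256 of an email can be recovered by a dictionary attack, so in practice a keyed hash (e.g. HMAC with a secret key) is the safer choice for PII.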

The ELT Stack in Practice

A typical modern ELT stack looks like this:

# dbt_project.yml — The transformation layer in an ELT architecture
name: 'analytics'
version: '1.0.0'
profile: 'snowflake_prod'

models:
  analytics:
    staging:        # Clean and rename raw data
      +materialized: view
      +schema: staging
    intermediate:   # Business logic and joins
      +materialized: ephemeral
    marts:          # Final tables for analytics
      +materialized: table
      +schema: analytics

# The raw schema is populated by Fivetran/Airbyte (Extract + Load)
# dbt handles all transformations (Transform)
# This separation is the essence of ELT

Key Takeaways

  • ETL transforms data before loading; ELT loads raw data first and transforms in the warehouse
  • ELT is the dominant pattern in the modern data stack thanks to powerful cloud warehouses
  • ETL is still necessary for PII masking, complex non-SQL transforms, and very large scale processing
  • Most organizations use a hybrid approach — ELT for the majority, ETL for special cases
  • dbt has become the standard tool for the Transform step in ELT architectures
  • The choice between ETL and ELT shapes your tool selection, team structure, and data architecture
