Understanding Data Integration
Data integration is the process of combining data from multiple sources into a unified view for analysis. The two dominant paradigms are ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform). Understanding the differences between these approaches is fundamental to designing effective data pipelines, and the choice between them shapes your entire data architecture.
Historically, ETL was the only option because data warehouses had limited compute power and storage was expensive. You had to transform data before loading it because warehouses could not handle raw data at scale. The rise of cloud data warehouses like Snowflake, BigQuery, and Redshift — with virtually unlimited compute and cheap storage — has made ELT the preferred approach for most modern data teams.
ETL: Extract-Transform-Load
In the ETL pattern, data is extracted from source systems, transformed in a staging area (often a dedicated ETL server), and then loaded into the target warehouse in its final form. The transformation happens outside the warehouse, typically using a processing engine like Spark, Python scripts, or a dedicated ETL tool like Informatica or Talend.
ETL Characteristics
- Transform Before Load: Data is cleaned, validated, and structured before it reaches the warehouse
- External Processing: Transformations run on a separate compute layer (Spark cluster, ETL server)
- Data Arrives Clean: The warehouse only contains curated, business-ready data
- Higher Upfront Cost: Requires maintaining transformation infrastructure and complex mapping logic
- Less Flexibility: Changing transformations requires modifying pipeline code and reprocessing
A simplified ETL pipeline in Python using pandas:

import pandas as pd
from sqlalchemy import create_engine

# EXTRACT: Pull data from source database
source_engine = create_engine("postgresql://source_db:5432/app")
raw_orders = pd.read_sql("""
    SELECT order_id, customer_id, product_id, quantity,
           unit_price, order_date, status
    FROM orders
    WHERE order_date >= '2025-01-01'
""", source_engine)
raw_customers = pd.read_sql("""
    SELECT customer_id, name, email, country, created_at
    FROM customers
""", source_engine)

# TRANSFORM: Clean, validate, and join data outside the warehouse
# Remove duplicates
raw_orders = raw_orders.drop_duplicates(subset=["order_id"])

# Calculate derived fields
raw_orders["total_amount"] = raw_orders["quantity"] * raw_orders["unit_price"]
raw_orders["order_date"] = pd.to_datetime(raw_orders["order_date"])
# Store the month as a string: pandas Period values do not serialize
# cleanly through to_sql
raw_orders["order_month"] = raw_orders["order_date"].dt.to_period("M").astype(str)

# Join orders with customers
enriched_orders = raw_orders.merge(
    raw_customers[["customer_id", "country"]],
    on="customer_id",
    how="left",
)

# Filter out cancelled orders
enriched_orders = enriched_orders[enriched_orders["status"] != "cancelled"]

# Aggregate to monthly summary
monthly_summary = enriched_orders.groupby(["order_month", "country"]).agg(
    total_orders=("order_id", "count"),
    total_revenue=("total_amount", "sum"),
    unique_customers=("customer_id", "nunique"),
).reset_index()

# LOAD: Write transformed data to the warehouse
warehouse_engine = create_engine("snowflake://warehouse/analytics")
monthly_summary.to_sql(
    "monthly_revenue_by_country",
    warehouse_engine,
    schema="analytics",
    if_exists="replace",
    index=False,
)
print(f"Loaded {len(monthly_summary)} rows into analytics.monthly_revenue_by_country")
ELT: Extract-Load-Transform
In the ELT pattern, raw data is extracted from sources and loaded directly into the warehouse (or data lake) without transformation. Transformations are then performed inside the warehouse using SQL. This approach leverages the massive compute power of modern cloud warehouses. The raw data is preserved, and transformations can be changed or added at any time without re-extracting from source systems.
ELT Characteristics
- Load Then Transform: Raw data lands in the warehouse first, transformations happen in-place using SQL
- Warehouse-Native Processing: Leverage BigQuery, Snowflake, or Redshift compute for transformations
- Raw Data Preserved: The original data is always available for new transformations or debugging
- Faster Time to Value: Data arrives in the warehouse quickly; transformations evolve independently
- dbt-Powered: dbt has become the standard tool for managing SQL transformations in ELT architectures
-- ELT: Raw data is already loaded into the warehouse by an ingestion tool
-- (Fivetran, Airbyte, etc.) into a raw schema

-- Step 1: Staging model: clean and rename raw data
CREATE OR REPLACE VIEW staging.stg_orders AS
SELECT
    id AS order_id,
    customer_id,
    product_id,
    quantity,
    unit_price,
    quantity * unit_price AS total_amount,
    CAST(created_at AS DATE) AS order_date,
    DATE_TRUNC('month', created_at) AS order_month,
    status,
    _fivetran_synced AS loaded_at
FROM raw.app_public.orders
WHERE status != 'cancelled'
  AND id IS NOT NULL;

-- Step 2: Create enriched model by joining in the warehouse
CREATE OR REPLACE TABLE analytics.orders_enriched AS
SELECT
    o.order_id,
    o.customer_id,
    c.name AS customer_name,
    c.country,
    o.product_id,
    p.product_name,
    p.category,
    o.quantity,
    o.unit_price,
    o.total_amount,
    o.order_date,
    o.order_month
FROM staging.stg_orders o
LEFT JOIN staging.stg_customers c ON o.customer_id = c.customer_id
LEFT JOIN staging.stg_products p ON o.product_id = p.product_id;

-- Step 3: Aggregate for reporting
CREATE OR REPLACE TABLE analytics.monthly_revenue AS
SELECT
    order_month,
    country,
    COUNT(DISTINCT order_id) AS total_orders,
    SUM(total_amount) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers,
    SUM(total_amount) / COUNT(DISTINCT order_id) AS avg_order_value
FROM analytics.orders_enriched
GROUP BY order_month, country
ORDER BY order_month DESC, total_revenue DESC;
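In a dbt project, the same staging logic would live in a model file rather than a hand-run CREATE statement. A minimal sketch, assuming the raw tables have been declared as a dbt source named app_public (the file path and names here are illustrative):

-- models/staging/stg_orders.sql (illustrative; assumes a source named 'app_public')
-- dbt compiles {{ source(...) }} to raw.app_public.orders and issues the
-- CREATE VIEW itself, based on the materialization configured for staging models
SELECT
    id AS order_id,
    customer_id,
    product_id,
    quantity,
    unit_price,
    quantity * unit_price AS total_amount,
    CAST(created_at AS DATE) AS order_date,
    DATE_TRUNC('month', created_at) AS order_month,
    status,
    _fivetran_synced AS loaded_at
FROM {{ source('app_public', 'orders') }}
WHERE status != 'cancelled'
  AND id IS NOT NULL

Downstream models reference this one with {{ ref('stg_orders') }}, which is how dbt infers the dependency graph and builds models in the correct order.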
ETL vs ELT: Detailed Comparison
| Dimension | ETL | ELT |
|---|---|---|
| Transform Location | External processing engine | Inside the warehouse |
| Primary Language | Python, Java, Scala | SQL |
| Raw Data Preserved | Usually not in warehouse | Yes, always available |
| Flexibility | Requires pipeline changes to add new transforms | Add new SQL models anytime |
| Complexity | Higher — manage separate compute | Lower — warehouse handles compute |
| Cost Model | Pay for ETL servers + warehouse | Pay for warehouse compute + storage |
| Latency | Higher: transformation must finish before data lands | Lower: raw data lands fast, transforms run afterward |
| Data Volume | Better for very large volumes (Spark) | Great up to petabyte scale in cloud DW |
| Typical Tools | Spark, Informatica, Talend, Airflow | Fivetran + dbt + Snowflake/BigQuery |
When to Use ETL
ETL still makes sense in several scenarios:
- Sensitive Data: When PII must be masked or encrypted before it reaches the warehouse — you cannot risk raw PII landing in shared storage
- Complex Transformations: When transformations require procedural logic, ML models, or external API calls that cannot be expressed in SQL (see the sketch after this list)
- Very Large Scale: When processing petabytes of data where Spark's distributed compute is more cost-effective than warehouse compute
- Legacy Systems: When loading into on-premise data warehouses with limited compute capacity
- Real-Time Requirements: Stream processing pipelines (Kafka Streams, Flink) inherently follow ETL — transform before writing to the sink
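To make the second point concrete, here is a minimal sketch of a transform step that SQL cannot express: scoring each record with a Python ML model before load. The churn_model object and the feature names are hypothetical placeholders, standing in for any trained model that follows the scikit-learn predict_proba convention.

def score_customers(rows: list[dict], churn_model) -> list[dict]:
    """Attach a churn score to each customer record before warehouse load."""
    scored = []
    for row in rows:
        # Feature names are illustrative; any per-row Python logic fits here
        features = [row["orders_last_90d"], row["days_since_last_order"]]
        row = dict(row)  # avoid mutating the caller's data
        # predict_proba is the scikit-learn convention: column 1 is the
        # probability of the positive (churn) class
        row["churn_score"] = float(churn_model.predict_proba([features])[0][1])
        scored.append(row)
    return scored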
When to Use ELT
ELT is the preferred choice for most modern data teams:
- Cloud Data Warehouses: When using Snowflake, BigQuery, or Redshift with elastic compute
- Analyst Empowerment: When you want analysts and analytics engineers to own transformations using SQL and dbt
- Rapid Iteration: When transformation logic changes frequently and you need fast feedback loops
- Data Exploration: When you want raw data available for ad-hoc analysis and new use cases
- Managed Ingestion: When using tools like Fivetran or Airbyte that handle extraction and loading automatically
Hybrid Approaches
In practice, most organizations use a hybrid approach. Raw ingestion follows the ELT pattern (load first, transform in the warehouse), but certain pipelines use ETL when the situation demands it. For example:
# Hybrid approach: ETL for sensitive data, ELT for the rest
import hashlib

# ETL: Mask PII before loading into the warehouse
def mask_pii_and_load(records: list[dict]):
    """ETL pipeline that masks PII before warehouse load."""
    masked = []
    for record in records:
        masked.append({
            "user_id": record["user_id"],
            "email_hash": hashlib.sha256(
                record["email"].encode()
            ).hexdigest(),
            "country": record["country"],
            "signup_date": record["signup_date"],
            # Original email is never stored in the warehouse
        })
    # load_to_warehouse is assumed to be defined elsewhere in the pipeline
    load_to_warehouse(masked, "raw.users_masked")

# ELT: Everything else loads raw, transforms in SQL via dbt
# Fivetran syncs orders, products, events into raw schema
# dbt transforms raw -> staging -> analytics
The ELT Stack in Practice
A typical modern ELT stack looks like this:
# dbt_project.yml: the transformation layer in an ELT architecture
name: 'analytics'
version: '1.0.0'
profile: 'snowflake_prod'

models:
  analytics:
    staging:        # Clean and rename raw data
      +materialized: view
      +schema: staging
    intermediate:   # Business logic and joins
      +materialized: ephemeral
    marts:          # Final tables for analytics
      +materialized: table
      +schema: analytics

# The raw schema is populated by Fivetran/Airbyte (Extract + Load)
# dbt handles all transformations (Transform)
# This separation is the essence of ELT
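With this layout, dbt run builds the models in dependency order inside the warehouse: staging models as views, intermediate models inlined as CTEs into whatever selects from them (that is what ephemeral means), and marts as tables in the analytics schema.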
Key Takeaways
- ETL transforms data before loading; ELT loads raw data first and transforms in the warehouse
- ELT is the dominant pattern in the modern data stack thanks to powerful cloud warehouses
- ETL is still necessary for PII masking, complex non-SQL transforms, and very large scale processing
- Most organizations use a hybrid approach — ELT for the majority, ETL for special cases
- dbt has become the standard tool for the Transform step in ELT architectures
- The choice between ETL and ELT shapes your tool selection, team structure, and data architecture