What is dbt?
dbt (data build tool) is the standard tool for managing SQL-based data transformations in modern data warehouses. It enables data teams to write modular SQL models, test data quality, generate documentation, and manage dependencies — all using software engineering best practices like version control, code review, and CI/CD. dbt handles the "T" in ELT, transforming raw data that has already been loaded into your warehouse.
dbt fundamentally changed how data teams work by bringing software engineering discipline to analytics. Before dbt, transformations lived in stored procedures, ad-hoc scripts, or GUI-based ETL tools with no version control and no testing. dbt treats SQL transformations as code — versioned in Git, tested automatically, documented with descriptions and lineage graphs, and deployed through CI/CD pipelines.
Why dbt?
- SQL-First: Write transformations in SQL — the language your analysts already know. No need to learn Python, Scala, or proprietary languages.
- Version Controlled: All models live in Git. Review changes via pull requests. See the history of every transformation.
- Tested: Built-in testing framework validates data quality — not null checks, uniqueness, referential integrity, and custom assertions.
- Documented: Auto-generated documentation includes model descriptions, column definitions, and a visual lineage graph showing data flow.
- Modular: Models reference each other using ref(), creating a dependency DAG that dbt executes in the correct order.
- Environment-Aware: Same code runs in dev, staging, and production with different targets — no manual schema changes.
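The built-in tests mentioned above are declared in YAML alongside the models rather than written as SQL. A minimal sketch, using the stg_orders and stg_customers model names that appear later in this article:

```yaml
# models/staging/_stg_models.yml (illustrative)
version: 2

models:
  - name: stg_orders
    description: "Cleaned orders from the app database"
    columns:
      - name: order_id
        description: "Primary key of the order"
        tests:
          - not_null
          - unique
      - name: customer_id
        tests:
          - not_null
          # Referential integrity: every customer_id must exist in stg_customers
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```

Running dbt test compiles each of these declarations into a SQL query that returns failing rows; a test passes when the query returns zero rows.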
dbt Project Structure
# Standard dbt project layout
my_dbt_project/
    dbt_project.yml              # Project configuration
    profiles.yml                 # Connection profiles (local only, not in Git)
    packages.yml                 # External dbt packages
    models/
        staging/                 # Layer 1: Clean raw data
            _stg_models.yml      # Schema definitions and tests
            stg_orders.sql
            stg_customers.sql
            stg_products.sql
        intermediate/            # Layer 2: Business logic
            _int_models.yml
            int_orders_enriched.sql
        marts/                   # Layer 3: Final tables for consumers
            _marts_models.yml
            fct_orders.sql
            dim_customer.sql
            dim_product.sql
            dim_date.sql
        metrics/                 # Layer 4: Semantic layer definitions
            revenue_metrics.yml
    tests/                       # Custom data tests
        assert_positive_revenue.sql
    macros/                      # Reusable SQL functions
        generate_schema_name.sql
        cents_to_dollars.sql
    seeds/                       # Static CSV data (country codes, mappings)
        country_codes.csv
    snapshots/                   # SCD Type 2 tracking
        snp_customers.sql
    analyses/                    # Ad-hoc analytical queries
dbt Project Configuration
# dbt_project.yml
name: 'analytics'
version: '1.0.0'
config-version: 2
profile: 'analytics'
model-paths: ["models"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
analysis-paths: ["analyses"]
# Materialization defaults per folder
models:
  analytics:
    staging:
      +materialized: view        # Views are rebuilt on each query
      +schema: staging
    intermediate:
      +materialized: ephemeral   # CTEs, not persisted
    marts:
      +materialized: table       # Physical tables for performance
      +schema: analytics
      +tags: ['daily']

# Variables for dynamic SQL
vars:
  start_date: '2024-01-01'
  default_country: 'US'
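Variables declared under vars: are read in models with dbt's var() function; the second argument is a fallback used when the variable is not set. A sketch (the model it filters is hypothetical here):

```sql
-- Reading a project variable with var(); note the quotes, since
-- Jinja interpolates the raw value into the SQL string
SELECT *
FROM {{ ref('stg_orders') }}
WHERE order_date >= '{{ var("start_date", "2024-01-01") }}'
```

Variables can also be overridden at runtime with dbt run --vars '{start_date: "2024-06-01"}'.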
# profiles.yml (in ~/.dbt/, NOT in the Git repo)
analytics:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: xy12345.us-east-1
      user: dev_user
      password: "{{ env_var('SNOWFLAKE_DEV_PASSWORD') }}"
      role: DEV_ROLE
      database: ANALYTICS_DEV
      warehouse: DEV_WH
      schema: "dbt_{{ env_var('USER') }}"  # Each dev gets their own schema
      threads: 4
    prod:
      type: snowflake
      account: xy12345.us-east-1
      user: dbt_prod_user
      password: "{{ env_var('SNOWFLAKE_PROD_PASSWORD') }}"
      role: PROD_ROLE
      database: ANALYTICS
      warehouse: PROD_WH
      schema: analytics
      threads: 8
Writing dbt Models
A dbt model is simply a SQL SELECT statement in a .sql file. dbt wraps it with the appropriate DDL
(CREATE TABLE, CREATE VIEW, or MERGE) based on the materialization strategy. Models reference each
other using the ref() function, which dbt uses to build the dependency graph.
-- models/staging/stg_orders.sql
-- Staging model: clean and rename raw source data
WITH source AS (
    SELECT * FROM {{ source('app_database', 'orders') }}
),

renamed AS (
    SELECT
        id AS order_id,
        user_id AS customer_id,
        product_id,
        qty AS quantity,
        price_cents / 100.0 AS unit_price,
        (qty * price_cents) / 100.0 AS total_amount,
        CAST(created_at AS DATE) AS order_date,
        UPPER(TRIM(status)) AS status,
        created_at,
        updated_at,
        _fivetran_synced AS loaded_at
    FROM source
    WHERE id IS NOT NULL
      AND qty > 0
)

SELECT * FROM renamed
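The source() call above only works if the raw table is declared as a dbt source, typically in the staging YAML file. A sketch of that declaration; the database and schema names are assumptions, and the loaded_at_field matches the _fivetran_synced column used in the model:

```yaml
# models/staging/_stg_models.yml (source block; database/schema are illustrative)
version: 2

sources:
  - name: app_database
    database: raw              # where the EL tool lands the data (assumption)
    schema: app
    loaded_at_field: _fivetran_synced
    # Freshness checks fail CI if the loader falls behind
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
      - name: customers
      - name: products
```

With this in place, dbt source freshness reports how stale each source table is relative to its loaded_at_field.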
-- models/marts/fct_orders.sql
-- Final fact table referencing staging models
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

customers AS (
    SELECT * FROM {{ ref('stg_customers') }}
),

products AS (
    SELECT * FROM {{ ref('stg_products') }}
),

final AS (
    SELECT
        o.order_id,
        o.order_date,
        DATE_TRUNC('month', o.order_date) AS order_month,

        -- Customer attributes
        o.customer_id,
        c.customer_name,
        c.country,
        c.segment AS customer_segment,

        -- Product attributes
        o.product_id,
        p.product_name,
        p.category,

        -- Measures
        o.quantity,
        o.unit_price,
        o.total_amount,
        p.cost * o.quantity AS total_cost,
        o.total_amount - (p.cost * o.quantity) AS profit,

        -- Derived
        CASE
            WHEN o.total_amount > 500 THEN 'premium'
            WHEN o.total_amount > 100 THEN 'standard'
            ELSE 'basic'
        END AS order_tier,

        o.status
    FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    LEFT JOIN products p ON o.product_id = p.product_id
    WHERE o.status != 'CANCELLED'
)

SELECT * FROM final
Materialization Strategies
| Strategy | What It Creates | When to Use |
|---|---|---|
| view | SQL VIEW | Staging models, lightweight transforms, data that does not need to be persisted |
| table | Physical TABLE (drop + recreate) | Final marts, heavily-queried models, complex transforms |
| incremental | TABLE with MERGE/INSERT | Large tables where full rebuild is too slow — only new/changed rows are processed |
| ephemeral | CTE (no DB object) | Intermediate logic that other models reference but does not need its own table |
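Incremental is the only strategy in the table that changes the SQL you write: on incremental runs, the model should filter down to new or changed rows. A sketch using dbt's is_incremental() macro and the columns from stg_orders above (the model name is illustrative):

```sql
-- models/marts/fct_orders_incremental.sql (illustrative name)
{{ config(
    materialized='incremental',
    unique_key='order_id'   -- rows with a matching key are updated, not duplicated
) }}

SELECT
    order_id,
    customer_id,
    total_amount,
    order_date,
    updated_at
FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- Only true when the target table already exists and --full-refresh
  -- was not passed; {{ this }} refers to the existing target table
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
```

On the first run (or with --full-refresh) the filter is omitted and the table is built from scratch; afterwards dbt merges only the filtered rows.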
Running dbt
# Run all models
dbt run
# Run a specific model and its downstream dependents
dbt run --select fct_orders+
# Run all models with a tag
dbt run --select tag:daily
# Run tests
dbt test
# Generate documentation
dbt docs generate
dbt docs serve
# Full refresh of incremental models
dbt run --full-refresh --select incremental_model
# Compile SQL without executing (debug)
dbt compile --select fct_orders
# Selector syntax examples
dbt run --select staging.* # All models in staging folder
dbt run --select +fct_orders # fct_orders and all upstream
dbt run --select fct_orders+ # fct_orders and all downstream
dbt run --select +fct_orders+ # Full lineage: upstream + downstream
dbt run --select @fct_orders # fct_orders, its descendants, and the ancestors of those descendants
Jinja Templating in dbt
dbt uses Jinja templating to add dynamic logic to SQL. This enables conditional SQL, loops, and variable interpolation. While powerful, Jinja should be used sparingly — the goal is readable SQL, not a programming language embedded in templates.
-- Conditional logic with Jinja
SELECT
order_id,
customer_id,
total_amount,
order_date
{% if target.name == 'dev' %}
-- In dev, limit data for faster iteration
FROM {{ ref('stg_orders') }}
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
{% else %}
-- In prod, process all data
FROM {{ ref('stg_orders') }}
{% endif %}
-- Looping: Generate UNION ALL across multiple sources
{% set payment_methods = ['credit_card', 'paypal', 'bank_transfer', 'apple_pay'] %}
SELECT
order_id,
{% for method in payment_methods %}
SUM(CASE WHEN payment_method = '{{ method }}' THEN amount ELSE 0 END) AS {{ method }}_amount
{% if not loop.last %},{% endif %}
{% endfor %}
FROM payments
GROUP BY order_id
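When the same Jinja logic appears in several models, it belongs in a macro. The cents_to_dollars.sql file from the project layout might look like this; it is a sketch, and the exact signature is a choice rather than a dbt convention:

```sql
-- macros/cents_to_dollars.sql
-- Wraps a cents column in a conversion expression; precision is optional
{% macro cents_to_dollars(column_name, precision=2) %}
    ROUND({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}

-- Usage inside a model:
-- SELECT {{ cents_to_dollars('price_cents') }} AS unit_price
-- FROM {{ source('app_database', 'orders') }}
```

Because macros are expanded at compile time, dbt compile shows the plain SQL that actually runs, which is the easiest way to debug them.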
dbt Sources vs Seeds vs Models
- Sources: External tables already in your warehouse (loaded by Fivetran, Airbyte, etc.). Referenced with source(). Support freshness checks.
- Seeds: Small CSV files version-controlled in your dbt project. Loaded with dbt seed. Use for static mapping tables, country codes, or config data.
- Models: SQL SELECT statements that create tables or views. Referenced with ref(). The core of your dbt project.
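A fourth object type, snapshots, appeared in the project layout above: they capture SCD Type 2 history for mutable source tables. A sketch of snp_customers.sql, assuming the customers source table carries an updated_at column:

```sql
-- snapshots/snp_customers.sql
{% snapshot snp_customers %}

{{ config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',     -- detect changes via a timestamp column
    updated_at='updated_at'
) }}

SELECT * FROM {{ source('app_database', 'customers') }}

{% endsnapshot %}
```

Each dbt snapshot run compares current rows against the stored history and closes out changed rows by stamping dbt_valid_from / dbt_valid_to, so you can reconstruct what a customer record looked like at any point in time.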
Key Takeaways
- dbt is the standard tool for SQL-based data transformations in modern data warehouses
- Models are SQL SELECT statements; dbt handles DDL (CREATE TABLE/VIEW) automatically
- Use the ref() function to create dependencies between models — dbt builds the execution DAG
- Follow the staging/intermediate/marts layer pattern for clean, maintainable transformations
- Choose materialization based on size and query frequency: views for staging, tables for marts, incremental for large tables
- Jinja templating adds dynamic logic to SQL, but keep it simple — readability is paramount
- dbt brings software engineering practices to analytics: Git, testing, documentation, and CI/CD