TechLead
Lesson 14 of 22
Data Engineering

dbt Fundamentals

Learn dbt (data build tool) for SQL-based transformations, project structure, materialization strategies, and the transformation workflow

What is dbt?

dbt (data build tool) is the standard tool for managing SQL-based data transformations in modern data warehouses. It enables data teams to write modular SQL models, test data quality, generate documentation, and manage dependencies — all using software engineering best practices like version control, code review, and CI/CD. dbt handles the "T" in ELT, transforming raw data that has already been loaded into your warehouse.

dbt fundamentally changed how data teams work by bringing software engineering discipline to analytics. Before dbt, transformations lived in stored procedures, ad-hoc scripts, or GUI-based ETL tools with no version control and no testing. dbt treats SQL transformations as code — versioned in Git, tested automatically, documented with descriptions and lineage graphs, and deployed through CI/CD pipelines.

Why dbt?

  • SQL-First: Write transformations in SQL — the language your analysts already know. No need to learn Python, Scala, or proprietary languages.
  • Version Controlled: All models live in Git. Review changes via pull requests. See the history of every transformation.
  • Tested: Built-in testing framework validates data quality — not null checks, uniqueness, referential integrity, and custom assertions.
  • Documented: Auto-generated documentation includes model descriptions, column definitions, and a visual lineage graph showing data flow.
  • Modular: Models reference each other using ref(), creating a dependency DAG that dbt executes in the correct order.
  • Environment-Aware: Same code runs in dev, staging, and production with different targets — no manual schema changes.

dbt Project Structure

# Standard dbt project layout
my_dbt_project/
  dbt_project.yml          # Project configuration
  profiles.yml             # Connection profiles (local only, not in Git)
  packages.yml             # External dbt packages
  models/
    staging/               # Layer 1: Clean raw data
      _stg_models.yml      # Schema definitions and tests
      stg_orders.sql
      stg_customers.sql
      stg_products.sql
    intermediate/          # Layer 2: Business logic
      _int_models.yml
      int_orders_enriched.sql
    marts/                 # Layer 3: Final tables for consumers
      _marts_models.yml
      fct_orders.sql
      dim_customer.sql
      dim_product.sql
      dim_date.sql
    metrics/               # Layer 4: Semantic layer definitions
      revenue_metrics.yml
  tests/                   # Custom data tests
    assert_positive_revenue.sql
  macros/                  # Reusable SQL functions
    generate_schema_name.sql
    cents_to_dollars.sql
  seeds/                   # Static CSV data (country codes, mappings)
    country_codes.csv
  snapshots/               # SCD Type 2 tracking
    snp_customers.sql
  analyses/                # Ad-hoc analytical queries
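
The `_stg_models.yml` files in the layout above hold model descriptions and tests. A minimal sketch for stg_orders (column names follow the staging model later in this lesson):

```yaml
# models/staging/_stg_models.yml -- a minimal sketch
version: 2
models:
  - name: stg_orders
    description: "Orders cleaned and renamed from the raw app database"
    columns:
      - name: order_id
        description: "Primary key for orders"
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```

Running dbt test executes every test declared here and reports any failing rows.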

dbt Project Configuration

# dbt_project.yml
name: 'analytics'
version: '1.0.0'
config-version: 2
profile: 'analytics'

model-paths: ["models"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
analysis-paths: ["analyses"]

# Materialization defaults per folder
models:
  analytics:
    staging:
      +materialized: view        # Views are rebuilt on each query
      +schema: staging
    intermediate:
      +materialized: ephemeral   # CTEs, not persisted
    marts:
      +materialized: table       # Physical tables for performance
      +schema: analytics
      +tags: ['daily']

# Variables for dynamic SQL
vars:
  start_date: '2024-01-01'
  default_country: 'US'
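
The vars declared above can be read inside any model with the var() function. A minimal sketch:

```sql
-- Using a project variable in a model (a sketch)
SELECT *
FROM {{ ref('stg_orders') }}
WHERE order_date >= '{{ var("start_date") }}'
```
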

# profiles.yml (in ~/.dbt/, NOT in the Git repo)
analytics:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: xy12345.us-east-1
      user: dev_user
      password: "{{ env_var('SNOWFLAKE_DEV_PASSWORD') }}"
      role: DEV_ROLE
      database: ANALYTICS_DEV
      warehouse: DEV_WH
      schema: "dbt_{{ env_var('USER') }}"  # Each dev gets their own schema
      threads: 4

    prod:
      type: snowflake
      account: xy12345.us-east-1
      user: dbt_prod_user
      password: "{{ env_var('SNOWFLAKE_PROD_PASSWORD') }}"
      role: PROD_ROLE
      database: ANALYTICS
      warehouse: PROD_WH
      schema: analytics
      threads: 8
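
Switching between the dev and prod outputs above happens at invocation time with the --target flag, for example:

```shell
# Run against prod explicitly (dev is the default target here)
dbt run --target prod

# Verify that a profile/target can connect
dbt debug --target dev
```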

Writing dbt Models

A dbt model is simply a SQL SELECT statement in a .sql file. dbt wraps it with the appropriate DDL (CREATE TABLE, CREATE VIEW, or MERGE) based on the materialization strategy. Models reference each other using the ref() function, which dbt uses to build the dependency graph.

-- models/staging/stg_orders.sql
-- Staging model: clean and rename raw source data

WITH source AS (
    SELECT * FROM {{ source('app_database', 'orders') }}
),

renamed AS (
    SELECT
        id AS order_id,
        user_id AS customer_id,
        product_id,
        qty AS quantity,
        price_cents / 100.0 AS unit_price,
        (qty * price_cents) / 100.0 AS total_amount,
        CAST(created_at AS DATE) AS order_date,
        UPPER(TRIM(status)) AS status,
        created_at,
        updated_at,
        _fivetran_synced AS loaded_at
    FROM source
    WHERE id IS NOT NULL
      AND qty > 0
)

SELECT * FROM renamed
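
Because this model's materialization is view (the staging default configured above), dbt wraps the compiled SELECT in DDL roughly like the following; the database and schema names here are illustrative:

```sql
-- Approximate DDL dbt issues for a view materialization (names illustrative)
CREATE OR REPLACE VIEW analytics_dev.dbt_alice_staging.stg_orders AS (
    SELECT *  -- the compiled query, with {{ source() }} resolved to the physical table
    FROM raw.app_database.orders
);
```
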

-- models/marts/fct_orders.sql
-- Final fact table referencing staging models

WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

customers AS (
    SELECT * FROM {{ ref('stg_customers') }}
),

products AS (
    SELECT * FROM {{ ref('stg_products') }}
),

final AS (
    SELECT
        o.order_id,
        o.order_date,
        DATE_TRUNC('month', o.order_date) AS order_month,

        -- Customer attributes
        o.customer_id,
        c.customer_name,
        c.country,
        c.segment AS customer_segment,

        -- Product attributes
        o.product_id,
        p.product_name,
        p.category,

        -- Measures
        o.quantity,
        o.unit_price,
        o.total_amount,
        p.cost * o.quantity AS total_cost,
        o.total_amount - (p.cost * o.quantity) AS profit,

        -- Derived
        CASE
            WHEN o.total_amount > 500 THEN 'premium'
            WHEN o.total_amount > 100 THEN 'standard'
            ELSE 'basic'
        END AS order_tier,

        o.status
    FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    LEFT JOIN products p ON o.product_id = p.product_id
    WHERE o.status != 'CANCELLED'
)

SELECT * FROM final

Materialization Strategies

Strategy      What It Creates                     When to Use
view          SQL VIEW                            Staging models, lightweight transforms, data that does not need to be persisted
table         Physical TABLE (drop + recreate)    Final marts, heavily queried models, complex transforms
incremental   TABLE with MERGE/INSERT             Large tables where a full rebuild is too slow: only new/changed rows are processed
ephemeral     CTE (no DB object)                  Intermediate logic that other models reference but that does not need its own table
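
The incremental strategy deserves a concrete example. A sketch of an incremental model using is_incremental() and {{ this }} (column names follow stg_orders above):

```sql
-- models/marts/fct_orders_incremental.sql (a sketch)
{{ config(
    materialized='incremental',
    unique_key='order_id'
) }}

SELECT
    order_id,
    customer_id,
    total_amount,
    order_date,
    updated_at
FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
-- On incremental runs, only process rows newer than what is already loaded
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
```

On the first run (or with --full-refresh) the filter is skipped and the full table is built; on subsequent runs dbt merges only new or changed rows on the unique key.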

Running dbt

# Run all models
dbt run

# Run specific model and its upstream dependencies
dbt run --select +fct_orders

# Run all models with a tag
dbt run --select tag:daily

# Run tests
dbt test

# Generate documentation
dbt docs generate
dbt docs serve

# Full refresh of incremental models
dbt run --full-refresh --select incremental_model

# Compile SQL without executing (debug)
dbt compile --select fct_orders

# Selector syntax examples
dbt run --select staging.*            # All models in staging folder
dbt run --select +fct_orders          # fct_orders and all upstream
dbt run --select fct_orders+          # fct_orders and all downstream
dbt run --select +fct_orders+         # Full lineage: upstream + downstream
dbt run --select @fct_orders          # fct_orders + parents + children of parents
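
The tests/assert_positive_revenue.sql file from the project layout is a singular test: a query that returns the rows violating the assertion, so dbt test fails if it returns anything. A sketch:

```sql
-- tests/assert_positive_revenue.sql (a sketch)
-- dbt test fails if any rows come back
SELECT
    order_id,
    total_amount
FROM {{ ref('fct_orders') }}
WHERE total_amount <= 0
```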

Jinja Templating in dbt

dbt uses Jinja templating to add dynamic logic to SQL. This enables conditional SQL, loops, and variable interpolation. While powerful, Jinja should be used sparingly — the goal is readable SQL, not a programming language embedded in templates.

-- Conditional logic with Jinja: in dev, limit data for faster iteration;
-- in prod, process all data
SELECT
    order_id,
    customer_id,
    total_amount,
    order_date
FROM {{ ref('stg_orders') }}
{% if target.name == 'dev' %}
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
{% endif %}

-- Looping: pivot payment amounts into one column per method
{% set payment_methods = ['credit_card', 'paypal', 'bank_transfer', 'apple_pay'] %}

SELECT
    order_id,
    {% for method in payment_methods %}
    SUM(CASE WHEN payment_method = '{{ method }}' THEN amount ELSE 0 END) AS {{ method }}_amount
    {% if not loop.last %},{% endif %}
    {% endfor %}
FROM payments
GROUP BY order_id
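
The macros folder in the project layout includes cents_to_dollars.sql; a macro is just a named, reusable Jinja snippet. A hypothetical implementation:

```sql
-- macros/cents_to_dollars.sql (hypothetical implementation)
{% macro cents_to_dollars(column_name, precision=2) %}
    ROUND({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
```

In a model it is called like a function: SELECT {{ cents_to_dollars('price_cents') }} AS unit_price.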

dbt Sources vs Seeds vs Models

  • Sources: External tables already in your warehouse (loaded by Fivetran, Airbyte, etc.). Referenced with source(). Support freshness checks.
  • Seeds: Small CSV files version-controlled in your dbt project. Loaded with dbt seed. Use for static mapping tables, country codes, or config data.
  • Models: SQL SELECT statements that create tables or views. Referenced with ref(). The core of your dbt project.
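
Sources are declared in YAML. A sketch of how the app_database source used by stg_orders might be declared, including a freshness check (database and schema names are assumptions):

```yaml
# models/staging/_sources.yml (a sketch)
version: 2
sources:
  - name: app_database
    database: raw
    schema: app
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
      - name: customers
      - name: products
```

dbt source freshness compares loaded_at_field against the current time and warns or errors once the thresholds are exceeded.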

Key Takeaways

  • dbt is the standard tool for SQL-based data transformations in modern data warehouses
  • Models are SQL SELECT statements; dbt handles DDL (CREATE TABLE/VIEW) automatically
  • Use the ref() function to create dependencies between models — dbt builds the execution DAG
  • Follow the staging/intermediate/marts layer pattern for clean, maintainable transformations
  • Choose materialization based on size and query frequency: views for staging, tables for marts, incremental for large tables
  • Jinja templating adds dynamic logic to SQL, but keep it simple — readability is paramount
  • dbt brings software engineering practices to analytics: Git, testing, documentation, and CI/CD
