TechLead
Lesson 17 of 22
6 min read
Data Engineering

Advanced SQL: CTEs and Recursive Queries

Master Common Table Expressions, recursive CTEs for hierarchical data, and advanced query composition patterns

Common Table Expressions (CTEs)

Common Table Expressions (CTEs) are named, temporary result sets defined with the WITH keyword. They make complex queries readable, maintainable, and self-documenting by breaking them into logical steps. CTEs are the foundation of well-written analytical SQL and are used extensively in dbt models, pipeline transformations, and ad-hoc analysis.

Think of CTEs as creating named "building blocks" that you assemble into a final query. Each CTE can reference CTEs defined before it, creating a clear, top-to-bottom flow of data transformations. This is far more readable than deeply nested subqueries and makes debugging much easier — you can run any CTE independently to inspect its output.

-- CTE basics: Named result sets that compose into a final query
WITH daily_orders AS (
    -- Step 1: Aggregate orders by date
    SELECT
        order_date,
        COUNT(DISTINCT order_id) AS order_count,
        SUM(total_amount) AS revenue,
        COUNT(DISTINCT customer_id) AS unique_customers
    FROM analytics.fct_orders
    WHERE order_date >= '2025-01-01'
    GROUP BY order_date
),

daily_with_prev AS (
    -- Step 2: Add previous day's values for comparison
    SELECT
        *,
        LAG(revenue) OVER (ORDER BY order_date) AS prev_day_revenue,
        LAG(order_count) OVER (ORDER BY order_date) AS prev_day_orders
    FROM daily_orders
),

final AS (
    -- Step 3: Calculate growth metrics
    SELECT
        order_date,
        order_count,
        revenue,
        unique_customers,
        revenue - prev_day_revenue AS revenue_change,
        ROUND((revenue - prev_day_revenue) / NULLIF(prev_day_revenue, 0) * 100, 1) AS revenue_growth_pct,
        ROUND(revenue / NULLIF(unique_customers, 0), 2) AS revenue_per_customer
    FROM daily_with_prev
)

SELECT * FROM final ORDER BY order_date DESC;

CTE Patterns for Data Engineering

The Import-Transform-Export Pattern

This is the standard pattern used in dbt models — import your sources, transform step by step, then select the final result:

-- Import CTEs: pull in source data (one CTE per source)
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

customers AS (
    SELECT * FROM {{ ref('stg_customers') }}
),

products AS (
    SELECT * FROM {{ ref('stg_products') }}
),

-- Transform CTEs: apply business logic step by step
enriched_orders AS (
    SELECT
        o.order_id,
        o.order_date,
        o.customer_id,
        c.customer_name,
        c.segment,
        c.country,
        o.product_id,
        p.product_name,
        p.category,
        o.quantity,
        o.unit_price,
        o.total_amount
    FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    LEFT JOIN products p ON o.product_id = p.product_id
),

with_metrics AS (
    SELECT
        *,
        SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS customer_cumulative_spend,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS customer_order_number,
        CASE
            WHEN total_amount > 500 THEN 'high'
            WHEN total_amount > 100 THEN 'medium'
            ELSE 'low'
        END AS value_tier
    FROM enriched_orders
),

-- Export CTE: final select
final AS (
    SELECT * FROM with_metrics
    WHERE total_amount > 0
)

SELECT * FROM final

Cohort Analysis with CTEs

-- Retention cohort analysis: track customer retention by signup month
WITH customer_first_order AS (
    -- Find each customer's first order month (cohort assignment)
    SELECT
        customer_id,
        DATE_TRUNC('month', MIN(order_date)) AS cohort_month
    FROM analytics.fct_orders
    GROUP BY customer_id
),

customer_activities AS (
    -- Track each customer's activity months
    SELECT DISTINCT
        o.customer_id,
        c.cohort_month,
        DATE_TRUNC('month', o.order_date) AS activity_month,
        DATEDIFF('month', c.cohort_month, DATE_TRUNC('month', o.order_date)) AS months_since_first
    FROM analytics.fct_orders o
    JOIN customer_first_order c ON o.customer_id = c.customer_id
),

cohort_sizes AS (
    -- Count customers in each cohort
    SELECT cohort_month, COUNT(DISTINCT customer_id) AS cohort_size
    FROM customer_first_order
    GROUP BY cohort_month
),

retention AS (
    -- Calculate retention rate for each cohort at each month
    SELECT
        a.cohort_month,
        a.months_since_first,
        COUNT(DISTINCT a.customer_id) AS active_customers,
        s.cohort_size,
        ROUND(COUNT(DISTINCT a.customer_id)::FLOAT / s.cohort_size * 100, 1) AS retention_pct
    FROM customer_activities a
    JOIN cohort_sizes s ON a.cohort_month = s.cohort_month
    GROUP BY a.cohort_month, a.months_since_first, s.cohort_size
)

SELECT * FROM retention
ORDER BY cohort_month, months_since_first;

Recursive CTEs

Recursive CTEs allow a CTE to reference itself, enabling queries over hierarchical or graph-like data structures. They are essential for traversing org charts, category trees, bill-of-materials, and any parent-child relationships stored in a single table.

-- Recursive CTE: Traverse an organizational hierarchy
-- Table: employees (id, name, title, manager_id)

WITH RECURSIVE org_tree AS (
    -- Base case: Start with the CEO (no manager)
    SELECT
        id,
        name,
        title,
        manager_id,
        0 AS level,
        name AS path
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive case: Find all direct reports
    SELECT
        e.id,
        e.name,
        e.title,
        e.manager_id,
        t.level + 1 AS level,
        t.path || ' > ' || e.name AS path
    FROM employees e
    JOIN org_tree t ON e.manager_id = t.id
)

SELECT
    REPEAT('  ', level) || name AS indented_name,
    title,
    level,
    path
FROM org_tree
ORDER BY path;

-- Output:
-- Alice            CEO              0  Alice
--   Bob            VP Engineering   1  Alice > Bob
--     Charlie      Staff Engineer   2  Alice > Bob > Charlie
--     Diana        Senior Engineer  2  Alice > Bob > Diana
--   Eve            VP Marketing     1  Alice > Eve
--     Frank        Marketing Lead   2  Alice > Eve > Frank

Recursive CTE: Category Breadcrumbs

-- Build full category paths from a self-referential categories table
-- Table: categories (id, name, parent_id)

WITH RECURSIVE category_paths AS (
    -- Root categories (no parent)
    SELECT
        id,
        name,
        parent_id,
        name::TEXT AS full_path,
        1 AS depth
    FROM categories
    WHERE parent_id IS NULL

    UNION ALL

    -- Child categories
    SELECT
        c.id,
        c.name,
        c.parent_id,
        cp.full_path || ' > ' || c.name AS full_path,
        cp.depth + 1
    FROM categories c
    JOIN category_paths cp ON c.parent_id = cp.id
)

SELECT id, name, full_path, depth
FROM category_paths
ORDER BY full_path;

-- Output:
-- 1  Electronics           Electronics                            1
-- 5  Computers             Electronics > Computers                2
-- 9  Laptops               Electronics > Computers > Laptops      3
-- 10 Desktops              Electronics > Computers > Desktops     3
-- 6  Audio                 Electronics > Audio                    2

Recursive CTE: Date Spine Generation

-- Generate a continuous date series (useful for filling gaps in time-series)
WITH RECURSIVE date_spine AS (
    SELECT DATE '2025-01-01' AS date_day

    UNION ALL

    SELECT date_day + INTERVAL '1 day'
    FROM date_spine
    WHERE date_day < DATE '2025-12-31'
)

-- Use the date spine to fill gaps in daily revenue
SELECT
    d.date_day,
    COALESCE(r.revenue, 0) AS revenue,
    COALESCE(r.order_count, 0) AS order_count
FROM date_spine d
LEFT JOIN daily_revenue r ON d.date_day = r.order_date
ORDER BY d.date_day;

Advanced CTE Techniques

-- Materialized CTE hint (supported in some databases)
-- Forces intermediate result to be stored temporarily
WITH customer_orders AS MATERIALIZED (
    SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
)
SELECT * FROM customer_orders WHERE order_count > 5
UNION ALL
SELECT * FROM customer_orders WHERE total_spent > 1000;

-- Chained CTEs for funnel analysis
WITH page_views AS (
    SELECT DISTINCT user_id FROM events WHERE event_type = 'page_view'
),
add_to_cart AS (
    SELECT DISTINCT user_id FROM events WHERE event_type = 'add_to_cart'
),
checkout AS (
    SELECT DISTINCT user_id FROM events WHERE event_type = 'checkout'
),
purchase AS (
    SELECT DISTINCT user_id FROM events WHERE event_type = 'purchase'
),
funnel AS (
    SELECT
        (SELECT COUNT(*) FROM page_views) AS viewed,
        (SELECT COUNT(*) FROM add_to_cart) AS added,
        (SELECT COUNT(*) FROM checkout) AS checked_out,
        (SELECT COUNT(*) FROM purchase) AS purchased
)
SELECT
    viewed,
    added,
    ROUND(added::FLOAT / viewed * 100, 1) AS view_to_cart_pct,
    checked_out,
    ROUND(checked_out::FLOAT / added * 100, 1) AS cart_to_checkout_pct,
    purchased,
    ROUND(purchased::FLOAT / checked_out * 100, 1) AS checkout_to_purchase_pct,
    ROUND(purchased::FLOAT / viewed * 100, 1) AS overall_conversion_pct
FROM funnel;

Key Takeaways

  • CTEs make complex queries readable by breaking them into named, logical steps
  • Follow the import-transform-export pattern used in dbt: source CTEs first, then transformations, then final select
  • Recursive CTEs traverse hierarchical data (org charts, category trees) by referencing themselves
  • Date spines fill gaps in time-series data — essential for accurate daily/weekly/monthly reports
  • CTEs are not automatically materialized — use explicit MATERIALIZED hints or temp tables for performance-critical intermediate results
  • Most dbt models are structured as a series of CTEs — mastering CTEs is mastering dbt

Continue Learning