What is Advanced SQL: Window Functions?

Master SQL window functions including ROW_NUMBER, RANK, LAG/LEAD, running totals, and moving averages for analytical queries

Advanced SQL: Window Functions - Data Engineering Tutorial | TechLead

What Are Window Functions?

Window functions perform calculations across a set of rows related to the current row — a "window" of data. Unlike GROUP BY (which collapses rows), window functions retain all rows while adding computed values. They are the most powerful analytical SQL feature and are essential for data engineering tasks like ranking, running totals, gap analysis, deduplication, and time-series comparisons.

Window functions are supported by all major databases (PostgreSQL, Snowflake, BigQuery, Redshift, Databricks) and follow the same SQL standard syntax. Mastering window functions transforms you from someone who writes basic aggregations to someone who can solve complex analytical problems in a single query.

Window Function Syntax

-- General syntax:
-- function_name(args) OVER (
--   [PARTITION BY column1, column2, ...]   -- Optional: divide into groups
--   [ORDER BY column3 [ASC|DESC], ...]     -- Optional: define row ordering
--   [ROWS/RANGE BETWEEN ... AND ...]       -- Optional: define window frame
-- )

-- Simple example: Add a running total alongside each row
SELECT
    order_date,
    order_id,
    total_amount,
    SUM(total_amount) OVER (ORDER BY order_date) AS running_total,
    AVG(total_amount) OVER (ORDER BY order_date) AS running_avg,
    COUNT(*) OVER () AS total_orders  -- No ORDER BY = entire result set
FROM orders;

Ranking Functions

Ranking functions assign a rank to each row within a partition. They are essential for deduplication, top-N queries, and identifying extremes within groups.

-- ROW_NUMBER: Unique sequential number per partition
-- RANK: Same rank for ties, skips next rank
-- DENSE_RANK: Same rank for ties, does not skip

SELECT
    customer_id,
    order_id,
    total_amount,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY total_amount DESC) AS rn,
    RANK()       OVER (PARTITION BY customer_id ORDER BY total_amount DESC) AS rnk,
    DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY total_amount DESC) AS dense_rnk
FROM orders;

-- Example output for customer 501:
-- | customer_id | order_id | amount | rn | rnk | dense_rnk |
-- | 501         | 1003     | 200.00 | 1  | 1   | 1         |
-- | 501         | 1007     | 200.00 | 2  | 1   | 1         |  -- Tie!
-- | 501         | 1001     | 150.00 | 3  | 3   | 2         |
-- | 501         | 1005     | 100.00 | 4  | 4   | 3         |

-- DEDUPLICATION: Keep only the most recent record per customer
-- This is one of the most common data engineering patterns
WITH ranked AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY updated_at DESC
        ) AS rn
    FROM raw.customers
)
SELECT * FROM ranked WHERE rn = 1;

-- TOP-N: Find each customer's 3 most expensive orders
WITH ranked_orders AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY total_amount DESC
        ) AS order_rank
    FROM orders
)
SELECT * FROM ranked_orders WHERE order_rank <= 3;

LAG and LEAD: Row-to-Row Comparisons

LAG accesses a previous row's value; LEAD accesses a future row's value. These are invaluable for calculating changes, growth rates, and detecting patterns across consecutive rows.

-- Month-over-month revenue growth
WITH monthly_revenue AS (
    SELECT
        DATE_TRUNC('month', order_date) AS month,
        SUM(total_amount) AS revenue
    FROM orders
    GROUP BY DATE_TRUNC('month', order_date)
)
SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
    revenue - LAG(revenue) OVER (ORDER BY month) AS revenue_change,
    ROUND(
        (revenue - LAG(revenue) OVER (ORDER BY month))
        / NULLIF(LAG(revenue) OVER (ORDER BY month), 0) * 100, 1
    ) AS growth_pct
FROM monthly_revenue
ORDER BY month;

-- Customer purchase frequency: days between orders
SELECT
    customer_id,
    order_date,
    total_amount,
    LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_order_date,
    order_date - LAG(order_date) OVER (
        PARTITION BY customer_id ORDER BY order_date
    ) AS days_between_orders,
    LEAD(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS next_order_date
FROM orders
ORDER BY customer_id, order_date;

Running Totals and Moving Averages

-- Running total: cumulative revenue over time
SELECT
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (ORDER BY order_date) AS cumulative_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total  -- Explicit frame (same result)
FROM daily_revenue_summary;

-- 7-day moving average: smooth out daily fluctuations
SELECT
    order_date,
    daily_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d,
    -- 30-day moving average
    AVG(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
    ) AS moving_avg_30d
FROM daily_revenue_summary;

-- Percentage of total: each category's share of revenue
SELECT
    category,
    SUM(total_amount) AS category_revenue,
    SUM(total_amount) / SUM(SUM(total_amount)) OVER () * 100 AS pct_of_total,
    -- Cumulative percentage (for Pareto analysis)
    SUM(SUM(total_amount)) OVER (ORDER BY SUM(total_amount) DESC)
        / SUM(SUM(total_amount)) OVER () * 100 AS cumulative_pct
FROM orders
GROUP BY category
ORDER BY category_revenue DESC;

FIRST_VALUE, LAST_VALUE, NTH_VALUE

-- FIRST_VALUE: First order amount per customer
-- LAST_VALUE: Most recent order amount per customer
SELECT
    customer_id,
    order_date,
    total_amount,
    FIRST_VALUE(total_amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
    ) AS first_order_amount,
    LAST_VALUE(total_amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS latest_order_amount,
    -- Note: LAST_VALUE requires explicit frame to include all rows
    total_amount - FIRST_VALUE(total_amount) OVER (
        PARTITION BY customer_id ORDER BY order_date
    ) AS change_from_first
FROM orders;

Window Frames

Window frames define exactly which rows are included in the window calculation relative to the current row. Understanding frames is essential for correct running totals and moving averages.

Frame Specifications

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: All rows from start to current (running total). This is the default when ORDER BY is specified.
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW: Current row plus 6 preceding (7-day moving average)
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: All rows in the partition (for LAST_VALUE or partition-level aggregates)
ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING: Current row plus next 3 (forward-looking window)
RANGE vs ROWS: ROWS counts physical rows; RANGE groups rows with the same ORDER BY value. Use ROWS for predictable results.

Practical Data Engineering Examples

-- Session identification: Group clicks into sessions
-- (new session if gap > 30 minutes)
WITH clicks_with_gap AS (
    SELECT
        user_id,
        event_timestamp,
        LAG(event_timestamp) OVER (
            PARTITION BY user_id ORDER BY event_timestamp
        ) AS prev_timestamp,
        CASE
            WHEN LAG(event_timestamp) OVER (
                PARTITION BY user_id ORDER BY event_timestamp
            ) IS NULL THEN 1
            WHEN DATEDIFF('minute', LAG(event_timestamp) OVER (
                PARTITION BY user_id ORDER BY event_timestamp
            ), event_timestamp) > 30 THEN 1
            ELSE 0
        END AS is_new_session
    FROM clickstream
),
sessions AS (
    SELECT *,
        SUM(is_new_session) OVER (
            PARTITION BY user_id
            ORDER BY event_timestamp
        ) AS session_id
    FROM clicks_with_gap
)
SELECT
    user_id,
    session_id,
    MIN(event_timestamp) AS session_start,
    MAX(event_timestamp) AS session_end,
    COUNT(*) AS page_views
FROM sessions
GROUP BY user_id, session_id;

Key Takeaways

Window functions compute values across related rows without collapsing them — unlike GROUP BY
ROW_NUMBER is essential for deduplication — the most common data engineering use of window functions
LAG/LEAD enable row-to-row comparisons: growth rates, gaps, and sequential analysis
Running totals and moving averages use SUM/AVG with window frames (ROWS BETWEEN)
Always specify explicit window frames for LAST_VALUE — the default frame does not include all rows
Window functions are supported identically across PostgreSQL, Snowflake, BigQuery, and Redshift

Advanced SQL: Window Functions

What Are Window Functions?

Window Function Syntax

Ranking Functions

LAG and LEAD: Row-to-Row Comparisons

Running Totals and Moving Averages

FIRST_VALUE, LAST_VALUE, NTH_VALUE

Window Frames

Frame Specifications

Practical Data Engineering Examples

Key Takeaways

Continue Learning

Python

PostgreSQL

System Design

Cloud & Kubernetes

AI Agents & RAG