What Are Window Functions?
Window functions perform calculations across a set of rows related to the current row — a "window" of data. Unlike GROUP BY (which collapses rows), window functions retain all rows while adding computed values. They are the most powerful analytical SQL feature and are essential for data engineering tasks like ranking, running totals, gap analysis, deduplication, and time-series comparisons.
Window functions are supported by all major databases (PostgreSQL, Snowflake, BigQuery, Redshift, Databricks) and follow the same SQL standard syntax. Mastering window functions transforms you from someone who writes basic aggregations to someone who can solve complex analytical problems in a single query.
Window Function Syntax
-- General syntax:
-- function_name(args) OVER (
-- [PARTITION BY column1, column2, ...] -- Optional: divide into groups
-- [ORDER BY column3 [ASC|DESC], ...] -- Optional: define row ordering
-- [ROWS/RANGE BETWEEN ... AND ...] -- Optional: define window frame
-- )
-- Simple example: Add a running total alongside each row
SELECT
order_date,
order_id,
total_amount,
SUM(total_amount) OVER (ORDER BY order_date) AS running_total,
AVG(total_amount) OVER (ORDER BY order_date) AS running_avg,
COUNT(*) OVER () AS total_orders -- No ORDER BY = entire result set
FROM orders;
Ranking Functions
Ranking functions assign a rank to each row within a partition. They are essential for deduplication, top-N queries, and identifying extremes within groups.
-- ROW_NUMBER: Unique sequential number per partition
-- RANK: Same rank for ties, skips next rank
-- DENSE_RANK: Same rank for ties, does not skip
SELECT
customer_id,
order_id,
total_amount,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY total_amount DESC) AS rn,
RANK() OVER (PARTITION BY customer_id ORDER BY total_amount DESC) AS rnk,
DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY total_amount DESC) AS dense_rnk
FROM orders;
-- Example output for customer 501:
-- | customer_id | order_id | amount | rn | rnk | dense_rnk |
-- | 501 | 1003 | 200.00 | 1 | 1 | 1 |
-- | 501 | 1007 | 200.00 | 2 | 1 | 1 | -- Tie!
-- | 501 | 1001 | 150.00 | 3 | 3 | 2 |
-- | 501 | 1005 | 100.00 | 4 | 4 | 3 |
-- DEDUPLICATION: Keep only the most recent record per customer
-- This is one of the most common data engineering patterns
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY updated_at DESC
) AS rn
FROM raw.customers
)
SELECT * FROM ranked WHERE rn = 1;
-- TOP-N: Find each customer's 3 most expensive orders
WITH ranked_orders AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY total_amount DESC
) AS order_rank
FROM orders
)
SELECT * FROM ranked_orders WHERE order_rank <= 3;
LAG and LEAD: Row-to-Row Comparisons
LAG accesses a previous row's value; LEAD accesses a future row's value. These are invaluable for calculating changes, growth rates, and detecting patterns across consecutive rows.
-- Month-over-month revenue growth
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(total_amount) AS revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
)
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
revenue - LAG(revenue) OVER (ORDER BY month) AS revenue_change,
ROUND(
(revenue - LAG(revenue) OVER (ORDER BY month))
/ NULLIF(LAG(revenue) OVER (ORDER BY month), 0) * 100, 1
) AS growth_pct
FROM monthly_revenue
ORDER BY month;
-- Customer purchase frequency: days between orders
SELECT
customer_id,
order_date,
total_amount,
LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_order_date,
order_date - LAG(order_date) OVER (
PARTITION BY customer_id ORDER BY order_date
) AS days_between_orders,
LEAD(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS next_order_date
FROM orders
ORDER BY customer_id, order_date;
Running Totals and Moving Averages
-- Running total: cumulative revenue over time
SELECT
order_date,
daily_revenue,
SUM(daily_revenue) OVER (ORDER BY order_date) AS cumulative_revenue,
SUM(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total -- Explicit frame (same result)
FROM daily_revenue_summary;
-- 7-day moving average: smooth out daily fluctuations
SELECT
order_date,
daily_revenue,
AVG(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7d,
-- 30-day moving average
AVG(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
) AS moving_avg_30d
FROM daily_revenue_summary;
-- Percentage of total: each category's share of revenue
SELECT
category,
SUM(total_amount) AS category_revenue,
SUM(total_amount) / SUM(SUM(total_amount)) OVER () * 100 AS pct_of_total,
-- Cumulative percentage (for Pareto analysis)
SUM(SUM(total_amount)) OVER (ORDER BY SUM(total_amount) DESC)
/ SUM(SUM(total_amount)) OVER () * 100 AS cumulative_pct
FROM orders
GROUP BY category
ORDER BY category_revenue DESC;
FIRST_VALUE, LAST_VALUE, NTH_VALUE
-- FIRST_VALUE: First order amount per customer
-- LAST_VALUE: Most recent order amount per customer
SELECT
customer_id,
order_date,
total_amount,
FIRST_VALUE(total_amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
) AS first_order_amount,
LAST_VALUE(total_amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS latest_order_amount,
-- Note: LAST_VALUE requires explicit frame to include all rows
total_amount - FIRST_VALUE(total_amount) OVER (
PARTITION BY customer_id ORDER BY order_date
) AS change_from_first
FROM orders;
Window Frames
Window frames define exactly which rows are included in the window calculation relative to the current row. Understanding frames is essential for correct running totals and moving averages.
Frame Specifications
- ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: All rows from start to current (running total). This is the default when ORDER BY is specified.
- ROWS BETWEEN 6 PRECEDING AND CURRENT ROW: Current row plus 6 preceding (7-day moving average)
- ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: All rows in the partition (for LAST_VALUE or partition-level aggregates)
- ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING: Current row plus next 3 (forward-looking window)
- RANGE vs ROWS: ROWS counts physical rows; RANGE groups rows with the same ORDER BY value. Use ROWS for predictable results.
Practical Data Engineering Examples
-- Session identification: Group clicks into sessions
-- (new session if gap > 30 minutes)
WITH clicks_with_gap AS (
SELECT
user_id,
event_timestamp,
LAG(event_timestamp) OVER (
PARTITION BY user_id ORDER BY event_timestamp
) AS prev_timestamp,
CASE
WHEN LAG(event_timestamp) OVER (
PARTITION BY user_id ORDER BY event_timestamp
) IS NULL THEN 1
WHEN DATEDIFF('minute', LAG(event_timestamp) OVER (
PARTITION BY user_id ORDER BY event_timestamp
), event_timestamp) > 30 THEN 1
ELSE 0
END AS is_new_session
FROM clickstream
),
sessions AS (
SELECT *,
SUM(is_new_session) OVER (
PARTITION BY user_id
ORDER BY event_timestamp
) AS session_id
FROM clicks_with_gap
)
SELECT
user_id,
session_id,
MIN(event_timestamp) AS session_start,
MAX(event_timestamp) AS session_end,
COUNT(*) AS page_views
FROM sessions
GROUP BY user_id, session_id;
Key Takeaways
- Window functions compute values across related rows without collapsing them — unlike GROUP BY
- ROW_NUMBER is essential for deduplication — the most common data engineering use of window functions
- LAG/LEAD enable row-to-row comparisons: growth rates, gaps, and sequential analysis
- Running totals and moving averages use SUM/AVG with window frames (ROWS BETWEEN)
- Always specify explicit window frames for LAST_VALUE — the default frame does not include all rows
- Window functions are supported identically across PostgreSQL, Snowflake, BigQuery, and Redshift