What is Data Modeling?
Data modeling is the process of creating a structured representation of data and its relationships within a system. In data engineering, data modeling specifically focuses on designing schemas for analytical workloads — optimizing for query performance, ease of understanding, and maintainability. A good data model is the difference between a warehouse where analysts can self-serve and one where every question requires an engineer to write a custom query.
Data modeling for analytics differs significantly from modeling for operational (OLTP) systems. Operational databases optimize for fast reads and writes of individual records, favoring normalization to avoid data anomalies. Analytical databases optimize for scanning and aggregating large volumes of data, often favoring denormalization to reduce the number of joins needed for common queries.
OLTP vs OLAP Modeling
- OLTP (Operational): Normalized (3NF), optimized for individual record reads/writes, small transactions, low latency. Used by applications.
- OLAP (Analytical): Denormalized (star/snowflake schema), optimized for aggregation queries over large datasets, read-heavy. Used by analysts and dashboards.
Normalization
Normalization organizes data to minimize redundancy and prevent update anomalies. Data is split into multiple tables linked by foreign keys. While great for operational databases, heavy normalization in analytical systems leads to excessive joins and poor query performance.
-- Normalized schema (3NF): No data redundancy but requires many joins
-- Countries table (avoids repeating country names); created first
-- so the customers table can reference it
CREATE TABLE countries (
    country_id SERIAL PRIMARY KEY,
    country_name VARCHAR(100) NOT NULL,
    region VARCHAR(50) NOT NULL
);
-- Customers table
CREATE TABLE customers (
    customer_id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    country_id INT REFERENCES countries(country_id)
);
-- Orders table
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INT REFERENCES customers(customer_id),
order_date DATE NOT NULL,
status VARCHAR(20) NOT NULL
);
-- Products table (minimal definition so the foreign key in
-- order_items resolves; columns match the wide table later on)
CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    product_name VARCHAR(200) NOT NULL,
    category VARCHAR(100)
);
-- Order items (line items separate from orders)
CREATE TABLE order_items (
    item_id SERIAL PRIMARY KEY,
    order_id INT REFERENCES orders(order_id),
    product_id INT REFERENCES products(product_id),
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL
);
-- Getting "revenue by country" requires joining 4 tables:
SELECT
co.country_name,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
JOIN customers c ON o.customer_id = c.customer_id
JOIN countries co ON c.country_id = co.country_id
WHERE o.order_date >= '2025-01-01'
GROUP BY co.country_name;
Denormalization
Denormalization intentionally adds redundancy to reduce joins and improve query performance. In analytical contexts, this is usually the right trade-off because data is written once (by pipelines) and read many times (by analysts and dashboards). The cost of storing extra copies of data is far less than the cost of complex multi-table joins on every query.
-- Denormalized: Wide table with pre-joined data
-- Optimized for analytical queries — single table scan
CREATE TABLE analytics.orders_wide (
order_id BIGINT PRIMARY KEY,
order_date DATE NOT NULL,
order_month VARCHAR(7), -- '2025-03'
status VARCHAR(20),
-- Customer fields (denormalized from customers table)
customer_id BIGINT,
customer_name VARCHAR(100),
customer_email VARCHAR(255),
customer_country VARCHAR(100),
customer_region VARCHAR(50),
-- Product fields (denormalized from products table)
product_id BIGINT,
product_name VARCHAR(200),
product_category VARCHAR(100),
-- Metrics
quantity INT,
unit_price DECIMAL(10,2),
total_amount DECIMAL(12,2),
discount_amount DECIMAL(10,2) DEFAULT 0
);
-- Same query: "revenue by country" — now a simple single-table scan!
SELECT
customer_country,
SUM(total_amount) AS revenue
FROM analytics.orders_wide
WHERE order_date >= '2025-01-01'
GROUP BY customer_country;
-- No joins needed. Columnar warehouses only read the columns
-- actually referenced in the query, making this very fast.
Modeling Approaches Compared
| Approach | Structure | Pros | Cons |
|---|---|---|---|
| 3NF Normalized | Many tables, no redundancy | No anomalies, DRY | Slow queries, many joins |
| Star Schema | Fact + dimension tables | Intuitive, good performance | Some redundancy in dimensions |
| Wide Table | Single denormalized table | Fastest queries, simplest | High redundancy, harder updates |
| Data Vault | Hubs, links, satellites | Flexible, auditable, scalable | Complex, steep learning curve |
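Since the star schema is the default choice for most warehouses, it is worth seeing in full. Below is a minimal sketch for the same ordering domain; table and column names are illustrative, not taken from any particular system. One central fact table holds the measurable events, and each dimension table holds the descriptive attributes analysts filter and group by.
-- Star schema sketch: one fact table, two dimensions
-- Dimension: customer attributes (one row per customer)
CREATE TABLE analytics.dim_customer (
    customer_key BIGINT PRIMARY KEY,  -- surrogate key
    customer_id BIGINT NOT NULL,      -- business key from the source
    customer_name VARCHAR(100),
    country VARCHAR(100),
    region VARCHAR(50)
);
-- Dimension: product attributes (one row per product)
CREATE TABLE analytics.dim_product (
    product_key BIGINT PRIMARY KEY,
    product_id BIGINT NOT NULL,
    product_name VARCHAR(200),
    category VARCHAR(100)
);
-- Fact: one row per order line, keyed to the dimensions
CREATE TABLE analytics.fct_order_items (
    order_id BIGINT NOT NULL,
    customer_key BIGINT REFERENCES analytics.dim_customer(customer_key),
    product_key BIGINT REFERENCES analytics.dim_product(product_key),
    order_date DATE NOT NULL,
    quantity INT NOT NULL,
    total_amount DECIMAL(12,2) NOT NULL
);
-- "Revenue by country" now needs exactly one join:
SELECT d.country, SUM(f.total_amount) AS revenue
FROM analytics.fct_order_items f
JOIN analytics.dim_customer d ON f.customer_key = d.customer_key
WHERE f.order_date >= '2025-01-01'
GROUP BY d.country;
Compared with the 3NF version, this sits between the fully normalized schema (four joins) and the wide table (zero joins): one join per dimension, with redundancy confined to the dimension tables.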
Entity-Relationship Modeling
Before writing any SQL, map out the entities (the "things" in your domain) and the relationships between them. This conceptual model guides your physical schema design. The key relationship types are listed below, with a SQL sketch after the list:
- One-to-One: Each customer has one profile. Model as columns on the same table or a separate table with a unique foreign key.
- One-to-Many: Each customer has many orders. The "many" side (orders) holds a foreign key to the "one" side (customers).
- Many-to-Many: Orders contain many products; products appear in many orders. Requires a junction table (order_items).
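Here is how these map to physical tables. The one-to-one profile table is an illustrative sketch; the one-to-many and many-to-many cases reuse the normalized schema above.
-- One-to-one: a separate table whose primary key is also a
-- unique foreign key to the "one" side
CREATE TABLE customer_profiles (
    customer_id INT PRIMARY KEY REFERENCES customers(customer_id),
    bio TEXT,
    avatar_url VARCHAR(500)
);
-- One-to-many: the "many" side holds the foreign key,
-- as in orders.customer_id above
-- Many-to-many: resolved by a junction table with one row per pair,
-- as in order_items (order_id, product_id) above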
The dbt Modeling Pattern
Modern analytics engineering uses a layered approach to data modeling, popularized by dbt:
-- Layer 1: Staging (stg_) — Clean and rename raw data
-- One staging model per source table
-- models/staging/stg_orders.sql
SELECT
id AS order_id,
customer_id,
product_id,
qty AS quantity,
price AS unit_price,
qty * price AS total_amount,
CAST(created_at AS DATE) AS order_date,
status,
_loaded_at
FROM raw.app_orders
WHERE id IS NOT NULL;
-- Layer 2: Intermediate (int_) — Business logic and joins
-- models/intermediate/int_orders_enriched.sql
SELECT
o.order_id,
o.customer_id,
c.customer_name,
c.country,
c.segment,
o.product_id,
p.product_name,
p.category,
o.quantity,
o.unit_price,
o.total_amount,
o.order_date,
o.status
FROM staging.stg_orders o
LEFT JOIN staging.stg_customers c ON o.customer_id = c.customer_id
LEFT JOIN staging.stg_products p ON o.product_id = p.product_id;
-- Layer 3: Marts (fct_, dim_) — Final business-facing tables
-- models/marts/fct_orders.sql
SELECT
order_id,
customer_id,
product_id,
quantity,
unit_price,
total_amount,
order_date,
DATE_TRUNC('month', order_date) AS order_month,
status,
CASE
WHEN total_amount > 500 THEN 'premium'
WHEN total_amount > 100 THEN 'standard'
ELSE 'basic'
END AS order_tier
FROM intermediate.int_orders_enriched
WHERE status != 'cancelled';
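Marts typically pair fact tables with dimension tables. A companion dimension model might look like the sketch below, assuming stg_customers exposes these fields (only customer_name, country, and segment appear in the models above; the rest are illustrative):
-- models/marts/dim_customer.sql
SELECT
    customer_id,
    customer_name,
    country,
    segment,
    _loaded_at
FROM staging.stg_customers;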
Data Vault Modeling
Data Vault is an alternative modeling approach designed for enterprise data warehouses that must absorb rapidly changing source systems while retaining full history and auditable data lineage. It uses three types of tables: Hubs (business keys), Links (relationships between hubs), and Satellites (descriptive attributes with history). Data Vault is more complex than a star schema but excels in environments where source systems change frequently.
-- Data Vault: Hub, Link, Satellite pattern
-- Hub: Stores unique business keys
CREATE TABLE vault.hub_customer (
hub_customer_key CHAR(32) PRIMARY KEY, -- MD5 hash of business key
customer_id BIGINT NOT NULL, -- Business key
load_date TIMESTAMP NOT NULL,
record_source VARCHAR(50) NOT NULL -- Which source system
);
-- Satellite: Stores descriptive attributes with history
CREATE TABLE vault.sat_customer_details (
hub_customer_key CHAR(32) NOT NULL,
load_date TIMESTAMP NOT NULL,
load_end_date TIMESTAMP DEFAULT '9999-12-31',
customer_name VARCHAR(200),
email VARCHAR(255),
city VARCHAR(100),
country VARCHAR(100),
record_source VARCHAR(50),
hash_diff CHAR(32), -- Hash of all attributes for change detection
PRIMARY KEY (hub_customer_key, load_date)
);
-- Link: Stores relationships between hubs
CREATE TABLE vault.link_order_customer (
link_order_customer_key CHAR(32) PRIMARY KEY,
hub_order_key CHAR(32) NOT NULL,
hub_customer_key CHAR(32) NOT NULL,
load_date TIMESTAMP NOT NULL,
record_source VARCHAR(50) NOT NULL
);
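The link above references a hub_order_key, so a matching order hub would exist alongside it; a sketch mirroring hub_customer is shown below, followed by a typical "current state" query that filters the satellite on its open-ended load_end_date.
-- Hub for orders (same shape as hub_customer)
CREATE TABLE vault.hub_order (
    hub_order_key CHAR(32) PRIMARY KEY,  -- MD5 hash of business key
    order_id BIGINT NOT NULL,            -- Business key
    load_date TIMESTAMP NOT NULL,
    record_source VARCHAR(50) NOT NULL
);
-- Current view of each customer: only the open satellite record
SELECT
    h.customer_id,
    s.customer_name,
    s.email,
    s.country
FROM vault.hub_customer h
JOIN vault.sat_customer_details s
    ON s.hub_customer_key = h.hub_customer_key
WHERE s.load_end_date = '9999-12-31';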
Naming Conventions
Consistent naming conventions make data models self-documenting and easier to navigate. Every team should agree on naming standards before building their first model:
Recommended Naming Conventions
- Staging models: stg_[source]_[table], e.g., stg_stripe_payments
- Intermediate models: int_[description], e.g., int_orders_enriched
- Fact tables: fct_[business_process], e.g., fct_orders, fct_page_views
- Dimension tables: dim_[entity], e.g., dim_customer, dim_product
- Columns: Use snake_case and descriptive names (total_amount_usd, not amt); suffix with units where relevant
- Boolean columns: Prefix with is_ or has_, e.g., is_active, has_discount
- Date columns: Suffix with _date or _at, e.g., order_date, created_at
Key Takeaways
- Data modeling for analytics optimizes for read performance and ease of understanding, not write performance
- Normalization minimizes redundancy but requires many joins; denormalization adds redundancy but improves query speed
- Star schemas strike a good balance between structure and performance for most analytical use cases
- Wide denormalized tables are fastest for queries but harder to maintain and update
- Data Vault is ideal for complex enterprise environments with rapidly changing sources and strict audit requirements
- Use the dbt layered pattern: staging (clean raw data), intermediate (business logic), marts (final tables)
- Consistent naming conventions (stg_, int_, fct_, dim_) make models self-documenting and navigable