Lesson 11 of 22
Data Engineering

Data Modeling Fundamentals

Learn the principles of data modeling for analytics, including normalization, denormalization, and choosing the right model for your use case

What is Data Modeling?

Data modeling is the process of creating a structured representation of data and its relationships within a system. In data engineering, data modeling specifically focuses on designing schemas for analytical workloads — optimizing for query performance, ease of understanding, and maintainability. A good data model is the difference between a warehouse where analysts can self-serve and one where every question requires an engineer to write a custom query.

Data modeling for analytics differs significantly from modeling for operational (OLTP) systems. Operational databases optimize for fast reads and writes of individual records, favoring normalization to avoid data anomalies. Analytical databases optimize for scanning and aggregating large volumes of data, often favoring denormalization to reduce the number of joins needed for common queries.

OLTP vs OLAP Modeling

  • OLTP (Operational): Normalized (3NF), optimized for individual record reads/writes, small transactions, low latency. Used by applications.
  • OLAP (Analytical): Denormalized (star/snowflake schema), optimized for aggregation queries over large datasets, read-heavy. Used by analysts and dashboards.
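
To make the contrast concrete, here is a sketch of the typical query shape each system serves (table and column names are illustrative, not from a specific schema):

-- OLTP: fetch one record by primary key — an index lookup touching a single row
SELECT name, email
FROM customers
WHERE customer_id = 42;

-- OLAP: aggregate over millions of rows — a scan across a large slice of the table
SELECT order_month, SUM(total_amount) AS revenue
FROM orders_wide
GROUP BY order_month;

The OLTP query must return in milliseconds for an application; the OLAP query can take seconds but must stay efficient as the table grows to billions of rows.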

Normalization

Normalization organizes data to minimize redundancy and prevent update anomalies. Data is split into multiple tables linked by foreign keys. While great for operational databases, heavy normalization in analytical systems leads to excessive joins and poor query performance.

-- Normalized schema (3NF): No data redundancy but requires many joins

-- Customers table
CREATE TABLE customers (
    customer_id   SERIAL PRIMARY KEY,
    name          VARCHAR(100) NOT NULL,
    email         VARCHAR(255) UNIQUE NOT NULL,
    country_id    INT REFERENCES countries(country_id)
);

-- Countries table (avoids repeating country names)
CREATE TABLE countries (
    country_id    SERIAL PRIMARY KEY,
    country_name  VARCHAR(100) NOT NULL,
    region        VARCHAR(50) NOT NULL
);

-- Orders table
CREATE TABLE orders (
    order_id      SERIAL PRIMARY KEY,
    customer_id   INT REFERENCES customers(customer_id),
    order_date    DATE NOT NULL,
    status        VARCHAR(20) NOT NULL
);

-- Order items (line items separate from orders)
CREATE TABLE order_items (
    item_id       SERIAL PRIMARY KEY,
    order_id      INT REFERENCES orders(order_id),
    product_id    INT REFERENCES products(product_id),
    quantity      INT NOT NULL,
    unit_price    DECIMAL(10,2) NOT NULL
);

-- To get "revenue by country" requires joining 4 tables:
SELECT
    co.country_name,
    SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
JOIN customers c ON o.customer_id = c.customer_id
JOIN countries co ON c.country_id = co.country_id
WHERE o.order_date >= '2025-01-01'
GROUP BY co.country_name;

Denormalization

Denormalization intentionally adds redundancy to reduce joins and improve query performance. In analytical contexts, this is usually the right trade-off because data is written once (by pipelines) and read many times (by analysts and dashboards). The cost of storing extra copies of data is far less than the cost of complex multi-table joins on every query.

-- Denormalized: Wide table with pre-joined data
-- Optimized for analytical queries — single table scan

CREATE TABLE analytics.orders_wide (
    order_id          BIGINT PRIMARY KEY,
    order_date        DATE NOT NULL,
    order_month       VARCHAR(7),       -- '2025-03'
    status            VARCHAR(20),

    -- Customer fields (denormalized from customers table)
    customer_id       BIGINT,
    customer_name     VARCHAR(100),
    customer_email    VARCHAR(255),
    customer_country  VARCHAR(100),
    customer_region   VARCHAR(50),

    -- Product fields (denormalized from products table)
    product_id        BIGINT,
    product_name      VARCHAR(200),
    product_category  VARCHAR(100),

    -- Metrics
    quantity          INT,
    unit_price        DECIMAL(10,2),
    total_amount      DECIMAL(12,2),
    discount_amount   DECIMAL(10,2) DEFAULT 0
);

-- Same query: "revenue by country" — now a simple single-table scan!
SELECT
    customer_country,
    SUM(total_amount) AS revenue
FROM analytics.orders_wide
WHERE order_date >= '2025-01-01'
GROUP BY customer_country;

-- No joins needed. Columnar warehouses only read the columns
-- actually referenced in the query, making this very fast.

Modeling Approaches Compared

Approach      Structure                    Pros                            Cons
3NF           Normalized, many tables      No redundancy, no anomalies     Slow queries, many joins
Star Schema   Fact + dimension tables      Intuitive, good performance     Some redundancy in dimensions
Wide Table    Single denormalized table    Fastest queries, simplest       High redundancy, harder updates
Data Vault    Hubs, links, satellites      Flexible, auditable, scalable   Complex, steep learning curve
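
The star schema sits between the two extremes shown earlier: facts (measurable events) in one table, descriptive context in surrounding dimension tables. A minimal sketch, with illustrative table and column names:

-- Star schema: one fact table joined to small dimension tables

CREATE TABLE dim_customer (
    customer_key   BIGINT PRIMARY KEY,     -- surrogate key
    customer_name  VARCHAR(100),
    country        VARCHAR(100),
    region         VARCHAR(50)
);

CREATE TABLE fct_orders (
    order_id       BIGINT PRIMARY KEY,
    customer_key   BIGINT REFERENCES dim_customer(customer_key),
    order_date     DATE NOT NULL,
    quantity       INT,
    total_amount   DECIMAL(12,2)
);

-- "Revenue by country" now needs only one join:
SELECT d.country, SUM(f.total_amount) AS revenue
FROM fct_orders f
JOIN dim_customer d ON f.customer_key = d.customer_key
GROUP BY d.country;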

Entity-Relationship Modeling

Before writing any SQL, you should map out the entities (things) in your domain and their relationships. This conceptual model guides your physical schema design. Key relationship types are:

  • One-to-One: Each customer has one profile. Model as columns on the same table or a separate table with a unique foreign key.
  • One-to-Many: Each customer has many orders. The "many" side (orders) holds a foreign key to the "one" side (customers).
  • Many-to-Many: Orders contain many products; products appear in many orders. Requires a junction table (order_items).
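
In SQL, these relationship types map to foreign-key patterns. The one-to-many case (orders) and many-to-many case (order_items) appear in the normalized schema earlier; the one-to-one case looks like this sketch (customer_profiles and its columns are illustrative):

-- One-to-one: a separate table whose foreign key is UNIQUE,
-- so each customer can have at most one profile row
CREATE TABLE customer_profiles (
    profile_id    SERIAL PRIMARY KEY,
    customer_id   INT UNIQUE NOT NULL REFERENCES customers(customer_id),
    bio           TEXT,
    avatar_url    VARCHAR(255)
);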

The dbt Modeling Pattern

Modern analytics engineering uses a layered approach to data modeling, popularized by dbt:

-- Layer 1: Staging (stg_) — Clean and rename raw data
-- One staging model per source table
-- models/staging/stg_orders.sql
SELECT
    id AS order_id,
    customer_id,
    product_id,
    qty AS quantity,
    price AS unit_price,
    qty * price AS total_amount,
    CAST(created_at AS DATE) AS order_date,
    status,
    _loaded_at
FROM raw.app_orders
WHERE id IS NOT NULL;

-- Layer 2: Intermediate (int_) — Business logic and joins
-- models/intermediate/int_orders_enriched.sql
SELECT
    o.order_id,
    o.customer_id,
    c.customer_name,
    c.country,
    c.segment,
    o.product_id,
    p.product_name,
    p.category,
    o.quantity,
    o.unit_price,
    o.total_amount,
    o.order_date,
    o.status
FROM staging.stg_orders o
LEFT JOIN staging.stg_customers c ON o.customer_id = c.customer_id
LEFT JOIN staging.stg_products p ON o.product_id = p.product_id;

-- Layer 3: Marts (fct_, dim_) — Final business-facing tables
-- models/marts/fct_orders.sql
SELECT
    order_id,
    customer_id,
    product_id,
    quantity,
    unit_price,
    total_amount,
    order_date,
    DATE_TRUNC('month', order_date) AS order_month,
    status,
    CASE
        WHEN total_amount > 500 THEN 'premium'
        WHEN total_amount > 100 THEN 'standard'
        ELSE 'basic'
    END AS order_tier
FROM intermediate.int_orders_enriched
WHERE status != 'cancelled';

Data Vault Modeling

Data Vault is an alternative modeling approach designed for enterprise data warehouses that need to handle rapidly changing source systems, full historical tracking, and auditable data lineage. It uses three types of tables: Hubs (business keys), Links (relationships between hubs), and Satellites (descriptive attributes with history). Data Vault is more complex than star schemas but excels in environments where source systems change frequently.

-- Data Vault: Hub, Link, Satellite pattern

-- Hub: Stores unique business keys
CREATE TABLE vault.hub_customer (
    hub_customer_key    CHAR(32) PRIMARY KEY,  -- MD5 hash of business key
    customer_id         BIGINT NOT NULL,        -- Business key
    load_date           TIMESTAMP NOT NULL,
    record_source       VARCHAR(50) NOT NULL    -- Which source system
);

-- Satellite: Stores descriptive attributes with history
CREATE TABLE vault.sat_customer_details (
    hub_customer_key    CHAR(32) NOT NULL,
    load_date           TIMESTAMP NOT NULL,
    load_end_date       TIMESTAMP DEFAULT '9999-12-31',
    customer_name       VARCHAR(200),
    email               VARCHAR(255),
    city                VARCHAR(100),
    country             VARCHAR(100),
    record_source       VARCHAR(50),
    hash_diff           CHAR(32),  -- Hash of all attributes for change detection
    PRIMARY KEY (hub_customer_key, load_date)
);

-- Link: Stores relationships between hubs
CREATE TABLE vault.link_order_customer (
    link_order_customer_key  CHAR(32) PRIMARY KEY,
    hub_order_key            CHAR(32) NOT NULL,
    hub_customer_key         CHAR(32) NOT NULL,
    load_date                TIMESTAMP NOT NULL,
    record_source            VARCHAR(50) NOT NULL
);
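
Because satellites keep full history, reading the current state means selecting only the open-ended rows. One common pattern, assuming the load_end_date convention used above:

-- Current view of each customer: join the hub to its open satellite row
SELECT
    h.customer_id,
    s.customer_name,
    s.email,
    s.country
FROM vault.hub_customer h
JOIN vault.sat_customer_details s
  ON s.hub_customer_key = h.hub_customer_key
WHERE s.load_end_date = '9999-12-31';

In practice this query is usually wrapped in a view so analysts never touch the raw vault tables directly.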

Naming Conventions

Consistent naming conventions make data models self-documenting and easier to navigate. Every team should agree on naming standards before building their first model:

Recommended Naming Conventions

  • Staging models: stg_[source]_[table] — e.g., stg_stripe_payments
  • Intermediate models: int_[description] — e.g., int_orders_enriched
  • Fact tables: fct_[business_process] — e.g., fct_orders, fct_page_views
  • Dimension tables: dim_[entity] — e.g., dim_customer, dim_product
  • Columns: Use snake_case, be descriptive (total_amount_usd not amt), suffix with units
  • Boolean columns: Prefix with is_ or has_ — e.g., is_active, has_discount
  • Date columns: Suffix with _date or _at — e.g., order_date, created_at
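
Applied together, the conventions turn terse source columns into self-explanatory ones at the staging layer. A sketch with hypothetical source column names:

-- Renaming on the way in so downstream models never see cryptic names
SELECT
    amt      AS total_amount_usd,   -- descriptive, suffixed with units
    active   AS is_active,          -- boolean prefixed with is_
    created  AS created_at          -- timestamp suffixed with _at
FROM raw.legacy_orders;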

Key Takeaways

  • Data modeling for analytics optimizes for read performance and ease of understanding, not write performance
  • Normalization minimizes redundancy but requires many joins; denormalization adds redundancy but improves query speed
  • Star schemas strike a good balance between structure and performance for most analytical use cases
  • Wide denormalized tables are fastest for queries but harder to maintain and update
  • Data Vault is ideal for complex enterprise environments with rapidly changing sources and strict audit requirements
  • Use the dbt layered pattern: staging (clean raw data), intermediate (business logic), marts (final tables)
  • Consistent naming conventions (stg_, int_, fct_, dim_) make models self-documenting and navigable
