TechLead
Lesson 19 of 22
7 min read
Data Engineering

Data Governance

Implement data governance practices including access control, data cataloging, lineage tracking, privacy compliance, and organizational policies

What is Data Governance?

Data governance is the framework of policies, processes, and standards that ensures data is managed as a valuable organizational asset. It encompasses data security, privacy compliance, access controls, data cataloging, lineage tracking, and ownership policies. Good governance makes data trustworthy, discoverable, and compliant with regulations — without it, data becomes a liability rather than an asset.

Data governance is not just a compliance checkbox — it is an enabler of data democratization. When governance is done well, more people can safely access and use data because clear policies define who can see what, how data should be used, and what quality standards it meets. Without governance, organizations either restrict data access too tightly (killing productivity) or too loosely (risking breaches and compliance violations).

Pillars of Data Governance

  • Data Security: Protecting data from unauthorized access through encryption, authentication, and access controls
  • Privacy Compliance: Meeting GDPR, CCPA, HIPAA, and other regulatory requirements for personal data
  • Data Cataloging: Making data discoverable through metadata management, documentation, and search
  • Data Lineage: Tracking how data flows from sources through transformations to final consumption
  • Data Ownership: Assigning clear accountability for data quality, accuracy, and maintenance
  • Data Quality: Establishing and enforcing standards for accuracy, completeness, and timeliness

Access Control

Access control ensures that users can only see and modify data they are authorized to access. Modern data platforms implement role-based access control (RBAC) where permissions are granted to roles, and users are assigned to roles. More granular column-level and row-level security restricts access to sensitive data.

-- Role-Based Access Control (RBAC) in Snowflake
-- Create roles with specific permissions

-- Role hierarchy
CREATE ROLE data_analyst;
CREATE ROLE data_engineer;
CREATE ROLE data_admin;

-- Grant role inheritance
GRANT ROLE data_analyst TO ROLE data_engineer;
GRANT ROLE data_engineer TO ROLE data_admin;

-- Database-level access
GRANT USAGE ON DATABASE analytics TO ROLE data_analyst;
GRANT ALL ON DATABASE analytics TO ROLE data_engineer;

-- Schema-level access
GRANT USAGE ON SCHEMA analytics.marts TO ROLE data_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.marts TO ROLE data_analyst;

-- Analysts receive no grants on the staging or raw schemas;
-- engineers are granted access to them explicitly
GRANT USAGE ON SCHEMA analytics.staging TO ROLE data_engineer;

-- Column-level masking for PII
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING)
RETURNS STRING ->
    CASE
        WHEN CURRENT_ROLE() IN ('DATA_ADMIN', 'DATA_ENGINEER') THEN val
        ELSE REGEXP_REPLACE(val, '.+@', '***@')
    END;

ALTER TABLE analytics.dim_customer
    MODIFY COLUMN email SET MASKING POLICY email_mask;

-- Row-level security: Each team sees only their region's data
CREATE OR REPLACE ROW ACCESS POLICY region_policy AS (region VARCHAR)
RETURNS BOOLEAN ->
    CASE
        WHEN CURRENT_ROLE() = 'DATA_ADMIN' THEN TRUE
        WHEN CURRENT_ROLE() = 'ANALYST_NA' AND region = 'North America' THEN TRUE
        WHEN CURRENT_ROLE() = 'ANALYST_EU' AND region = 'Europe' THEN TRUE
        ELSE FALSE
    END;

ALTER TABLE analytics.fct_orders
    ADD ROW ACCESS POLICY region_policy ON (region);

Data Cataloging

A data catalog is a centralized metadata store that makes datasets discoverable, understandable, and trustworthy. It answers questions like: "What data do we have?", "What does this column mean?", "Who owns this table?", and "When was it last updated?" Popular tools include DataHub, Amundsen, and Atlan.

# DataHub metadata definition for a dataset
dataset:
  urn: "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.fct_orders,PROD)"
  name: fct_orders
  description: >
    Fact table containing all completed customer orders.
    Grain: one row per order. Updated hourly.
  platform: snowflake
  schema: analytics
  database: ANALYTICS

  ownership:
    owner: "@data-engineering-team"
    steward: "@alice-johnson"

  tags:
    - "pii:false"
    - "tier:gold"
    - "domain:commerce"
    - "refresh:hourly"

  glossaryTerms:
    - "Revenue"
    - "Customer Orders"
    - "E-commerce"

  columns:
    - name: order_id
      type: BIGINT
      description: "Unique identifier for each order"
      tags: ["primary-key"]
    - name: customer_id
      type: BIGINT
      description: "Foreign key to dim_customer"
    - name: total_amount
      type: DECIMAL(12,2)
      description: "Total order amount in USD after discounts"
      glossaryTerms: ["Revenue"]
    - name: order_date
      type: DATE
      description: "Date the order was placed"

  lineage:
    upstream:
      - "raw.app_public.orders"
      - "raw.app_public.order_items"
    downstream:
      - "analytics.monthly_revenue_report"
      - "ml.features.customer_orders"

Data Lineage

Data lineage tracks how data flows from source to destination, through every transformation step. It answers "where did this number come from?" and "what downstream tables will break if I change this column?" Lineage is critical for impact analysis, debugging, and regulatory compliance.

Lineage Levels

  • Table-Level Lineage: Which tables feed into which tables. dbt provides this automatically through ref() dependencies.
  • Column-Level Lineage: Which source columns flow into which target columns. Shows exact data flow for each field.
  • Pipeline-Level Lineage: Which Airflow DAGs, Spark jobs, or dbt models produce each table. Connects code to data.
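To make table-level impact analysis concrete, the sketch below models lineage as a directed graph and walks it downstream to answer "what breaks if this table changes?" The edges mirror the tables from the catalog example above; the graph-walking code itself is an illustrative assumption, not the API of any particular lineage tool.

```python
# Minimal sketch of table-level lineage impact analysis.
# Edges are hypothetical and mirror the catalog example.
from collections import defaultdict

# upstream table -> list of tables that consume it
lineage = defaultdict(list)

def add_edge(upstream: str, downstream: str) -> None:
    lineage[upstream].append(downstream)

add_edge("raw.app_public.orders", "analytics.fct_orders")
add_edge("raw.app_public.order_items", "analytics.fct_orders")
add_edge("analytics.fct_orders", "analytics.monthly_revenue_report")
add_edge("analytics.fct_orders", "ml.features.customer_orders")

def downstream_impact(table: str) -> set[str]:
    """Return every table transitively downstream of `table`."""
    impacted, stack = set(), [table]
    while stack:
        for child in lineage[stack.pop()]:
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted

print(sorted(downstream_impact("raw.app_public.orders")))
```

Changing the raw orders table is flagged as impacting the fact table plus both of its consumers, which is exactly the question an analyst asks before altering a column.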

Privacy and Compliance

-- GDPR / CCPA Compliance Patterns

-- 1. PII Classification: Tag columns containing personal data
COMMENT ON COLUMN analytics.dim_customer.email IS
    'PII: email address. Classification: CONFIDENTIAL. Retention: 3 years.';
COMMENT ON COLUMN analytics.dim_customer.customer_name IS
    'PII: full name. Classification: CONFIDENTIAL. Retention: 3 years.';

-- 2. Right to Deletion (GDPR Article 17)
-- Delete all data for a specific customer
CREATE OR REPLACE PROCEDURE delete_customer_data(customer_id_to_delete BIGINT)
RETURNS STRING
LANGUAGE SQL
AS
BEGIN
    -- Soft delete in dimension
    UPDATE analytics.dim_customer
    SET customer_name = 'DELETED',
        email = 'deleted@deleted.com',
        city = 'DELETED',
        is_deleted = TRUE,
        deleted_at = CURRENT_TIMESTAMP
    WHERE customer_id = :customer_id_to_delete;

    -- Anonymize in fact tables (preserve aggregates)
    UPDATE analytics.fct_orders
    SET customer_id = -1  -- Anonymized customer
    WHERE customer_id = :customer_id_to_delete;

    RETURN 'Customer ' || :customer_id_to_delete || ' data deleted';
END;

-- 3. Data Retention: Automatically purge old data
DELETE FROM analytics.raw_events
WHERE event_date < CURRENT_DATE - INTERVAL '2 years';

-- 4. Audit logging: Track who accessed what data
-- (Snowflake also exposes this natively via ACCOUNT_USAGE.ACCESS_HISTORY)
CREATE TABLE governance.access_log (
    log_id          BIGINT AUTOINCREMENT,
    user_name       VARCHAR(100),
    query_text      TEXT,
    tables_accessed ARRAY,
    accessed_at     TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    row_count       BIGINT
);

Data Classification

  • Public: Non-sensitive, shareable externally. Examples: product catalog, public metrics. Access: all employees.
  • Internal: Business-sensitive, internal only. Examples: revenue, user counts, KPIs. Access: all employees.
  • Confidential: PII and sensitive business data. Examples: emails, names, salaries. Access: authorized roles only.
  • Restricted: Highly sensitive, regulated data. Examples: SSN, payment cards, health data. Access: minimal access, encrypted.
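Since classification drives access controls and handling requirements, it helps to encode the tiers as machine-readable policy that automated checks can consult. A minimal sketch, assuming hypothetical rule names (`mask`, `external_sharing`, `access`) derived from the tiers above:

```python
# Illustrative mapping from classification tier to handling rules.
# The tier names follow the table above; the rule fields are assumptions.
HANDLING = {
    "public":       {"mask": False, "external_sharing": True,  "access": "all employees"},
    "internal":     {"mask": False, "external_sharing": False, "access": "all employees"},
    "confidential": {"mask": True,  "external_sharing": False, "access": "authorized roles"},
    "restricted":   {"mask": True,  "external_sharing": False, "access": "minimal, encrypted"},
}

def requires_masking(level: str) -> bool:
    """Confidential and restricted columns should carry masking policies."""
    return HANDLING[level]["mask"]

print(requires_masking("confidential"))  # True
print(requires_masking("internal"))      # False
```

A CI check built on a table like this can fail a deploy when a column tagged `confidential` lacks a masking policy, turning the classification scheme into an enforced rule rather than a wiki page.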

Implementing a Governance Program

Starting a data governance program requires organizational buy-in and a pragmatic approach. Do not try to govern everything at once — start with your most critical datasets and expand from there:

  • Step 1 — Identify Critical Data: Start with the 10-20 most important datasets that drive business decisions. These are your "tier 1" assets that need governance first.
  • Step 2 — Assign Owners: Every dataset must have a clear owner responsible for its quality, freshness, and documentation. Owners are typically domain experts, not the data engineering team.
  • Step 3 — Document: Write descriptions for tables and columns in your data catalog. This is the highest-ROI governance activity — discoverable data is usable data.
  • Step 4 — Classify: Tag columns with their sensitivity level (PII, confidential, public). This determines access controls and retention policies.
  • Step 5 — Enforce: Implement automated policies — masking for PII, retention for old data, freshness alerts for stale data. Manual governance does not scale.
  • Step 6 — Monitor: Track governance metrics: percentage of datasets with owners, documented columns, freshness SLA compliance, and access audit coverage.
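To make Step 6 concrete, the sketch below computes two of the listed coverage metrics (datasets with owners, documented columns) from catalog metadata. The dataset records and field names here are hypothetical; in practice these values would come from your catalog's API.

```python
# Hedged sketch: governance coverage metrics from catalog metadata.
# The records below are invented for illustration.
datasets = [
    {"name": "analytics.fct_orders", "owner": "data-engineering-team",
     "documented_columns": 4, "total_columns": 4},
    {"name": "analytics.dim_customer", "owner": None,
     "documented_columns": 2, "total_columns": 6},
]

def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal; 0.0 for an empty denominator."""
    return round(100 * part / whole, 1) if whole else 0.0

ownership_coverage = pct(sum(1 for d in datasets if d["owner"]), len(datasets))
doc_coverage = pct(sum(d["documented_columns"] for d in datasets),
                   sum(d["total_columns"] for d in datasets))

print(f"datasets with owners: {ownership_coverage}%")  # 50.0%
print(f"documented columns:   {doc_coverage}%")        # 60.0%
```

Tracking these numbers over time (and per domain) shows whether the governance program is actually expanding coverage or stalling after the initial tier-1 push.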

Data Mesh and Governance

The data mesh paradigm decentralizes data ownership to domain teams while maintaining centralized governance standards. Each domain team owns their data products end-to-end (ingestion through serving), while a central platform team provides shared infrastructure, standards, and governance tools. This model scales better than centralized data teams in large organizations.

Data Mesh Principles

  • Domain Ownership: Each business domain (orders, customers, payments) owns its data as a product. The domain team is responsible for quality, freshness, and documentation.
  • Data as a Product: Domain data is treated like a product with SLAs, documentation, discoverability, and consumer support.
  • Self-Serve Platform: A central platform team provides shared infrastructure (warehouse, catalog, orchestration) that domain teams use to build their data products.
  • Federated Governance: Governance standards are defined centrally but enforced by each domain. Central rules define naming conventions, classification policies, and quality minimums.

Key Takeaways

  • Data governance is a framework of policies, processes, and standards for managing data as an organizational asset
  • Implement RBAC with column-level masking and row-level security for fine-grained access control
  • Data catalogs make datasets discoverable through metadata, documentation, and search
  • Data lineage tracks how data flows from source to consumption — essential for impact analysis and debugging
  • Privacy compliance (GDPR, CCPA) requires PII classification, right to deletion, data retention policies, and audit logging
  • Data classification (public, internal, confidential, restricted) determines access controls and handling requirements
