What is Data Governance?
Data governance is the framework of policies, processes, and standards that ensures data is managed as a valuable organizational asset. It encompasses data security, privacy compliance, access controls, data cataloging, lineage tracking, and ownership policies. Good governance makes data trustworthy, discoverable, and compliant with regulations — without it, data becomes a liability rather than an asset.
Data governance is not just a compliance checkbox — it is an enabler of data democratization. When governance is done well, more people can safely access and use data because clear policies define who can see what, how data should be used, and what quality standards it meets. Without governance, organizations either restrict data access too tightly (killing productivity) or too loosely (risking breaches and compliance violations).
Pillars of Data Governance
- Data Security: Protecting data from unauthorized access through encryption, authentication, and access controls
- Privacy Compliance: Meeting GDPR, CCPA, HIPAA, and other regulatory requirements for personal data
- Data Cataloging: Making data discoverable through metadata management, documentation, and search
- Data Lineage: Tracking how data flows from sources through transformations to final consumption
- Data Ownership: Assigning clear accountability for data quality, accuracy, and maintenance
- Data Quality: Establishing and enforcing standards for accuracy, completeness, and timeliness
Access Control
Access control ensures that users can only see and modify data they are authorized to access. Modern data platforms implement role-based access control (RBAC), where permissions are granted to roles and users are assigned to those roles. For finer-grained control, column-level masking and row-level security policies restrict access to sensitive fields, as the Snowflake examples below illustrate.
-- Role-Based Access Control (RBAC) in Snowflake
-- Create the roles that will form the hierarchy
CREATE ROLE data_analyst;
CREATE ROLE data_engineer;
CREATE ROLE data_admin;
-- Grant role inheritance
GRANT ROLE data_analyst TO ROLE data_engineer;
GRANT ROLE data_engineer TO ROLE data_admin;
-- Database-level access
GRANT USAGE ON DATABASE analytics TO ROLE data_analyst;
GRANT ALL ON DATABASE analytics TO ROLE data_engineer;
-- Schema-level access
GRANT USAGE ON SCHEMA analytics.marts TO ROLE data_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.marts TO ROLE data_analyst;
-- Analysts get no grants on staging or raw schemas; engineers do
GRANT USAGE ON SCHEMA analytics.staging TO ROLE data_engineer;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.staging TO ROLE data_engineer;
-- Column-level masking for PII
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING)
RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('DATA_ADMIN', 'DATA_ENGINEER') THEN val
    ELSE REGEXP_REPLACE(val, '.+@', '***@')
  END;
ALTER TABLE analytics.dim_customer
MODIFY COLUMN email SET MASKING POLICY email_mask;
-- Row-level security: Each team sees only their region's data
CREATE OR REPLACE ROW ACCESS POLICY region_policy AS (region VARCHAR)
RETURNS BOOLEAN ->
  CASE
    WHEN CURRENT_ROLE() = 'DATA_ADMIN' THEN TRUE
    WHEN CURRENT_ROLE() = 'ANALYST_NA' AND region = 'North America' THEN TRUE
    WHEN CURRENT_ROLE() = 'ANALYST_EU' AND region = 'Europe' THEN TRUE
    ELSE FALSE
  END;
ALTER TABLE analytics.fct_orders
ADD ROW ACCESS POLICY region_policy ON (region);
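To sanity-check a role setup like this, Snowflake's SHOW commands list both directions of the grant graph:
-- Verify the grants from both sides
SHOW GRANTS TO ROLE data_analyst;   -- what the role can access
SHOW GRANTS OF ROLE data_analyst;   -- which users and roles hold it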
Data Cataloging
A data catalog is a centralized metadata store that makes datasets discoverable, understandable, and trustworthy. It answers questions like: "What data do we have?", "What does this column mean?", "Who owns this table?", and "When was it last updated?" Popular tools include DataHub, Amundsen, and Atlan.
# DataHub metadata definition for a dataset
dataset:
  urn: "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.fct_orders,PROD)"
  name: fct_orders
  description: >
    Fact table containing all completed customer orders.
    Grain: one row per order. Updated hourly.
  platform: snowflake
  schema: analytics
  database: ANALYTICS
  ownership:
    owner: "@data-engineering-team"
    steward: "@alice-johnson"
  tags:
    - "pii:false"
    - "tier:gold"
    - "domain:commerce"
    - "refresh:hourly"
  glossaryTerms:
    - "Revenue"
    - "Customer Orders"
    - "E-commerce"
  columns:
    - name: order_id
      type: BIGINT
      description: "Unique identifier for each order"
      tags: ["primary-key"]
    - name: customer_id
      type: BIGINT
      description: "Foreign key to dim_customer"
    - name: total_amount
      type: DECIMAL(12,2)
      description: "Total order amount in USD after discounts"
      glossaryTerms: ["Revenue"]
    - name: order_date
      type: DATE
      description: "Date the order was placed"
  lineage:
    upstream:
      - "raw.app_public.orders"
      - "raw.app_public.order_items"
    downstream:
      - "analytics.monthly_revenue_report"
      - "ml.features.customer_orders"
Data Lineage
Data lineage tracks how data flows from source to destination, through every transformation step. It answers "where did this number come from?" and "what downstream tables will break if I change this column?" Lineage is critical for impact analysis, debugging, and regulatory compliance.
Lineage Levels
- Table-Level Lineage: Which tables feed into which tables. dbt provides this automatically through ref() dependencies (see the sketch after this list).
- Column-Level Lineage: Which source columns flow into which target columns. Shows exact data flow for each field.
- Pipeline-Level Lineage: Which Airflow DAGs, Spark jobs, or dbt models produce each table. Connects code to data.
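Table-level lineage falls out of dbt's dependency graph for free. A minimal sketch of a hypothetical fct_orders model (stg_orders, stg_order_items, and item_amount are assumed names, not from the examples above):
-- models/marts/fct_orders.sql
-- dbt parses these ref() calls to build its dependency graph, which is
-- the table-level lineage rendered in dbt docs and used for `dbt build`
SELECT
    orders.order_id,
    orders.customer_id,
    orders.order_date,
    SUM(items.item_amount) AS total_amount
FROM {{ ref('stg_orders') }} AS orders
JOIN {{ ref('stg_order_items') }} AS items
    ON items.order_id = orders.order_id
GROUP BY orders.order_id, orders.customer_id, orders.order_date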
Privacy and Compliance
-- GDPR / CCPA Compliance Patterns
-- 1. PII Classification: Tag columns containing personal data
COMMENT ON COLUMN analytics.dim_customer.email IS
'PII: email address. Classification: CONFIDENTIAL. Retention: 3 years.';
COMMENT ON COLUMN analytics.dim_customer.customer_name IS
'PII: full name. Classification: CONFIDENTIAL. Retention: 3 years.';
-- 2. Right to Deletion (GDPR Article 17)
-- Delete all data for a specific customer
CREATE OR REPLACE PROCEDURE delete_customer_data(customer_id_to_delete BIGINT)
RETURNS STRING
LANGUAGE SQL
AS
BEGIN
  -- Soft delete in the dimension: overwrite PII in place
  UPDATE analytics.dim_customer
  SET customer_name = 'DELETED',
      email = 'deleted@deleted.com',
      city = 'DELETED',
      is_deleted = TRUE,
      deleted_at = CURRENT_TIMESTAMP
  WHERE customer_id = :customer_id_to_delete;
  -- Anonymize in fact tables so aggregates survive but can no longer
  -- be tied to the individual
  UPDATE analytics.fct_orders
  SET customer_id = -1  -- sentinel for anonymized customers
  WHERE customer_id = :customer_id_to_delete;
  RETURN 'Customer ' || customer_id_to_delete || ' data deleted';
END;
-- 3. Data Retention: Automatically purge old data
DELETE FROM analytics.raw_events
WHERE event_date < CURRENT_DATE - INTERVAL '2 years';
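-- The DELETE above is a one-off statement; to make retention genuinely
-- automatic, one option is a scheduled Snowflake task (the warehouse
-- name and cron schedule here are illustrative)
CREATE OR REPLACE TASK governance.purge_old_events
  WAREHOUSE = etl_wh
  SCHEDULE = 'USING CRON 0 3 * * * UTC'
AS
  DELETE FROM analytics.raw_events
  WHERE event_date < CURRENT_DATE - INTERVAL '2 years';
ALTER TASK governance.purge_old_events RESUME;  -- tasks start suspended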
-- 4. Audit logging: Track who accessed what data
CREATE TABLE governance.access_log (
  log_id BIGINT GENERATED ALWAYS AS IDENTITY,
  user_name VARCHAR(100),
  query_text TEXT,
  tables_accessed ARRAY,  -- Snowflake array type
  accessed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  row_count BIGINT
);
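On Snowflake specifically, much of this audit trail already exists in the ACCOUNT_USAGE.ACCESS_HISTORY view (Enterprise edition and up), so a custom log table may be redundant. A sketch of "who read dim_customer in the last week" (the fully qualified table name is illustrative):
SELECT
  ah.user_name,
  ah.query_start_time,
  obj.value:"objectName"::STRING AS table_read
FROM snowflake.account_usage.access_history AS ah,
     LATERAL FLATTEN(input => ah.direct_objects_accessed) AS obj
WHERE obj.value:"objectName"::STRING = 'ANALYTICS.ANALYTICS.DIM_CUSTOMER'
  AND ah.query_start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP());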
Data Classification
| Classification | Description | Examples | Access |
|---|---|---|---|
| Public | Non-sensitive, shareable externally | Product catalog, public metrics | All employees |
| Internal | Business-sensitive, internal only | Revenue, user counts, KPIs | All employees |
| Confidential | PII, sensitive business data | Emails, names, salaries | Authorized roles only |
| Restricted | Highly sensitive, regulated data | SSN, payment cards, health data | Minimal access, encrypted |
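These classifications are most useful when they live in the platform itself rather than in a wiki. A minimal sketch using Snowflake object tags (the governance schema and the specific tag assignments are illustrative):
-- Define a classification tag with a fixed vocabulary
CREATE TAG governance.classification
  ALLOWED_VALUES 'public', 'internal', 'confidential', 'restricted';
-- Attach classifications at column granularity
ALTER TABLE analytics.dim_customer
  MODIFY COLUMN email SET TAG governance.classification = 'confidential';
ALTER TABLE analytics.fct_orders
  MODIFY COLUMN total_amount SET TAG governance.classification = 'internal';
-- Audit what has been classified (ACCOUNT_USAGE lags by up to ~2 hours)
SELECT object_name, column_name, tag_value
FROM snowflake.account_usage.tag_references
WHERE tag_name = 'CLASSIFICATION';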
Implementing a Governance Program
Starting a data governance program requires organizational buy-in and a pragmatic approach. Do not try to govern everything at once — start with your most critical datasets and expand from there:
- Step 1 — Identify Critical Data: Start with the 10-20 most important datasets that drive business decisions. These are your "tier 1" assets that need governance first.
- Step 2 — Assign Owners: Every dataset must have a clear owner responsible for its quality, freshness, and documentation. Owners are typically domain experts, not the data engineering team.
- Step 3 — Document: Write descriptions for tables and columns in your data catalog. This is the highest-ROI governance activity — discoverable data is usable data.
- Step 4 — Classify: Tag columns with their sensitivity level (PII, confidential, public). This determines access controls and retention policies.
- Step 5 — Enforce: Implement automated policies — masking for PII, retention for old data, freshness alerts for stale data. Manual governance does not scale.
- Step 6 — Monitor: Track governance metrics: percentage of datasets with owners, documented columns, freshness SLA compliance, and access audit coverage (a sample coverage query follows this list).
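Governance metrics can usually be computed straight from warehouse metadata. A sketch of the documentation-coverage metric using Snowflake's ACCOUNT_USAGE views (the ANALYTICS database name carries over from the examples above):
-- Column documentation coverage per schema
SELECT
  table_schema,
  COUNT(*) AS total_columns,
  COUNT(comment) AS documented_columns,
  ROUND(100.0 * COUNT(comment) / COUNT(*), 1) AS pct_documented
FROM snowflake.account_usage.columns
WHERE deleted IS NULL
  AND table_catalog = 'ANALYTICS'
GROUP BY table_schema
ORDER BY pct_documented;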
Data Mesh and Governance
The data mesh paradigm decentralizes data ownership to domain teams while maintaining centralized governance standards. Each domain team owns their data products end-to-end (ingestion through serving), while a central platform team provides shared infrastructure, standards, and governance tools. This model scales better than centralized data teams in large organizations.
Data Mesh Principles
- Domain Ownership: Each business domain (orders, customers, payments) owns its data as a product. The domain team is responsible for quality, freshness, and documentation.
- Data as a Product: Domain data is treated like a product with SLAs, documentation, discoverability, and consumer support.
- Self-Serve Platform: A central platform team provides shared infrastructure (warehouse, catalog, orchestration) that domain teams use to build their data products.
- Federated Governance: Governance standards are defined centrally but enforced by each domain. Central rules define naming conventions, classification policies, and quality minimums (see the sketch after this list).
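In warehouse terms, federated governance often reduces to two kinds of grants: domain teams own their schemas, while only the central governance role may apply policies. A minimal Snowflake sketch (the commerce database and all role names are illustrative):
-- Each domain team owns its data product schema end-to-end
CREATE SCHEMA commerce.orders_data_product;
GRANT OWNERSHIP ON SCHEMA commerce.orders_data_product
  TO ROLE commerce_domain_team;
-- The central platform team does not edit domain data, but it alone
-- can apply governance policies (masking, row access, tags) account-wide
GRANT APPLY MASKING POLICY ON ACCOUNT TO ROLE governance_admin;
GRANT APPLY ROW ACCESS POLICY ON ACCOUNT TO ROLE governance_admin;
GRANT APPLY TAG ON ACCOUNT TO ROLE governance_admin;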
Key Takeaways
- Data governance is a framework of policies, processes, and standards for managing data as an organizational asset
- Implement RBAC with column-level masking and row-level security for fine-grained access control
- Data catalogs make datasets discoverable through metadata, documentation, and search
- Data lineage tracks how data flows from source to consumption — essential for impact analysis and debugging
- Privacy compliance (GDPR, CCPA) requires PII classification, right to deletion, data retention policies, and audit logging
- Data classification (public, internal, confidential, restricted) determines access controls and handling requirements