← Back to Blog

Databricks Unity Catalog: Schema, Table Governance & PII Security at Scale

Learn how to build a robust PII governance framework using Databricks Unity Catalog — covering column-level security, dynamic data masking, row filters, and audit logging to meet PIPEDA and GDPR requirements.

Databricks Unity Catalog: Schema, Table Governance & PII Security at Scale

Data lakehouses store everything — sales records, clickstreams, medical histories, and social insurance numbers. The same platform that makes your analytics team fast is also the one holding your most sensitive data. In Canada and globally, regulations like PIPEDA, GDPR, and CCPA require organizations to demonstrate exactly who accessed what personal data, when, and for what purpose. Databricks Unity Catalog is the answer to that question.

This article is a practical guide to building a robust PII governance framework using Unity Catalog — covering the three-tier namespace, column-level security, dynamic data masking, row filters, and audit logging.


Databricks Unity Catalog

What Is Unity Catalog?

Unity Catalog (UC) is Databricks' centralized governance layer for all data assets on the Lakehouse. It provides:

  • Unified access control — one place to manage permissions across all workspaces
  • Fine-grained security — down to the column and row level
  • Data lineage — automatic tracking of data movement and transformation
  • Audit logging — immutable logs of every access event
  • Automated PII tagging — classify sensitive columns once, enforce everywhere

Before Unity Catalog, permissions were workspace-scoped and inconsistent across environments. With UC, a single GRANT propagates across every workspace attached to the same metastore.


The Three-Tier Namespace

Unity Catalog organizes all data assets in a three-level hierarchy:

metastore
└── catalog
    └── schema (database)
        ├── table
        ├── view
        └── volume

This is a strict convention. Every table reference must follow the pattern catalog.schema.table.

Setting Up Your Governance Structure

-- Create environment-level catalogs
CREATE CATALOG IF NOT EXISTS prod_catalog
  COMMENT 'Production data — governed by data stewards';

CREATE CATALOG IF NOT EXISTS dev_catalog
  COMMENT 'Development sandbox — no PII allowed';

-- Create domain-level schemas inside the catalog
USE CATALOG prod_catalog;

CREATE SCHEMA IF NOT EXISTS customers
  COMMENT 'Customer master data — contains PII';

CREATE SCHEMA IF NOT EXISTS finance
  COMMENT 'Financial transactions';

CREATE SCHEMA IF NOT EXISTS analytics
  COMMENT 'Aggregated, anonymized data for BI';

Governance principle: separate catalogs per environment (dev, staging, prod) and separate schemas per data domain. This gives you a natural permission boundary before you even write a GRANT.


Identifying and Tagging PII Columns

Unity Catalog supports system-defined tags and custom tags. Tags are the foundation of automated PII governance — tag once, query and enforce everywhere.

-- Tag a table as containing sensitive data
ALTER TABLE prod_catalog.customers.customer_profile
SET TAGS ('sensitivity' = 'PII', 'regulation' = 'PIPEDA,GDPR');

-- Tag specific PII columns
ALTER TABLE prod_catalog.customers.customer_profile
ALTER COLUMN social_insurance_number
SET TAGS ('pii_type' = 'government_id', 'mask' = 'required');

ALTER TABLE prod_catalog.customers.customer_profile
ALTER COLUMN email_address
SET TAGS ('pii_type' = 'contact', 'mask' = 'partial');

ALTER TABLE prod_catalog.customers.customer_profile
ALTER COLUMN date_of_birth
SET TAGS ('pii_type' = 'demographic', 'mask' = 'required');

You can then query the information schema to get a full inventory of PII across your entire Lakehouse:

-- Audit: list all PII-tagged columns across all tables
SELECT
  table_catalog,
  table_schema,
  table_name,
  column_name,
  tag_name,
  tag_value
FROM system.information_schema.column_tags
WHERE tag_name = 'pii_type'
ORDER BY table_catalog, table_schema, table_name;

Column-Level Security (CLS) for PII

Column-Level Security lets you GRANT SELECT on specific columns only, hiding the rest from unauthorized users.

-- Analysts can see everything EXCEPT the SIN and date of birth
GRANT SELECT (customer_id, first_name, last_name, email_address, city, province)
  ON TABLE prod_catalog.customers.customer_profile
  TO `data-analysts@company.com`;

-- Data stewards get full access
GRANT SELECT
  ON TABLE prod_catalog.customers.customer_profile
  TO `data-stewards@company.com`;

-- Revoke broad access that may have been granted previously
REVOKE SELECT
  ON TABLE prod_catalog.customers.customer_profile
  FROM `data-analysts@company.com`;

Column-level privileges are enforced at query time — even if an analyst runs SELECT *, they will only receive the columns they have been explicitly granted.


Dynamic Data Masking

Dynamic Data Masking (DDM) goes one step further: it lets users query a column but returns a masked value based on their identity. The raw data never leaves the Lakehouse.

Creating Masking Policies

-- Mask social insurance number: show only last 3 digits
CREATE OR REPLACE FUNCTION prod_catalog.security.mask_sin(sin STRING)
RETURNS STRING
RETURN CASE
  WHEN is_member('data-stewards') THEN sin
  ELSE CONCAT('***-***-', RIGHT(sin, 3))
END;

-- Mask email: show domain only
CREATE OR REPLACE FUNCTION prod_catalog.security.mask_email(email STRING)
RETURNS STRING
RETURN CASE
  WHEN is_member('data-stewards') THEN email
  WHEN is_member('data-analysts') THEN CONCAT('****@', SPLIT(email, '@')[1])
  ELSE '****@****.***'
END;

-- Mask date of birth: return birth year only
CREATE OR REPLACE FUNCTION prod_catalog.security.mask_dob(dob DATE)
RETURNS DATE
RETURN CASE
  WHEN is_member('data-stewards') THEN dob
  ELSE DATE(CONCAT(YEAR(dob), '-01-01'))
END;

Applying Masking Policies to Tables

ALTER TABLE prod_catalog.customers.customer_profile
ALTER COLUMN social_insurance_number
SET MASKING POLICY prod_catalog.security.mask_sin;

ALTER TABLE prod_catalog.customers.customer_profile
ALTER COLUMN email_address
SET MASKING POLICY prod_catalog.security.mask_email;

ALTER TABLE prod_catalog.customers.customer_profile
ALTER COLUMN date_of_birth
SET MASKING POLICY prod_catalog.security.mask_dob;

Now a data analyst who runs SELECT * FROM prod_catalog.customers.customer_profile will see:

customer_id email_address social_insurance_number date_of_birth
C001 ****@company.com --123 1980-01-01
C002 ****@gmail.com --456 1975-01-01

A data steward sees the actual values. Same query, different results — enforced automatically.


Row-Level Security with Row Filters

Row filters restrict which rows a user can see based on their identity or group membership. This is critical for multi-tenant data, regional compliance, or department segregation.

-- Create a row filter: each region manager only sees their region's data
CREATE OR REPLACE FUNCTION prod_catalog.security.filter_by_region(region STRING)
RETURNS BOOLEAN
RETURN is_member('data-stewards')
    OR region = current_user_region();  -- custom helper function

-- Apply the row filter to the table
ALTER TABLE prod_catalog.customers.customer_profile
SET ROW FILTER prod_catalog.security.filter_by_region ON (province);

For a more practical example — restricting analysts to only their department's customer records:

CREATE OR REPLACE FUNCTION prod_catalog.security.filter_by_account_owner(
  account_owner_email STRING
)
RETURNS BOOLEAN
RETURN is_member('data-stewards')
    OR account_owner_email = CURRENT_USER();

ALTER TABLE prod_catalog.customers.customer_profile
SET ROW FILTER prod_catalog.security.filter_by_account_owner
  ON (account_owner_email);

With this in place, SELECT COUNT(*) FROM prod_catalog.customers.customer_profile returns a different number for every user — showing only the rows they own.


PII in the Medallion Architecture

Unity Catalog integrates naturally with the Bronze → Silver → Gold pipeline. Each layer has a distinct governance posture:

Medallion Architecture with Unity Catalog

Medallion Architecture — Bronze, Silver, Gold layers with Unity Catalog governance

Layer PII Posture Who Has Access
Bronze Raw PII stored as-is Data engineers only (data-engineers group)
Silver PII masked/hashed, columns tagged Data engineers + analysts with CLS + masking
Gold PII fully aggregated or removed All analysts, BI tools, external consumers

Enforcing the Boundary in PySpark

from pyspark.sql import functions as F
from pyspark.sql.functions import sha2, concat_ws

# Bronze → Silver: hash the PII before writing to Silver
df_bronze = spark.table("prod_catalog.bronze.raw_customers")

df_silver = df_bronze.withColumn(
    # One-way hash — preserves join ability, removes reversibility
    "customer_hash", sha2(concat_ws("|", "email_address", "social_insurance_number"), 256)
).withColumn(
    # Generalize date of birth to birth year for analytics use
    "birth_year", F.year("date_of_birth")
).drop(
    "social_insurance_number", "date_of_birth", "email_address"
)

# Write to Silver with Unity Catalog lineage automatically tracked
df_silver.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("prod_catalog.silver.customers_clean")

The key insight: PII never reaches the Gold layer. Analytics and BI tools query Gold tables that contain only aggregated or hashed identifiers — compliant by design, not by policy.


Delta Lake + Unity Catalog: Time Travel for Compliance

Delta Lake

When a regulatory audit asks "who accessed this customer record on March 3rd?", you need two things: Delta Lake's transaction log and Unity Catalog's audit log. Together they give you complete data lineage.

-- Delta time travel: inspect what data looked like at a past point in time
SELECT *
FROM prod_catalog.customers.customer_profile
TIMESTAMP AS OF '2026-01-15 00:00:00'
WHERE customer_id = 'C001';

-- View the full transaction history for a compliance review
DESCRIBE HISTORY prod_catalog.customers.customer_profile;

Unity Catalog writes every SELECT, INSERT, UPDATE, and DELETE to the system audit log:

-- Query Unity Catalog audit logs (available in system.access.audit)
SELECT
  event_time,
  user_identity.email AS user_email,
  action_name,
  request_params.table_full_name AS table_accessed,
  response.status_code
FROM system.access.audit
WHERE action_name IN ('commandSubmit', 'selectTable')
  AND request_params.table_full_name LIKE '%customers%'
  AND event_time >= '2026-01-01'
ORDER BY event_time DESC;

This query is your compliance report. No manual tracking, no spreadsheets — the Lakehouse is its own audit trail.


Best Practices Summary

1. Use groups, not individual users Always GRANT to a group (data-analysts@company.com) rather than individual emails. People change roles; groups don't.

2. Default DENY, explicit ALLOW Unity Catalog's default is no access. Never grant SELECT * on a schema to everyone. Explicit grants are your paper trail.

3. Separate PII catalogs from analytics catalogs PII lives in prod_catalog.customers. Aggregated data lives in prod_catalog.analytics. A BI tool should never touch the customers schema directly.

4. Tag before you govern Tagging columns with pii_type enables automated governance — you can write policies that apply to "all columns tagged government_id" rather than naming individual columns that change with schema evolution.

5. Test masking with impersonation Databricks lets you run queries as another user:

-- Test what an analyst actually sees
SET ROLE `data-analysts@company.com`;
SELECT * FROM prod_catalog.customers.customer_profile LIMIT 5;

6. Review access quarterly Use the information schema to generate a quarterly access report:

SELECT grantee, privilege_type, table_catalog, table_schema, table_name
FROM system.information_schema.table_privileges
WHERE table_catalog = 'prod_catalog'
ORDER BY grantee, table_name;

Conclusion

Unity Catalog transforms PII security from a manual, error-prone process into a systematic, auditable framework. The combination of the three-tier namespace, column-level security, dynamic data masking, row filters, and Delta Lake's transaction log gives you everything a regulator needs to see — and nothing more than each user needs to access.

For Canadian organizations subject to PIPEDA, this architecture supports the key obligations: purpose limitation (row filters by department), data minimization (Gold layer aggregation), access control (CLS + masking), and accountability (audit logs). For global enterprises under GDPR, it satisfies Article 25 (data protection by design) and Article 32 (security of processing) requirements.

The most important takeaway: governance built into the Lakehouse is not a performance tax — it is a business asset. Every GRANT you write, every masking policy you apply, and every PII tag you set is evidence that your organization treats personal data with the seriousness it deserves.


Interested in implementing Unity Catalog governance in your organization? Talk to the LanaCloud team. We specialize in Databricks Lakehouse architecture, Delta Lake pipelines, and enterprise data governance frameworks.

LanaCloud Assistant
Online · usually replies instantly