---
name: data-governance
description: Establish and maintain data governance programs including data quality, data classification, data lineage, metadata management, data catalog, and regulatory compliance. Use when building data governance frameworks, implementing data quality rules, creating data catalogs, managing data classification, or ensuring regulatory compliance. Triggers on phrases like "data governance", "data quality", "data classification", "data lineage", "metadata management", "data catalog", "data steward", "data owner", "data dictionary", "master data management", "MDM", "data standard", "data policy", "data retention", "regulatory compliance", "GDPR", "data privacy".
---

# Data Governance

Establish and maintain data governance programs covering data quality, data classification, data lineage, metadata management, data catalogs, and regulatory compliance.

## Workflow

### 1. Governance Framework

```
DATA GOVERNANCE FRAMEWORK
═══════════════════════════════════════

Organizational Structure:
═══════════════════════════════════════

Data Governance Council (Executive):
  → CIO, CDO, CISO, Legal, Compliance
  → Meets quarterly
  → Approves policies, standards, budget
  → Resolves escalations

Data Owners and Stewards (by Domain):
═══════════════════════════════════════

Domain            Data Owner          Data Steward         Systems
───────────────────────────────────────────────────────────────────────
Customer          VP Sales            Jane Smith           Salesforce, HubSpot
Financial         CFO                 Mike Johnson         Oracle ERP, QuickBooks
HR                CHRO                Sarah Lee            Workday, BambooHR
Product           VP Product          David Chen           Jira, ProductBoard
Operations        COO                 Amy Wong             SAP, ServiceNow

Data Steward Responsibilities:
  → Define data standards for domain
  → Approve data quality rules
  → Resolve data quality issues
  → Maintain data dictionary entries
  → Classify data sensitivity
  → Approve data access requests

Data Custodians (IT):
═══════════════════════════════════════

  → Implement technical controls
  → Maintain data pipelines
  → Execute backups and recovery
  → Monitor data quality alerts
  → Apply data masking/anonymization
  → Manage metadata tools
```

### 2. Data Classification

```
DATA CLASSIFICATION SCHEME
═══════════════════════════════════════

Classification     Description                  Examples                    Controls
───────────────────────────────────────────────────────────────────────────────
Public             No harm if disclosed         Marketing materials         None
Internal           Business impact if disclosed Org charts, processes       Access control
Confidential       Significant harm if          Customer data, financials   Encryption at rest + in transit
                   disclosed                                                Access logging, MFA
Restricted         Severe harm if disclosed     PII, PHI, payment data      Encryption + tokenization
                                                                            Strict access, auditing, retention

CLASSIFICATION PROCESS:
═══════════════════════════════════════

  1. Discover: Automated scanning for sensitive data
     → Regex patterns (SSN, credit card, email)
     → ML-based classification (DLP tools)
     → Manual review for edge cases

  2. Classify: Assign sensitivity level
     → Auto-classify based on patterns
     → Data steward validates classification
     → Tags applied (database columns, file properties)

  3. Protect: Apply controls based on classification
     → Encryption (Restricted + Confidential)
     → Access controls (all non-Public)
     → Audit logging (Restricted + Confidential)
     → Data masking (non-production)
     → Retention policy (Restricted)

CLASSIFICATION RESULTS:
═══════════════════════════════════════

  Public: 15% of data
  Internal: 40% of data
  Confidential: 35% of data
  Restricted: 10% of data

  Data elements classified: 2,847
  Tables classified: 342
  Files classified: 15,230
```
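The Discover and Classify steps above can be sketched in Python. This is a minimal illustration, not a replacement for a DLP tool: the regex patterns, the `PATTERN_LEVEL` mapping, and the `Internal` default are all assumptions made for the example, and the proposed level would still go to a data steward for validation.

```python
import re

# Hypothetical detection patterns; real DLP tools use far more robust rules
# (for example, a Luhn checksum for card numbers).
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

# Assumed mapping from detected pattern to sensitivity level.
PATTERN_LEVEL = {
    "ssn": "Restricted",
    "credit_card": "Restricted",
    "email": "Confidential",
}

# Levels ordered from least to most sensitive.
LEVELS = ["Public", "Internal", "Confidential", "Restricted"]


def discover(values):
    """Return the names of all patterns found in any of the sample values."""
    found = set()
    for value in values:
        for name, pattern in PATTERNS.items():
            if pattern.search(str(value)):
                found.add(name)
    return found


def classify(values, default="Internal"):
    """Propose the most sensitive level implied by the detected patterns."""
    levels = [PATTERN_LEVEL[name] for name in discover(values)]
    levels.append(default)  # assumed fallback when nothing sensitive is detected
    return max(levels, key=LEVELS.index)


# Sample column values, invented for illustration.
print(classify(["jane@example.com", "n/a"]))
print(classify(["123-45-6789"]))
```

Taking the maximum level means a column with mixed content is tagged by its most sensitive value, which errs on the side of stronger controls.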

### 3. Data Quality

```
DATA QUALITY DIMENSIONS
═══════════════════════════════════════

Dimension         Definition                    Metric                Target
───────────────────────────────────────────────────────────────────────────────
Completeness      Required fields populated     % non-null            ≥ 99%
Accuracy          Data matches source of truth  % correct             ≥ 98%
Consistency       Data agrees across systems    % matching            ≥ 99%
Timeliness        Data available when needed    Latency (hours)       ≤ 24h
Validity          Data matches format/rules     % valid               ≥ 99%
Uniqueness        No duplicate records          % unique              ≥ 99.5%

DATA QUALITY RULES:
═══════════════════════════════════════

  Rule ID    Table      Column       Rule                   Target   Current
  ────────────────────────────────────────────────────────────────────────────
  DQ-001     customers  email        Is valid email format  100%     99.2%
  DQ-002     customers  phone        Matches phone pattern  99%      98.5%
  DQ-003     customers  name         Not null               100%     100%
  DQ-004     customers  created_at   Not in future          100%     99.8%
  DQ-005     orders     amount       > 0                    100%     99.9%
  DQ-006     orders     customer_id  FK to customers        100%     99.7%
  DQ-007     products   sku          Unique                 100%     100%
  DQ-008     products   price        > 0 AND < 1000000      100%     99.9%

DATA QUALITY MONITORING:
═══════════════════════════════════════

  → Automated checks: Nightly pipeline
  → Alert: Slack #data-quality channel
  → SLA: Critical rules fixed within 24 hours
  → Dashboard: Data quality scorecard (weekly)
  → Trending: Monthly quality report to governance council

  Current Data Quality Score: 99.1% (target: 99.5%)
  Rules passing: 94 of 98 (96%)
  Critical issues: 2 (remediating)
```
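A few of the rules above (DQ-001, DQ-003, DQ-004) can be expressed as predicates and scored as a pass rate. The sketch below is plain Python under stated assumptions: the `customer_rules` and `evaluate` helpers, the simplified email regex, and the sample rows are hypothetical, and a production setup would more likely use one of the quality tools listed under Integration Points.

```python
import re
from datetime import datetime

# Deliberately simple email check; an assumption, not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$")


def customer_rules(as_of):
    """Build (rule_id, column, predicate, target_percent) tuples for the customers table."""
    return [
        ("DQ-001", "email", lambda v: isinstance(v, str) and bool(EMAIL_RE.match(v)), 100.0),
        ("DQ-003", "name", lambda v: v is not None, 100.0),
        ("DQ-004", "created_at", lambda v: v is not None and v <= as_of, 100.0),
    ]


def evaluate(rows, rules):
    """Return one result per rule with its pass rate (percent) and pass/fail status."""
    results = []
    for rule_id, column, predicate, target in rules:
        passed = sum(1 for row in rows if predicate(row.get(column)))
        rate = 100.0 * passed / len(rows) if rows else 100.0
        results.append(
            {"rule": rule_id, "rate": rate, "target": target, "passing": rate >= target}
        )
    return results


# Hypothetical sample rows for illustration only.
customers = [
    {"name": "Ada", "email": "ada@example.com", "created_at": datetime(2024, 1, 5)},
    {"name": "Bob", "email": "bob-at-example.com", "created_at": datetime(2024, 2, 1)},
    {"name": "Cy", "email": "cy@example.org", "created_at": datetime(2024, 3, 9)},
    {"name": "Di", "email": "di@example.net", "created_at": datetime(2024, 4, 2)},
]

results = evaluate(customers, customer_rules(as_of=datetime(2024, 12, 31)))
for r in results:
    print(r["rule"], f"{r['rate']:.1f}%", "PASS" if r["passing"] else "FAIL")
```

Passing `as_of` explicitly keeps the "not in future" rule deterministic, so a nightly run can be replayed against the same cutoff.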

### 4. Data Lineage & Catalog

```
DATA LINEAGE
═══════════════════════════════════════

  Source → Pipeline → Warehouse → Consumption

  Example Lineage:
═══════════════════════════════════════

  Salesforce (Customer)
    → CDC (Debezium)
      → Kafka (raw_customers_topic)
        → Spark (transform: deduplicate, enrich)
          → Snowflake (stg.customers)
            → dbt (model: dim_customers)
              → Tableau (Customer Dashboard)
              → ML Pipeline (Churn Model)

  Data Catalog (Amundsen/Alation/Purview):
═══════════════════════════════════════

  Features:
    → Search: Find datasets by name, description, tags
    → Metadata: Technical (schema) + Business (glossary)
    → Lineage: Visual dependency graph
    → Usage: Query frequency, popular dashboards
    → Ownership: Data owner, steward, contact
    → Quality: DQ score, SLA status
    → Access: Request access (workflow)

  Adoption Metrics:
    → Tables cataloged: 342 of 380 (90%)
    → Business glossary terms: 156
    → Active catalog users: 45 (target: 60)
    → Average search-to-access time: 15 minutes (down from 2 hours)
```
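The example lineage above is a directed graph, and impact analysis is a walk over it. The sketch below models that graph as a plain dictionary. The node labels are shortened from the example, the `downstream` and `upstream` helpers are hypothetical names, and a real deployment would read edges from a lineage tool rather than hard-code them.

```python
from collections import deque

# Edges follow the example lineage: each key feeds the nodes in its list.
LINEAGE = {
    "Salesforce (Customer)": ["CDC (Debezium)"],
    "CDC (Debezium)": ["Kafka (raw_customers_topic)"],
    "Kafka (raw_customers_topic)": ["Spark (transform)"],
    "Spark (transform)": ["Snowflake (stg.customers)"],
    "Snowflake (stg.customers)": ["dbt (dim_customers)"],
    "dbt (dim_customers)": ["Tableau (Customer Dashboard)", "ML Pipeline (Churn Model)"],
}


def downstream(graph, start):
    """Breadth-first walk returning every node reachable from start, in visit order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def upstream(graph, target):
    """Return every node that feeds target, found by walking the reversed graph."""
    reverse = {}
    for parent, children in graph.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    return downstream(reverse, target)


# What is affected if the dim_customers model changes?
print(downstream(LINEAGE, "dbt (dim_customers)"))
# Where does the staging table's data come from?
print(upstream(LINEAGE, "Snowflake (stg.customers)"))
```

The same two walks answer the common governance questions: `downstream` for change impact, `upstream` for tracing a quality issue back to its source.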

### 5. Regulatory Compliance

```
REGULATORY COMPLIANCE
═══════════════════════════════════════

GDPR Compliance:
═══════════════════════════════════════

  Requirement                  Implementation                Status
  ────────────────────────────────────────────────────────────────────────
  Lawful basis                 Privacy notice + consent       ✓ Compliant
  Right to access              Data subject request portal    ✓ Compliant
  Right to erasure             Automated deletion pipeline    ✓ Compliant
  Data portability             Export in JSON/CSV             ✓ Compliant
  DPO                          Appointed (Jane Smith)         ✓ Compliant
  DPIA                         Annual assessment              ✓ Compliant
  Cross-border transfer        SCCs + adequacy decision       ✓ Compliant
  Breach notification          72-hour process defined        ✓ Compliant

Data Retention:
═══════════════════════════════════════

  Data Type           Retention    Disposition          Tool
  ────────────────────────────────────────────────────────────────────────
  Customer records    7 years      Secure delete        AWS S3 lifecycle
  Financial records   10 years     Archive → delete     Oracle ERP policy
  Employee records    7 years      Secure delete        Workday retention
  Email               3 years      Archive → delete     M365 retention
  Logs                1 year       Delete               CloudWatch policy
  Backups             30 days      Delete               Backup lifecycle
  Personal data       Per purpose  Delete on request    GDPR pipeline

CCPA Compliance:
═══════════════════════════════════════

  → "Do Not Sell" link on website
  → Data subject request handling
  → Vendor assessment (third-party data sharing)
  → Privacy policy updated
```
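The fixed periods in the Data Retention table above can be turned into a disposition check. This is a sketch under assumptions: it measures retention from a record's creation date (the table does not say which event starts the clock), the `RETENTION` dictionary and helper names are hypothetical, and "Personal data" is left out because its period depends on purpose rather than a fixed duration.

```python
from datetime import date, timedelta

# Retention periods copied from the table; "Personal data" is purpose-based and omitted.
RETENTION = {
    "Customer records": ("years", 7),
    "Financial records": ("years", 10),
    "Employee records": ("years", 7),
    "Email": ("years", 3),
    "Logs": ("years", 1),
    "Backups": ("days", 30),
}


def expiry_date(data_type, created):
    """Return the date on which a record of this type reaches the end of retention."""
    unit, amount = RETENTION[data_type]
    if unit == "days":
        return created + timedelta(days=amount)
    try:
        return created.replace(year=created.year + amount)
    except ValueError:
        # created is Feb 29 and the target year is not a leap year
        return created.replace(year=created.year + amount, day=28)


def is_due_for_disposition(data_type, created, today):
    """True once the retention period has fully elapsed (assumed start: creation date)."""
    return today >= expiry_date(data_type, created)


print(expiry_date("Backups", date(2024, 1, 1)))
print(is_due_for_disposition("Email", date(2020, 6, 1), date(2023, 6, 1)))
```

A scheduled job could run `is_due_for_disposition` over record metadata and hand matches to the tool named in the table's Tool column.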

## Edge Cases

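- **Legacy systems**: Classify data and document lineage manually where automated scanners cannot connect
- **Multi-cloud**: Apply one classification scheme and policy set across every cloud provider
- **Edge devices**: Extend governance to IoT data collected on edge devices
- **Federated data**: Coordinate governance for data shared across organizations
- **Unstructured data**: Classify documents with ML-based tools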

## Integration Points

- **Catalog**: Amundsen, Alation, Purview, DataHub
- **Quality**: Great Expectations, dbt tests, Monte Carlo
- **Classification**: DLP tools, Purview, Macie
- **Lineage**: Apache Atlas, OpenLineage, dbt docs
- **Compliance**: OneTrust, TrustArc, custom workflows
- **Storage**: Snowflake, Redshift, BigQuery, Databricks

## Output

### Data Governance Status

```
DATA GOVERNANCE — Q4 2024
═══════════════════════════════════════

Data quality score: 99.1% (target: 99.5%)
Classification coverage: 100% of critical data
Catalog adoption: 90% of tables cataloged
Glossary terms: 156
Compliance: GDPR ✓, CCPA ✓, PCI ✓
Data stewards: 5 active
Open DQ issues: 4 (2 critical, remediating)
```
