Data Governance
Establish and maintain data governance programs including data quality, data classification, data lineage, metadata management, data catalog, and regulatory compliance. Use when building data governance frameworks, implementing data quality rules, creating...
Workflow
1. Governance Framework
DATA GOVERNANCE FRAMEWORK
═══════════════════════════════════════
Organizational Structure:
═══════════════════════════════════════
Data Governance Council (Executive):
→ CIO, CDO, CISO, Legal, Compliance
→ Meets quarterly
→ Approves policies, standards, budget
→ Resolves escalations
Data Stewards (Domain Owners):
═══════════════════════════════════════
Domain       Data Owner    Data Steward    Systems
───────────────────────────────────────────────────────────────────────
Customer     VP Sales      Jane Smith      Salesforce, HubSpot
Financial    CFO           Mike Johnson    Oracle ERP, QuickBooks
HR           CHRO          Sarah Lee       Workday, BambooHR
Product      VP Product    David Chen      Jira, ProductBoard
Operations   COO           Amy Wong        SAP, ServiceNow
Data Steward Responsibilities:
→ Define data standards for domain
→ Approve data quality rules
→ Resolve data quality issues
→ Maintain data dictionary entries
→ Classify data sensitivity
→ Approve data access requests
Data Custodians (IT):
═══════════════════════════════════════
→ Implement technical controls
→ Maintain data pipelines
→ Execute backups and recovery
→ Monitor data quality alerts
→ Apply data masking/anonymization
→ Manage metadata tools
2. Data Classification
DATA CLASSIFICATION SCHEME
═══════════════════════════════════════
Classification  Description                    Examples                   Controls
───────────────────────────────────────────────────────────────────────────────────────────────────────────
Public          No harm if disclosed           Marketing materials        None
Internal        Business impact if disclosed   Org charts, processes      Access control
Confidential    Significant harm if disclosed  Customer data, financials  Encryption at rest + transit,
                                                                          access logging, MFA
Restricted      Severe harm if disclosed       PII, PHI, payment data     Encryption + tokenization,
                                                                          strict access, auditing, retention
CLASSIFICATION PROCESS:
═══════════════════════════════════════
1. Discover: Automated scanning for sensitive data
→ Regex patterns (SSN, credit card, email)
→ ML-based classification (DLP tools)
→ Manual review for edge cases
2. Classify: Assign sensitivity level
→ Auto-classify based on patterns
→ Data steward validates classification
→ Tags applied (database columns, file properties)
3. Protect: Apply controls based on classification
→ Encryption (Restricted + Confidential)
→ Access controls (all non-Public)
→ Audit logging (Restricted + Confidential)
→ Data masking (non-production)
→ Retention policy (Restricted)
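The Discover, Classify, and Protect steps above can be sketched in Python as follows. This is a minimal illustration, not a DLP implementation: the regex patterns are deliberately simplified, and the function names (`detect_patterns`, `classify_column`, `required_controls`), the `Internal` default for columns with no match, and the sample values are all assumptions made for the example. Data masking is left out because the list above ties it to non-production environments rather than to a sensitivity level.

```python
import re

# Simplified, illustrative patterns (assumptions). Real scanners add stricter
# validation, for example a Luhn check on card numbers.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}

# Ordered from least to most sensitive, as in the classification scheme.
LEVELS = ["Public", "Internal", "Confidential", "Restricted"]

# SSN and email are PII and card numbers are payment data, so each maps to
# Restricted under the scheme above.
PATTERN_LEVEL = {"ssn": "Restricted", "credit_card": "Restricted", "email": "Restricted"}

# Controls and the levels they apply to, taken from the Protect step.
CONTROLS = {
    "encryption": {"Restricted", "Confidential"},
    "access_controls": {"Internal", "Confidential", "Restricted"},
    "audit_logging": {"Restricted", "Confidential"},
    "retention_policy": {"Restricted"},
}


def detect_patterns(values):
    """Discover: return the names of all patterns found in the sample values."""
    found = set()
    for value in values:
        for name, pattern in PATTERNS.items():
            if pattern.search(str(value)):
                found.add(name)
    return found


def classify_column(values, default="Internal"):
    """Classify: pick the most sensitive level implied by the detected patterns.

    The default for columns with no match is an assumption; a data steward
    would still validate the result.
    """
    levels = [PATTERN_LEVEL[name] for name in detect_patterns(values)]
    if not levels:
        return default
    return max(levels, key=LEVELS.index)


def required_controls(level):
    """Protect: list the controls that apply to a classification level."""
    return sorted(name for name, levels in CONTROLS.items() if level in levels)


sample = ["jane@example.com", "123-45-6789"]
level = classify_column(sample)
print(level, required_controls(level))
```

The auto-assigned level is only a proposal: per the process above, the data steward still validates it before tags are applied.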
CLASSIFICATION RESULTS:
═══════════════════════════════════════
Public: 15% of data
Internal: 40% of data
Confidential: 35% of data
Restricted: 10% of data
Data elements classified: 2,847
Tables classified: 342
Files classified: 15,230
3. Data Quality
DATA QUALITY DIMENSIONS
═══════════════════════════════════════
Dimension     Definition                      Metric           Target
───────────────────────────────────────────────────────────────────────────────
Completeness  Required fields populated       % non-null       ≥ 99%
Accuracy      Data matches source of truth    % correct        ≥ 98%
Consistency   Data consistent across systems  % matching       ≥ 99%
Timeliness    Data available when needed      Latency (hours)  ≤ 24h
Validity      Data matches format/rules       % valid          ≥ 99%
Uniqueness    No duplicate records            % unique         ≥ 99.5%
DATA QUALITY RULES:
═══════════════════════════════════════
Rule ID  Table      Column       Rule                   Target  Current
────────────────────────────────────────────────────────────────────────────
DQ-001   customers  email        Is valid email format  100%    99.2%
DQ-002   customers  phone        Matches phone pattern  99%     98.5%
DQ-003   customers  name         Not null               100%    100%
DQ-004   customers  created_at   Not in future          100%    99.8%
DQ-005   orders     amount       > 0                    100%    99.9%
DQ-006   orders     customer_id  FK to customers        100%    99.7%
DQ-007   products   sku          Unique                 100%    100%
DQ-008   products   price        > 0 AND < 1000000      100%    99.9%
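To make the rule definitions concrete, the sketch below expresses DQ-001 through DQ-008 as plain Python predicates over lists of dictionaries. It is an illustration under assumptions, not the pipeline's actual implementation: the email and phone regexes are simplified stand-ins, the `id` key on customer rows is assumed, and the helper names (`pass_rate`, `unique_rate`, `run_rules`) and sample rows are hypothetical. In practice these checks would more likely live in one of the quality tools listed under Integration Points.

```python
import re
from datetime import datetime, timezone

# Simplified patterns for illustration only (assumptions).
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
PHONE_RE = re.compile(r"^\+?\d[\d\s().-]{6,}$")


def pass_rate(rows, predicate):
    """Return the percentage of rows that satisfy predicate, to one decimal."""
    if not rows:
        return 100.0
    passed = sum(1 for row in rows if predicate(row))
    return round(100.0 * passed / len(rows), 1)


def unique_rate(rows, column):
    """Return distinct values in column as a percentage of all rows."""
    values = [row[column] for row in rows]
    if not values:
        return 100.0
    return round(100.0 * len(set(values)) / len(values), 1)


def run_rules(customers, orders, products, now):
    """Evaluate the eight rules and return {rule_id: percent passing}."""
    customer_ids = {c["id"] for c in customers}  # assumes an "id" key
    return {
        "DQ-001": pass_rate(customers, lambda r: bool(EMAIL_RE.match(r["email"] or ""))),
        "DQ-002": pass_rate(customers, lambda r: bool(PHONE_RE.match(r["phone"] or ""))),
        "DQ-003": pass_rate(customers, lambda r: r["name"] is not None),
        "DQ-004": pass_rate(customers, lambda r: r["created_at"] <= now),
        "DQ-005": pass_rate(orders, lambda r: r["amount"] > 0),
        "DQ-006": pass_rate(orders, lambda r: r["customer_id"] in customer_ids),
        "DQ-007": unique_rate(products, "sku"),
        "DQ-008": pass_rate(products, lambda r: 0 < r["price"] < 1_000_000),
    }


# Hypothetical sample rows: in each table one row passes and one fails.
now = datetime(2024, 12, 31, tzinfo=timezone.utc)
customers = [
    {"id": 1, "email": "ann@example.com", "phone": "+1 555-010-0000", "name": "Ann",
     "created_at": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"id": 2, "email": "not-an-email", "phone": "555", "name": None,
     "created_at": datetime(2025, 6, 1, tzinfo=timezone.utc)},
]
orders = [{"customer_id": 1, "amount": 25.0}, {"customer_id": 9, "amount": 0}]
products = [{"sku": "A-1", "price": 9.99}, {"sku": "A-1", "price": 2_000_000}]

results = run_rules(customers, orders, products, now)
for rule_id, pct in results.items():
    print(rule_id, pct)
```

Keeping each rule as a single predicate keyed by its rule ID makes it straightforward to compare the computed rate with the Target column.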
DATA QUALITY MONITORING:
═══════════════════════════════════════
→ Automated checks: Nightly pipeline
→ Alert: Slack #data-quality channel
→ SLA: Critical rule failures remediated within 24 hours
→ Dashboard: Data quality scorecard (weekly)
→ Trending: Monthly quality report to governance council
Current Data Quality Score: 99.1% (target: 99.5%)
Rules passing: 94 of 98 (96%)
Critical issues: 2 (remediating)
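The monitoring summary above comes down to two checks: which rules are below target, and whether a critical issue has been open longer than the 24-hour SLA. A minimal sketch, assuming the rule figures from the DATA QUALITY RULES table and hypothetical helper names (`failing_rules`, `percent_passing`, `sla_breached`):

```python
from datetime import datetime, timedelta

# (rule_id, target %, current %) copied from the DATA QUALITY RULES table.
RULES = [
    ("DQ-001", 100.0, 99.2),
    ("DQ-002", 99.0, 98.5),
    ("DQ-003", 100.0, 100.0),
    ("DQ-004", 100.0, 99.8),
    ("DQ-005", 100.0, 99.9),
    ("DQ-006", 100.0, 99.7),
    ("DQ-007", 100.0, 100.0),
    ("DQ-008", 100.0, 99.9),
]


def failing_rules(rules):
    """Return the IDs of rules whose current rate is below target."""
    return [rule_id for rule_id, target, current in rules if current < target]


def percent_passing(passing, total):
    """Return the share of passing rules as a whole percentage."""
    return round(100 * passing / total)


def sla_breached(opened_at, now, sla_hours=24):
    """Return True if an issue has been open longer than the SLA window."""
    return now - opened_at > timedelta(hours=sla_hours)


print(failing_rules(RULES))
print(percent_passing(94, 98))  # the "94 of 98" figure from the summary
print(sla_breached(datetime(2024, 12, 1, 8, 0), datetime(2024, 12, 2, 9, 0)))
```

A nightly job could post the output of `failing_rules` to the alert channel and escalate any issue for which `sla_breached` returns true.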
4. Data Lineage & Catalog
DATA LINEAGE
═══════════════════════════════════════
Source → Pipeline → Warehouse → Consumption
Example Lineage:
═══════════════════════════════════════
Salesforce (Customer)
→ CDC (Debezium)
→ Kafka (raw_customers_topic)
→ Spark (transform: deduplicate, enrich)
→ Snowflake (stg.customers)
→ dbt (model: dim_customers)
→ Tableau (Customer Dashboard)
→ ML Pipeline (Churn Model)
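The example lineage can be modeled as a directed graph, which makes impact analysis ("what breaks if this changes?") and root-cause tracing ("where did this come from?") simple traversals. The sketch below is illustrative only: it assumes that both Tableau and the ML pipeline read from the dbt model, since the original nesting is not explicit, and the function names `downstream` and `upstream` are hypothetical.

```python
from collections import deque

# Adjacency list: each node maps to the nodes that consume its output.
# The fan-out from the dbt model to two consumers is an assumption.
LINEAGE = {
    "Salesforce (Customer)": ["CDC (Debezium)"],
    "CDC (Debezium)": ["Kafka (raw_customers_topic)"],
    "Kafka (raw_customers_topic)": ["Spark (deduplicate, enrich)"],
    "Spark (deduplicate, enrich)": ["Snowflake (stg.customers)"],
    "Snowflake (stg.customers)": ["dbt (dim_customers)"],
    "dbt (dim_customers)": [
        "Tableau (Customer Dashboard)",
        "ML Pipeline (Churn Model)",
    ],
}


def downstream(graph, start):
    """Breadth-first search: every node reachable from start, in visit order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def upstream(graph, target):
    """Every node that target depends on, found by walking reversed edges."""
    reverse = {}
    for parent, children in graph.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    return downstream(reverse, target)


# Impact analysis: what is affected if stg.customers changes?
print(downstream(LINEAGE, "Snowflake (stg.customers)"))
# Root-cause tracing: where does the dashboard's data originate?
print(upstream(LINEAGE, "Tableau (Customer Dashboard)"))
```

Dedicated lineage tools store the same kind of graph with richer metadata; the traversal logic is the part that carries over.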
Data Catalog (Amundsen/Alation/Purview):
═══════════════════════════════════════
Features:
→ Search: Find datasets by name, description, tags
→ Metadata: Technical (schema) + Business (glossary)
→ Lineage: Visual dependency graph
→ Usage: Query frequency, popular dashboards
→ Ownership: Data owner, steward, contact
→ Quality: DQ score, SLA status
→ Access: Request access (workflow)
Adoption Metrics:
→ Tables cataloged: 342 of 380 (90%)
→ Business glossary terms: 156
→ Active catalog users: 45 (target: 60)
→ Average search-to-access time: 15 minutes (down from 2 hours)
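As a rough picture of what a catalog entry holds and how search and the coverage metric work, here is a small Python sketch. It does not reflect the data model of Amundsen, Alation, or Purview; the `CatalogEntry` fields, the `search` and `coverage` helpers, and the sample entries (including their descriptions and tags) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CatalogEntry:
    """One cataloged dataset with technical and business metadata."""
    name: str
    description: str
    owner: str
    steward: str
    tags: List[str] = field(default_factory=list)
    dq_score: Optional[float] = None


def search(entries, term):
    """Case-insensitive substring match on name, description, and tags."""
    term = term.lower()
    return [
        entry for entry in entries
        if term in entry.name.lower()
        or term in entry.description.lower()
        or any(term in tag.lower() for tag in entry.tags)
    ]


def coverage(cataloged, total):
    """Return cataloged tables as a whole percentage of all tables."""
    return round(100 * cataloged / total)


# Hypothetical sample entries for illustration.
ENTRIES = [
    CatalogEntry("stg.customers", "Raw customer records staged from Salesforce",
                 "VP Sales", "Jane Smith", ["customer", "staging"]),
    CatalogEntry("dim_customers", "Deduplicated customer dimension",
                 "VP Sales", "Jane Smith", ["customer", "pii"]),
    CatalogEntry("orders", "Order transactions",
                 "COO", "Amy Wong", ["operations"]),
]

print([entry.name for entry in search(ENTRIES, "pii")])
print(coverage(342, 380))  # the "342 of 380" figure from the adoption metrics
```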
5. Regulatory Compliance
REGULATORY COMPLIANCE
═══════════════════════════════════════
GDPR Compliance:
═══════════════════════════════════════
Requirement            Implementation               Status
────────────────────────────────────────────────────────────────────────
Lawful basis           Privacy notice + consent     ✓ Compliant
Right to access        Data subject request portal  ✓ Compliant
Right to erasure       Automated deletion pipeline  ✓ Compliant
Data portability       Export in JSON/CSV           ✓ Compliant
DPO                    Appointed (Jane Smith)       ✓ Compliant
DPA                    Annual assessment            ✓ Compliant
Cross-border transfer  SCCs + adequacy decision     ✓ Compliant
Breach notification    72-hour process defined      ✓ Compliant
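The 72-hour breach notification window in the table is simple date arithmetic, sketched below. The sketch assumes the clock starts when the organization becomes aware of the breach and that timestamps are timezone-aware; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

NOTIFICATION_WINDOW = timedelta(hours=72)


def notification_deadline(aware_at):
    """Return the latest time by which notification should be made."""
    return aware_at + NOTIFICATION_WINDOW


def hours_remaining(aware_at, now):
    """Return hours left before the deadline (negative once it has passed)."""
    remaining = notification_deadline(aware_at) - now
    return remaining.total_seconds() / 3600


aware_at = datetime(2024, 11, 4, 9, 0, tzinfo=timezone.utc)
print(notification_deadline(aware_at).isoformat())
print(hours_remaining(aware_at, aware_at + timedelta(hours=24)))
```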
Data Retention:
═══════════════════════════════════════
Data Type          Retention    Disposition        Tool
────────────────────────────────────────────────────────────────────────
Customer records   7 years      Secure delete      AWS S3 lifecycle
Financial records  10 years     Archive → delete   Oracle ERP policy
Employee records   7 years      Secure delete      Workday retention
Email              3 years      Archive → delete   M365 retention
Logs               1 year       Delete             CloudWatch policy
Backups            30 days      Delete             Backup lifecycle
Personal data      Per purpose  Delete on request  GDPR pipeline
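The retention schedule above can be turned into a disposition-date calculation. The sketch below is an illustration under assumptions: it counts retention from a record's creation date (real policies often count from a closing event such as contract end), it treats "Per purpose" as having no fixed period, and the names `RETENTION`, `add_years`, `disposition_date`, and `is_due` are hypothetical.

```python
from datetime import date, timedelta

# (unit, amount, disposition) per data type, taken from the retention table.
# A unit of None means no fixed period applies.
RETENTION = {
    "Customer records": ("years", 7, "Secure delete"),
    "Financial records": ("years", 10, "Archive → delete"),
    "Employee records": ("years", 7, "Secure delete"),
    "Email": ("years", 3, "Archive → delete"),
    "Logs": ("years", 1, "Delete"),
    "Backups": ("days", 30, "Delete"),
    "Personal data": (None, None, "Delete on request"),
}


def add_years(d, years):
    """Add whole years to a date, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def disposition_date(data_type, created):
    """Return the date the retention period ends, or None if not fixed."""
    unit, amount, _disposition = RETENTION[data_type]
    if unit == "years":
        return add_years(created, amount)
    if unit == "days":
        return created + timedelta(days=amount)
    return None


def is_due(data_type, created, today):
    """Return True once the retention period for a record has elapsed."""
    due = disposition_date(data_type, created)
    return due is not None and today >= due


print(disposition_date("Customer records", date(2018, 3, 15)))
print(is_due("Backups", date(2024, 12, 1), date(2024, 12, 31)))
```

Records for which `is_due` returns true would then be handed to the tool named in the table for the listed disposition.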
CCPA Compliance:
═══════════════════════════════════════
→ "Do Not Sell" link on website
→ Data subject request handling
→ Vendor assessment (third-party data sharing)
→ Privacy policy updated
Edge Cases
- Legacy systems: Manual classification and lineage
- Multi-cloud: Cross-cloud data governance
- Edge devices: IoT data governance
- Federated data: Governance across organizations
- Unstructured data: Document classification (ML-based)
Integration Points
- Catalog: Amundsen, Alation, Purview, DataHub
- Quality: Great Expectations, dbt tests, Monte Carlo
- Classification: DLP tools, Purview, Macie
- Lineage: Apache Atlas, OpenLineage, dbt docs
- Compliance: OneTrust, TrustArc, custom workflows
- Storage: Snowflake, Redshift, BigQuery, Databricks
Output
Data Governance Status
DATA GOVERNANCE — Q4 2024
═══════════════════════════════════════
Data quality score: 99.1% (target: 99.5%)
Classification coverage: 100% of critical data
Catalog adoption: 90% of tables cataloged
Glossary terms: 156
Compliance: GDPR ✓, CCPA ✓, PCI ✓
Data stewards: 6 active
Open DQ issues: 4 (2 critical, remediating)