---
name: data-architecture-design
description: Design and manage enterprise data architecture including data lakes, data warehouses, data meshes, ETL/ELT pipelines, data governance, and data quality frameworks. Use when designing data platforms, building data pipelines, implementing data governance, planning data warehouse modernization, establishing data quality standards, or architecting analytics solutions. Triggers on phrases like "data architecture", "data warehouse", "data lake", "ETL pipeline", "data mesh", "data governance", "analytics platform", "data modeling", "data integration", "data quality".
---

# Data Architecture & Design

Design enterprise data platforms, pipelines, governance frameworks, and analytics infrastructure.

## Workflow

1. Assess current data landscape: sources, formats, volumes, quality, ownership, and usage patterns.
2. Define data architecture vision: centralized (warehouse), decentralized (mesh), or hybrid approach.
3. Design data platform: storage layer, processing layer, serving layer, governance layer.
4. Build data pipelines: ingestion (batch + streaming), transformation, loading, orchestration.
5. Implement data modeling: dimensional model, data vault, or normalized schema per use case.
6. Establish data governance: data catalog, data dictionary, ownership, quality rules, access control.
7. Deploy analytics and BI: self-service analytics, dashboards, reporting, data science enablement.
8. Monitor data health: pipeline reliability, data quality scores, freshness, cost tracking.
9. Optimize data platform: cost management, performance tuning, architecture evolution.
10. Scale and evolve: add new data sources, support new use cases, adopt new technologies.

## Data Platform Architecture

```
DATA PLATFORM ARCHITECTURE OPTIONS
====================================

Pattern 1: Data Warehouse (Centralized)

  Architecture:
    Source Systems → ETL → Data Warehouse → BI / Analytics / Reporting

  Technology: Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse
  Strengths: structured data, SQL-based, strong governance, ACID compliance
  Limitations: semi-structured data handling, real-time analytics, flexibility
  Best for: traditional BI, financial reporting, compliance reporting
  Cost: $50K–$500K/year (depending on data volume and concurrency)
  Team: 2–5 data engineers + 1–2 BI developers

  Snowflake pricing (example):
    Compute (warehouses): $3–$1,800/hour per warehouse (XS to 4X-Large)
    Storage: $23/TB/month (compressed)
    Data transfer: $0.005/GB (within region, free in some cases)
    Typical monthly cost: $5,000–$50,000 for mid-size company

Pattern 2: Data Lake (Raw + Processed)

  Architecture:
    Source Systems → Ingestion → Data Lake (raw) → Transformation → Data Lake (curated) → Analytics

  Technology: AWS S3 + Glue + Athena, Azure Data Lake + Databricks, GCP GCS + Dataproc
  Strengths: any data format, scalable, cost-effective storage, machine learning ready
  Limitations: governance complexity, data quality, schema enforcement
  Best for: raw data storage, ML training, log analysis, IoT data
  Cost: $10K–$200K/year (storage is cheap; compute varies)
  Team: 3–8 data engineers + 1–3 data scientists

Pattern 3: Data Lakehouse (Best of Both)

  Architecture:
    Source Systems → Ingestion → Lakehouse (raw + curated) → BI / ML / Analytics

  Technology: Databricks Delta Lake, Apache Iceberg, Apache Hudi
  Strengths: ACID on data lake, schema enforcement, time travel, BI + ML support
  Limitations: newer technology, tooling maturity, skill requirements
  Best for: unified platform for BI and ML, modern data stacks
  Cost: $50K–$500K/year
  Team: 3–10 data engineers + 2–5 analysts/scientists

Pattern 4: Data Mesh (Decentralized)

  Architecture:
    Domain teams own their data products → Data product platform → Consumer self-service

  Technology: Domain-specific databases + central platform (data catalog, governance)
  Strengths: scalability, domain ownership, autonomy, reduced bottlenecks
  Limitations: organizational change required, platform investment, governance complexity
  Best for: large enterprises (1000+ employees), multiple business domains
  Cost: $200K–$2M+ /year (organizational investment + technology)
  Team: central platform team (5–10) + domain data teams (per domain)

Technology Stack Comparison:

  Component            Warehouse Stack        Lake Stack          Lakehouse Stack
  ──────────────────   ──────────────────     ──────────────      ─────────────────
  Storage             DW storage             S3 / ADLS / GCS     S3 + Delta/Iceberg
  Processing          DW SQL                 Spark / Glue         Spark / Databricks
  Orchestration       Airflow / dbt          Airflow / Prefect   Airflow / dbt
  Serving             BI tools               Presto / Trino      BI + Direct Lake
  Governance          Native                 Atlas / Purview      Purview / Amundsen
  Quality             Native                 Great Expectations   Monte Carlo / Soda
  Cost (annual)       $50K–$500K             $10K–$200K          $50K–$500K
```

## Data Pipeline Design

```
DATA PIPELINE ARCHITECTURE
============================

Pipeline types:

  1. Batch Pipelines (daily/hourly):
     Use case: ETL from source systems, nightly data warehouse loading, daily reports
     Tools: Apache Airflow, dbt, AWS Glue, Azure Data Factory, Fivetran, Stitch
     Pattern: Extract → Transform → Load (ETL) or Extract → Load → Transform (ELT)
     Schedule: hourly, daily, weekly depending on data freshness requirement
     Volume: GB to TB per run
     SLA: complete within defined window (e.g., daily load completes by 6 AM)

  2. Streaming Pipelines (real-time):
     Use case: real-time analytics, fraud detection, IoT telemetry, clickstream
     Tools: Apache Kafka, AWS Kinesis, Azure Event Hubs, Apache Flink, Spark Streaming
     Pattern: Source → Message Queue → Stream Processor → Sink (warehouse, lake, dashboard)
     Latency: seconds to minutes (sub-second for critical paths)
     Volume: KB to GB per second
     SLA: process within defined latency (e.g., < 5 seconds end-to-end)

  3. CDC (Change Data Capture):
     Use case: near-real-time data replication, audit trail, incremental loading
     Tools: Debezium, Fivetran CDC, AWS DMS, Striim, Qlik Replicate
     Pattern: Read database transaction logs → capture changes → replicate to target
     Latency: seconds to minutes behind source
     Volume: only changed rows (typically 1–10% of total data per day)
     SLA: replication lag < 5 minutes (typical)

Pipeline design principles:

  Idempotency:
    - Pipelines can be re-run without duplicating data
    - Use upserts (merge) instead of inserts
    - Track processed records with watermark/timestamp
    - Implement exactly-once or at-least-once semantics

  Error Handling:
    - Retry transient failures (3–5 attempts with exponential backoff)
    - Dead letter queue for unprocessable records
    - Alert on pipeline failure within 5 minutes
    - Runbook for common failure scenarios
    - Automated recovery for known failure patterns

  Monitoring:
    - Pipeline run status: success, failure, running, queued
    - Data volume: rows processed, data size (alert on anomalies)
    - Freshness: last successful run timestamp (alert if behind schedule)
    - Data quality: row count validation, null checks, schema validation
    - Cost: compute hours, storage growth, data transfer costs

  Data Contract:
    - Source system provides data in agreed format and schema
    - Pipeline validates data against contract before processing
    - Schema changes require notification and pipeline update
    - Data quality SLA defined between source owner and pipeline owner

Sample pipeline (Airflow + dbt + Snowflake):

  DAG: daily_data_load
    Schedule: 01:00 UTC daily

    Task 1: extract_source_data
      - Connect to source (API, database, file)
      - Extract new/changed data since last run
      - Load to staging area (raw zone in data lake or staging schema)
      - Duration: 5–30 minutes
      - SLA: complete by 01:30 UTC

    Task 2: validate_raw_data
      - Check row count (vs. expected range)
      - Check for null values in required fields
      - Check data types and formats
      - Alert if validation fails
      - Duration: 1–5 minutes

    Task 3: transform_data (dbt)
      - Clean and standardize data
      - Join with dimension tables
      - Apply business logic and calculations
      - Load to curated/production schema
      - Duration: 10–60 minutes (depends on volume)
      - SLA: complete by 02:30 UTC

    Task 4: validate_curated_data
      - Row count reconciliation (source vs. target)
      - Business rule validation
      - Data quality score calculation
      - Duration: 1–5 minutes

    Task 5: notify_completion
      - Send success/failure notification
      - Update data freshness dashboard
      - Trigger downstream processes (reports, ML model retraining)
```

## Data Modeling

```
DATA MODELING APPROACHES
=========================

Dimensional Modeling (Kimball):

  Structure:
    Fact tables: measurable events (sales, clicks, transactions)
    Dimension tables: descriptive context (product, customer, time, location)

  Example:

    Fact_Sales (fact table):
      sale_id, date_key, product_key, customer_key, store_key,
      quantity, unit_price, total_amount, discount_amount, profit

    Dim_Product (dimension table):
      product_key, product_id, product_name, category, subcategory,
      brand, unit_of_measure, weight, status

    Dim_Customer (dimension table):
      customer_key, customer_id, customer_name, segment, region, country,
      registration_date, lifetime_value, status

    Dim_Time (dimension table):
      date_key, date, day_of_week, month, quarter, year, holiday_flag, fiscal_period

  Benefits: simple, intuitive, fast query performance, BI-friendly
  Best for: reporting, BI dashboards, ad-hoc analysis
  Tools: dbt, SQL scripts, data warehouse native tools

Data Vault Modeling (Inmon/Davis):

  Structure:
    Hubs: unique business keys (product, customer)
    Links: relationships between hubs (product-purchased-by-customer)
    Satellites: descriptive attributes with timestamps (product attributes, audit trail)

  Benefits: audit trail, flexible, handles complex relationships, scalable
  Best for: enterprise data warehouse, compliance, audit requirements
  Tools: Informatica, Talend, custom ETL

Normalized Modeling (3NF):

  Structure:
    Tables normalized to Third Normal Form (eliminate redundancy)
    Foreign keys reference primary keys
    Denormalization views created for query performance

  Benefits: data integrity, minimal redundancy, efficient storage
  Best for: operational databases, transactional systems
  Limitations: complex queries, joins impact performance

  Example schema:
    Customers(customer_id PK, name, email, created_at)
    Orders(order_id PK, customer_id FK, order_date, status, total)
    Order_Items(item_id PK, order_id FK, product_id FK, quantity, price)
    Products(product_id PK, name, category_id FK, price, status)
    Categories(category_id PK, name, parent_category_id FK)

Data Model Selection Guide:

  Use Dimensional when:
    - Primary use case is BI/reporting
    - Analysts need self-service access
    - Query performance is critical
    - Data volume < 100 TB

  Use Data Vault when:
    - Enterprise-scale warehouse (> 100 TB)
    - Multiple source systems with overlapping data
    - Audit trail and data lineage required
    - Complex integration scenarios

  Use Normalized when:
    - Building operational database
    - Transactional integrity critical
    - Data volume < 10 TB
    - OLTP workload
```

## Data Governance

```
DATA GOVERNANCE FRAMEWORK
============================

Components:

  1. Data Catalog:
     - Inventory of all data assets (tables, columns, APIs, dashboards)
     - Business glossary (definitions, owners, classifications)
     - Data lineage (source → transformation → consumption)
     - Usage metrics (query frequency, dashboard consumption)
     - Tools: Alation, Collibra, Ataccama, AWS Glue Data Catalog, Purview

  2. Data Dictionary:
     - Technical metadata: column name, data type, length, constraints
     - Business metadata: business definition, owner, sensitivity level
     - Operational metadata: refresh frequency, source system, last updated
     - Example entry:

       Column: customer_id
       Table: Dim_Customer
       Type: VARCHAR(36) — UUID
       Business Definition: Unique identifier for a customer record
       Owner: CRM Team (john.doe@company.com)
       Sensitivity: Internal
       Source: Salesforce → MDM → Data Warehouse
       Refresh: Daily at 02:00 UTC
       Sample Values: "550e8400-e29b-41d4-a716-446655440000"

  3. Data Quality Framework:
     - Completeness: % of non-null values in required columns (target: > 99%)
     - Accuracy: values match source of truth (target: > 99.5%)
     - Consistency: same value means same thing across systems (target: > 98%)
     - Timeliness: data available within SLA (target: > 95%)
     - Uniqueness: no duplicate records (target: > 99.9%)
     - Validity: values conform to defined rules/format (target: > 99%)

     Quality monitoring:
       - Automated checks run with each pipeline
       - Quality score per table: average of all dimension scores
       - Alert on quality score drop below threshold
       - Quality dashboard for data stewards

  4. Data Classification:
     - Public: no restriction; can be shared externally
     - Internal: company use only; no external sharing
     - Confidential: restricted access; need business justification
     - Restricted: sensitive data (PII, PHI, financial); role-based access; audit logging
     - Regulated: compliance-critical data; encryption required; retention policies

     Classification process:
       - Auto-classification: scan data for PII patterns (SSN, email, phone)
       - Manual classification: data owners assign sensitivity levels
       - Review: quarterly classification audit
       - Enforcement: access control based on classification

  5. Data Ownership:
     - Business data owner: accountable for data quality and usage (business leader)
     - Technical data owner: responsible for data pipeline and platform (engineer)
     - Data steward: day-to-day data quality and governance (analyst)
     - Data consumer: uses data for analysis/reporting (any employee)

     RACI matrix example:
       Activity               Owner    Steward   Engineer   Consumer
       ────────────────────  ────────  ─────────  ──────────  ─────────
       Define business rules  A        R         C           I
       Monitor data quality   A        R         C           I
       Fix data issues        I        C         R           I
       Approve access         A        C         I           R
       Use data for reports   I        I         I           R
```

## Integration Points

- **Snowflake / BigQuery / Redshift**: Cloud data warehouses; SQL analytics; data sharing; native ML
- **Databricks**: Lakehouse platform; Spark processing; Delta Lake; ML workflows; collaborative notebooks
- **Fivetran / Airbyte / Stitch**: Automated data ingestion; 200+ connectors; CDC support; schema change handling
- **dbt (data build tool)**: SQL-based transformation; testing; documentation; CI/CD for data models
- **Apache Airflow / Prefect / Dagster**: Pipeline orchestration; scheduling; monitoring; dependency management
- **Apache Kafka / Pulsar**: Streaming data platform; event streaming; real-time analytics
- **Monte Carlo / Datafold / Great Expectations**: Data quality monitoring; anomaly detection; data testing
- **Alation / Collibra / Purview**: Data catalog; governance; business glossary; data lineage
- **Tableau / Power BI / Looker**: BI and visualization; self-service analytics; dashboard publishing

## Edge Cases

- **Real-time analytics at scale** (millions of events per second): Apache Kafka for ingestion; Flink/Spark Streaming for processing; Materialize/kSQL for real-time materialized views; ClickHouse/Druid for real-time OLAP; sub-second latency requirement
  - Architecture: Kafka → Flink (processing) → ClickHouse (serving) → Dashboard (real-time)
  - Infrastructure: Kafka cluster (5–9 brokers); Flink cluster (auto-scaling); ClickHouse (distributed)
  - Cost: $10K–$100K/month depending on throughput and retention
  - Complexity: high; requires specialized streaming engineering skills

- **Data mesh implementation** (organizational transformation): Domain-driven data ownership; self-serve data platform; data as product; federated governance; 12–24 month transformation; change management critical; start with 2–3 pilot domains
  - Platform requirements: data catalog, quality framework, lineage, access management, CI/CD for data
  - Domain team structure: data product owner, data engineer, data analyst per domain
  - Challenges: cultural shift from central to decentralized; platform readiness; skill gaps
  - Timeline: Phase 1 (6 months): platform + 2 pilot domains; Phase 2 (6–12 months): expand to all domains

- **Legacy data warehouse migration** (Teradata/Oracle → Snowflake/BigQuery): Assess current schema and ETL; redesign for target platform; parallel run for validation; data reconciliation; cutover during maintenance window; decommission legacy after 90-day validation
  - Assessment: inventory all tables, views, stored procedures, ETL jobs, reports (200–500+ objects)
  - Migration: automated conversion tools (80% automated, 20% manual) + custom rewrite
  - Validation: row counts, aggregation checks, sample query comparison
  - Timeline: 3–9 months depending on complexity
  - Cost: $100K–$1M (tools + labor); cloud cost may be 30–50% less than legacy

- **Multi-tenant data architecture** (SaaS with data isolation per tenant): Schema-per-tenant vs. shared-schema with tenant_id; row-level security; data isolation testing; cross-tenant aggregation (analytics); tenant-specific data retention; compliance per tenant jurisdiction
  - Schema-per-tenant: full isolation; higher operational cost; simpler security; best for enterprise tenants
  - Shared schema: cost-efficient; complex security; row-level security enforcement; best for SMB tenants
  - Hybrid: schema-per-tenant for enterprise, shared for SMB

- **Data residency and sovereignty** (data must stay in specific regions): Region-specific data storage; cross-region replication restrictions; data transfer logging; compliance reporting per region; architecture design with geographic boundaries
  - EU data (GDPR): store in EU region; limited transfer outside EU; SCC or adequacy decision required
  - China (PIPL): data stored in China; CAC security assessment for cross-border transfer
  - Healthcare (HIPAA): BAA with cloud provider; encryption at rest and in transit; access logging
  - Architecture: regional data hubs with global analytics (aggregate, anonymized data only for cross-region)

- **Data cost management** (cloud data platform costs spiraling): Implement FinOps for data; track cost per pipeline, per team, per query; query optimization (avoid full table scans, use clustering); storage tiering (hot/warm/cold); auto-suspend compute; budget alerts
  - Snowflake: auto-suspend warehouses after 5–10 min idle; right-size warehouse size; monitor credit consumption
  - BigQuery: use slot-based pricing for predictable costs; partition tables; cluster on frequently filtered columns
  - Databricks: auto-termination for clusters; photon engine for performance; spot instances for fault-tolerant workloads
  - Alert: daily cost > $X; monthly budget > $Y; single query cost > $Z
