IT AI Skill

Data Architecture Design

Design and manage enterprise data architecture including data lakes, data warehouses, data meshes, ETL/ELT pipelines, data governance, and data quality frameworks. Use when designing data platforms, building data pipelines, implementing data governance, pla...

Data Architecture & Design

Design enterprise data platforms, pipelines, governance frameworks, and analytics infrastructure.

Workflow

  1. Assess current data landscape: sources, formats, volumes, quality, ownership, and usage patterns.
  2. Define data architecture vision: centralized (warehouse), decentralized (mesh), or hybrid approach.
  3. Design data platform: storage layer, processing layer, serving layer, governance layer.
  4. Build data pipelines: ingestion (batch + streaming), transformation, loading, orchestration.
  5. Implement data modeling: dimensional model, data vault, or normalized schema per use case.
  6. Establish data governance: data catalog, data dictionary, ownership, quality rules, access control.
  7. Deploy analytics and BI: self-service analytics, dashboards, reporting, data science enablement.
  8. Monitor data health: pipeline reliability, data quality scores, freshness, cost tracking.
  9. Optimize data platform: cost management, performance tuning, architecture evolution.
  10. Scale and evolve: add new data sources, support new use cases, adopt new technologies.

Data Platform Architecture

DATA PLATFORM ARCHITECTURE OPTIONS
====================================

Pattern 1: Data Warehouse (Centralized)

  Architecture:
    Source Systems → ETL → Data Warehouse → BI / Analytics / Reporting

  Technology: Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse
  Strengths: structured data, SQL-based, strong governance, ACID compliance
  Limitations: semi-structured data handling, real-time analytics, flexibility
  Best for: traditional BI, financial reporting, compliance reporting
  Cost: $50K–$500K/year (depending on data volume and concurrency)
  Team: 2–5 data engineers + 1–2 BI developers

  Snowflake pricing (example):
    Compute (warehouses): $3–$1,800/hour per warehouse (XS to 4X-Large)
    Storage: $23/TB/month (compressed)
    Data transfer: $0.005/GB (within region, free in some cases)
    Typical monthly cost: $5,000–$50,000 for mid-size company

Pattern 2: Data Lake (Raw + Processed)

  Architecture:
    Source Systems → Ingestion → Data Lake (raw) → Transformation → Data Lake (curated) → Analytics

  Technology: AWS S3 + Glue + Athena, Azure Data Lake + Databricks, GCP GCS + Dataproc
  Strengths: any data format, scalable, cost-effective storage, machine learning ready
  Limitations: governance complexity, data quality, schema enforcement
  Best for: raw data storage, ML training, log analysis, IoT data
  Cost: $10K–$200K/year (storage is cheap; compute varies)
  Team: 3–8 data engineers + 1–3 data scientists

Pattern 3: Data Lakehouse (Best of Both)

  Architecture:
    Source Systems → Ingestion → Lakehouse (raw + curated) → BI / ML / Analytics

  Technology: Databricks Delta Lake, Apache Iceberg, Apache Hudi
  Strengths: ACID on data lake, schema enforcement, time travel, BI + ML support
  Limitations: newer technology, tooling maturity, skill requirements
  Best for: unified platform for BI and ML, modern data stacks
  Cost: $50K–$500K/year
  Team: 3–10 data engineers + 2–5 analysts/scientists

Pattern 4: Data Mesh (Decentralized)

  Architecture:
    Domain teams own their data products → Data product platform → Consumer self-service

  Technology: Domain-specific databases + central platform (data catalog, governance)
  Strengths: scalability, domain ownership, autonomy, reduced bottlenecks
  Limitations: organizational change required, platform investment, governance complexity
  Best for: large enterprises (1000+ employees), multiple business domains
  Cost: $200K–$2M+ /year (organizational investment + technology)
  Team: central platform team (5–10) + domain data teams (per domain)

Technology Stack Comparison:

  Component            Warehouse Stack        Lake Stack          Lakehouse Stack
  ──────────────────   ──────────────────     ──────────────      ─────────────────
  Storage             DW storage             S3 / ADLS / GCS     S3 + Delta/Iceberg
  Processing          DW SQL                 Spark / Glue         Spark / Databricks
  Orchestration       Airflow / dbt          Airflow / Prefect   Airflow / dbt
  Serving             BI tools               Presto / Trino      BI + Direct Lake
  Governance          Native                 Atlas / Purview      Purview / Amundsen
  Quality             Native                 Great Expectations   Monte Carlo / Soda
  Cost (annual)       $50K–$500K             $10K–$200K          $50K–$500K

Data Pipeline Design

DATA PIPELINE ARCHITECTURE
============================

Pipeline types:

  1. Batch Pipelines (daily/hourly):
     Use case: ETL from source systems, nightly data warehouse loading, daily reports
     Tools: Apache Airflow, dbt, AWS Glue, Azure Data Factory, Fivetran, Stitch
     Pattern: Extract → Transform → Load (ETL) or Extract → Load → Transform (ELT)
     Schedule: hourly, daily, weekly depending on data freshness requirement
     Volume: GB to TB per run
     SLA: complete within defined window (e.g., daily load completes by 6 AM)

  2. Streaming Pipelines (real-time):
     Use case: real-time analytics, fraud detection, IoT telemetry, clickstream
     Tools: Apache Kafka, AWS Kinesis, Azure Event Hubs, Apache Flink, Spark Streaming
     Pattern: Source → Message Queue → Stream Processor → Sink (warehouse, lake, dashboard)
     Latency: seconds to minutes (sub-second for critical paths)
     Volume: KB to GB per second
     SLA: process within defined latency (e.g., < 5 seconds end-to-end)

  3. CDC (Change Data Capture):
     Use case: near-real-time data replication, audit trail, incremental loading
     Tools: Debezium, Fivetran CDC, AWS DMS, Striim, Qlik Replicate
     Pattern: Read database transaction logs → capture changes → replicate to target
     Latency: seconds to minutes behind source
     Volume: only changed rows (typically 1–10% of total data per day)
     SLA: replication lag < 5 minutes (typical)

Pipeline design principles:

  Idempotency:
    - Pipelines can be re-run without duplicating data
    - Use upserts (merge) instead of inserts
    - Track processed records with watermark/timestamp
    - Implement exactly-once or at-least-once semantics

  Error Handling:
    - Retry transient failures (3–5 attempts with exponential backoff)
    - Dead letter queue for unprocessable records
    - Alert on pipeline failure within 5 minutes
    - Runbook for common failure scenarios
    - Automated recovery for known failure patterns

  Monitoring:
    - Pipeline run status: success, failure, running, queued
    - Data volume: rows processed, data size (alert on anomalies)
    - Freshness: last successful run timestamp (alert if behind schedule)
    - Data quality: row count validation, null checks, schema validation
    - Cost: compute hours, storage growth, data transfer costs

  Data Contract:
    - Source system provides data in agreed format and schema
    - Pipeline validates data against contract before processing
    - Schema changes require notification and pipeline update
    - Data quality SLA defined between source owner and pipeline owner

Sample pipeline (Airflow + dbt + Snowflake):

  DAG: daily_data_load
    Schedule: 01:00 UTC daily

    Task 1: extract_source_data
      - Connect to source (API, database, file)
      - Extract new/changed data since last run
      - Load to staging area (raw zone in data lake or staging schema)
      - Duration: 5–30 minutes
      - SLA: complete by 01:30 UTC

    Task 2: validate_raw_data
      - Check row count (vs. expected range)
      - Check for null values in required fields
      - Check data types and formats
      - Alert if validation fails
      - Duration: 1–5 minutes

    Task 3: transform_data (dbt)
      - Clean and standardize data
      - Join with dimension tables
      - Apply business logic and calculations
      - Load to curated/production schema
      - Duration: 10–60 minutes (depends on volume)
      - SLA: complete by 02:30 UTC

    Task 4: validate_curated_data
      - Row count reconciliation (source vs. target)
      - Business rule validation
      - Data quality score calculation
      - Duration: 1–5 minutes

    Task 5: notify_completion
      - Send success/failure notification
      - Update data freshness dashboard
      - Trigger downstream processes (reports, ML model retraining)

Data Modeling

DATA MODELING APPROACHES
=========================

Dimensional Modeling (Kimball):

  Structure:
    Fact tables: measurable events (sales, clicks, transactions)
    Dimension tables: descriptive context (product, customer, time, location)

  Example:

    Fact_Sales (fact table):
      sale_id, date_key, product_key, customer_key, store_key,
      quantity, unit_price, total_amount, discount_amount, profit

    Dim_Product (dimension table):
      product_key, product_id, product_name, category, subcategory,
      brand, unit_of_measure, weight, status

    Dim_Customer (dimension table):
      customer_key, customer_id, customer_name, segment, region, country,
      registration_date, lifetime_value, status

    Dim_Time (dimension table):
      date_key, date, day_of_week, month, quarter, year, holiday_flag, fiscal_period

  Benefits: simple, intuitive, fast query performance, BI-friendly
  Best for: reporting, BI dashboards, ad-hoc analysis
  Tools: dbt, SQL scripts, data warehouse native tools

Data Vault Modeling (Inmon/Davis):

  Structure:
    Hubs: unique business keys (product, customer)
    Links: relationships between hubs (product-purchased-by-customer)
    Satellites: descriptive attributes with timestamps (product attributes, audit trail)

  Benefits: audit trail, flexible, handles complex relationships, scalable
  Best for: enterprise data warehouse, compliance, audit requirements
  Tools: Informatica, Talend, custom ETL

Normalized Modeling (3NF):

  Structure:
    Tables normalized to Third Normal Form (eliminate redundancy)
    Foreign keys reference primary keys
    Denormalization views created for query performance

  Benefits: data integrity, minimal redundancy, efficient storage
  Best for: operational databases, transactional systems
  Limitations: complex queries, joins impact performance

  Example schema:
    Customers(customer_id PK, name, email, created_at)
    Orders(order_id PK, customer_id FK, order_date, status, total)
    Order_Items(item_id PK, order_id FK, product_id FK, quantity, price)
    Products(product_id PK, name, category_id FK, price, status)
    Categories(category_id PK, name, parent_category_id FK)

Data Model Selection Guide:

  Use Dimensional when:
    - Primary use case is BI/reporting
    - Analysts need self-service access
    - Query performance is critical
    - Data volume < 100 TB

  Use Data Vault when:
    - Enterprise-scale warehouse (> 100 TB)
    - Multiple source systems with overlapping data
    - Audit trail and data lineage required
    - Complex integration scenarios

  Use Normalized when:
    - Building operational database
    - Transactional integrity critical
    - Data volume < 10 TB
    - OLTP workload

Data Governance

DATA GOVERNANCE FRAMEWORK
============================

Components:

  1. Data Catalog:
     - Inventory of all data assets (tables, columns, APIs, dashboards)
     - Business glossary (definitions, owners, classifications)
     - Data lineage (source → transformation → consumption)
     - Usage metrics (query frequency, dashboard consumption)
     - Tools: Alation, Collibra, Ataccama, AWS Glue Data Catalog, Purview

  2. Data Dictionary:
     - Technical metadata: column name, data type, length, constraints
     - Business metadata: business definition, owner, sensitivity level
     - Operational metadata: refresh frequency, source system, last updated
     - Example entry:

       Column: customer_id
       Table: Dim_Customer
       Type: VARCHAR(36) — UUID
       Business Definition: Unique identifier for a customer record
       Owner: CRM Team ([email protected])
       Sensitivity: Internal
       Source: Salesforce → MDM → Data Warehouse
       Refresh: Daily at 02:00 UTC
       Sample Values: "550e8400-e29b-41d4-a716-446655440000"

  3. Data Quality Framework:
     - Completeness: % of non-null values in required columns (target: > 99%)
     - Accuracy: values match source of truth (target: > 99.5%)
     - Consistency: same value means same thing across systems (target: > 98%)
     - Timeliness: data available within SLA (target: > 95%)
     - Uniqueness: no duplicate records (target: > 99.9%)
     - Validity: values conform to defined rules/format (target: > 99%)

     Quality monitoring:
       - Automated checks run with each pipeline
       - Quality score per table: average of all dimension scores
       - Alert on quality score drop below threshold
       - Quality dashboard for data stewards

  4. Data Classification:
     - Public: no restriction; can be shared externally
     - Internal: company use only; no external sharing
     - Confidential: restricted access; need business justification
     - Restricted: sensitive data (PII, PHI, financial); role-based access; audit logging
     - Regulated: compliance-critical data; encryption required; retention policies

     Classification process:
       - Auto-classification: scan data for PII patterns (SSN, email, phone)
       - Manual classification: data owners assign sensitivity levels
       - Review: quarterly classification audit
       - Enforcement: access control based on classification

  5. Data Ownership:
     - Business data owner: accountable for data quality and usage (business leader)
     - Technical data owner: responsible for data pipeline and platform (engineer)
     - Data steward: day-to-day data quality and governance (analyst)
     - Data consumer: uses data for analysis/reporting (any employee)

     RACI matrix example:
       Activity               Owner    Steward   Engineer   Consumer
       ────────────────────  ────────  ─────────  ──────────  ─────────
       Define business rules  A        R         C           I
       Monitor data quality   A        R         C           I
       Fix data issues        I        C         R           I
       Approve access         A        C         I           R
       Use data for reports   I        I         I           R

Integration Points

Edge Cases