IT AI Skill
Data Architecture Design
Design and manage enterprise data architecture including data lakes, data warehouses, data meshes, ETL/ELT pipelines, data governance, and data quality frameworks. Use when designing data platforms, building data pipelines, implementing data governance, pla...
Data Architecture & Design
Design enterprise data platforms, pipelines, governance frameworks, and analytics infrastructure.
Workflow
- Assess current data landscape: sources, formats, volumes, quality, ownership, and usage patterns.
- Define data architecture vision: centralized (warehouse), decentralized (mesh), or hybrid approach.
- Design data platform: storage layer, processing layer, serving layer, governance layer.
- Build data pipelines: ingestion (batch + streaming), transformation, loading, orchestration.
- Implement data modeling: dimensional model, data vault, or normalized schema per use case.
- Establish data governance: data catalog, data dictionary, ownership, quality rules, access control.
- Deploy analytics and BI: self-service analytics, dashboards, reporting, data science enablement.
- Monitor data health: pipeline reliability, data quality scores, freshness, cost tracking.
- Optimize data platform: cost management, performance tuning, architecture evolution.
- Scale and evolve: add new data sources, support new use cases, adopt new technologies.
Data Platform Architecture
DATA PLATFORM ARCHITECTURE OPTIONS
====================================
Pattern 1: Data Warehouse (Centralized)
Architecture:
Source Systems → ETL → Data Warehouse → BI / Analytics / Reporting
Technology: Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse
Strengths: structured data, SQL-based, strong governance, ACID compliance
Limitations: semi-structured data handling, real-time analytics, flexibility
Best for: traditional BI, financial reporting, compliance reporting
Cost: $50K–$500K/year (depending on data volume and concurrency)
Team: 2–5 data engineers + 1–2 BI developers
Snowflake pricing (example):
Compute (warehouses): $3–$1,800/hour per warehouse (XS to 4X-Large)
Storage: $23/TB/month (compressed)
Data transfer: $0.005/GB (within region, free in some cases)
Typical monthly cost: $5,000–$50,000 for mid-size company
Pattern 2: Data Lake (Raw + Processed)
Architecture:
Source Systems → Ingestion → Data Lake (raw) → Transformation → Data Lake (curated) → Analytics
Technology: AWS S3 + Glue + Athena, Azure Data Lake + Databricks, GCP GCS + Dataproc
Strengths: any data format, scalable, cost-effective storage, machine learning ready
Limitations: governance complexity, data quality, schema enforcement
Best for: raw data storage, ML training, log analysis, IoT data
Cost: $10K–$200K/year (storage is cheap; compute varies)
Team: 3–8 data engineers + 1–3 data scientists
Pattern 3: Data Lakehouse (Best of Both)
Architecture:
Source Systems → Ingestion → Lakehouse (raw + curated) → BI / ML / Analytics
Technology: Databricks Delta Lake, Apache Iceberg, Apache Hudi
Strengths: ACID on data lake, schema enforcement, time travel, BI + ML support
Limitations: newer technology, tooling maturity, skill requirements
Best for: unified platform for BI and ML, modern data stacks
Cost: $50K–$500K/year
Team: 3–10 data engineers + 2–5 analysts/scientists
Pattern 4: Data Mesh (Decentralized)
Architecture:
Domain teams own their data products → Data product platform → Consumer self-service
Technology: Domain-specific databases + central platform (data catalog, governance)
Strengths: scalability, domain ownership, autonomy, reduced bottlenecks
Limitations: organizational change required, platform investment, governance complexity
Best for: large enterprises (1000+ employees), multiple business domains
Cost: $200K–$2M+ /year (organizational investment + technology)
Team: central platform team (5–10) + domain data teams (per domain)
Technology Stack Comparison:
Component Warehouse Stack Lake Stack Lakehouse Stack
────────────────── ────────────────── ────────────── ─────────────────
Storage DW storage S3 / ADLS / GCS S3 + Delta/Iceberg
Processing DW SQL Spark / Glue Spark / Databricks
Orchestration Airflow / dbt Airflow / Prefect Airflow / dbt
Serving BI tools Presto / Trino BI + Direct Lake
Governance Native Atlas / Purview Purview / Amundsen
Quality Native Great Expectations Monte Carlo / Soda
Cost (annual) $50K–$500K $10K–$200K $50K–$500K
Data Pipeline Design
DATA PIPELINE ARCHITECTURE
============================
Pipeline types:
1. Batch Pipelines (daily/hourly):
Use case: ETL from source systems, nightly data warehouse loading, daily reports
Tools: Apache Airflow, dbt, AWS Glue, Azure Data Factory, Fivetran, Stitch
Pattern: Extract → Transform → Load (ETL) or Extract → Load → Transform (ELT)
Schedule: hourly, daily, weekly depending on data freshness requirement
Volume: GB to TB per run
SLA: complete within defined window (e.g., daily load completes by 6 AM)
2. Streaming Pipelines (real-time):
Use case: real-time analytics, fraud detection, IoT telemetry, clickstream
Tools: Apache Kafka, AWS Kinesis, Azure Event Hubs, Apache Flink, Spark Streaming
Pattern: Source → Message Queue → Stream Processor → Sink (warehouse, lake, dashboard)
Latency: seconds to minutes (sub-second for critical paths)
Volume: KB to GB per second
SLA: process within defined latency (e.g., < 5 seconds end-to-end)
3. CDC (Change Data Capture):
Use case: near-real-time data replication, audit trail, incremental loading
Tools: Debezium, Fivetran CDC, AWS DMS, Striim, Qlik Replicate
Pattern: Read database transaction logs → capture changes → replicate to target
Latency: seconds to minutes behind source
Volume: only changed rows (typically 1–10% of total data per day)
SLA: replication lag < 5 minutes (typical)
Pipeline design principles:
Idempotency:
- Pipelines can be re-run without duplicating data
- Use upserts (merge) instead of inserts
- Track processed records with watermark/timestamp
- Implement exactly-once or at-least-once semantics
Error Handling:
- Retry transient failures (3–5 attempts with exponential backoff)
- Dead letter queue for unprocessable records
- Alert on pipeline failure within 5 minutes
- Runbook for common failure scenarios
- Automated recovery for known failure patterns
Monitoring:
- Pipeline run status: success, failure, running, queued
- Data volume: rows processed, data size (alert on anomalies)
- Freshness: last successful run timestamp (alert if behind schedule)
- Data quality: row count validation, null checks, schema validation
- Cost: compute hours, storage growth, data transfer costs
Data Contract:
- Source system provides data in agreed format and schema
- Pipeline validates data against contract before processing
- Schema changes require notification and pipeline update
- Data quality SLA defined between source owner and pipeline owner
Sample pipeline (Airflow + dbt + Snowflake):
DAG: daily_data_load
Schedule: 01:00 UTC daily
Task 1: extract_source_data
- Connect to source (API, database, file)
- Extract new/changed data since last run
- Load to staging area (raw zone in data lake or staging schema)
- Duration: 5–30 minutes
- SLA: complete by 01:30 UTC
Task 2: validate_raw_data
- Check row count (vs. expected range)
- Check for null values in required fields
- Check data types and formats
- Alert if validation fails
- Duration: 1–5 minutes
Task 3: transform_data (dbt)
- Clean and standardize data
- Join with dimension tables
- Apply business logic and calculations
- Load to curated/production schema
- Duration: 10–60 minutes (depends on volume)
- SLA: complete by 02:30 UTC
Task 4: validate_curated_data
- Row count reconciliation (source vs. target)
- Business rule validation
- Data quality score calculation
- Duration: 1–5 minutes
Task 5: notify_completion
- Send success/failure notification
- Update data freshness dashboard
- Trigger downstream processes (reports, ML model retraining)
Data Modeling
DATA MODELING APPROACHES
=========================
Dimensional Modeling (Kimball):
Structure:
Fact tables: measurable events (sales, clicks, transactions)
Dimension tables: descriptive context (product, customer, time, location)
Example:
Fact_Sales (fact table):
sale_id, date_key, product_key, customer_key, store_key,
quantity, unit_price, total_amount, discount_amount, profit
Dim_Product (dimension table):
product_key, product_id, product_name, category, subcategory,
brand, unit_of_measure, weight, status
Dim_Customer (dimension table):
customer_key, customer_id, customer_name, segment, region, country,
registration_date, lifetime_value, status
Dim_Time (dimension table):
date_key, date, day_of_week, month, quarter, year, holiday_flag, fiscal_period
Benefits: simple, intuitive, fast query performance, BI-friendly
Best for: reporting, BI dashboards, ad-hoc analysis
Tools: dbt, SQL scripts, data warehouse native tools
Data Vault Modeling (Inmon/Davis):
Structure:
Hubs: unique business keys (product, customer)
Links: relationships between hubs (product-purchased-by-customer)
Satellites: descriptive attributes with timestamps (product attributes, audit trail)
Benefits: audit trail, flexible, handles complex relationships, scalable
Best for: enterprise data warehouse, compliance, audit requirements
Tools: Informatica, Talend, custom ETL
Normalized Modeling (3NF):
Structure:
Tables normalized to Third Normal Form (eliminate redundancy)
Foreign keys reference primary keys
Denormalization views created for query performance
Benefits: data integrity, minimal redundancy, efficient storage
Best for: operational databases, transactional systems
Limitations: complex queries, joins impact performance
Example schema:
Customers(customer_id PK, name, email, created_at)
Orders(order_id PK, customer_id FK, order_date, status, total)
Order_Items(item_id PK, order_id FK, product_id FK, quantity, price)
Products(product_id PK, name, category_id FK, price, status)
Categories(category_id PK, name, parent_category_id FK)
Data Model Selection Guide:
Use Dimensional when:
- Primary use case is BI/reporting
- Analysts need self-service access
- Query performance is critical
- Data volume < 100 TB
Use Data Vault when:
- Enterprise-scale warehouse (> 100 TB)
- Multiple source systems with overlapping data
- Audit trail and data lineage required
- Complex integration scenarios
Use Normalized when:
- Building operational database
- Transactional integrity critical
- Data volume < 10 TB
- OLTP workload
Data Governance
DATA GOVERNANCE FRAMEWORK
============================
Components:
1. Data Catalog:
- Inventory of all data assets (tables, columns, APIs, dashboards)
- Business glossary (definitions, owners, classifications)
- Data lineage (source → transformation → consumption)
- Usage metrics (query frequency, dashboard consumption)
- Tools: Alation, Collibra, Ataccama, AWS Glue Data Catalog, Purview
2. Data Dictionary:
- Technical metadata: column name, data type, length, constraints
- Business metadata: business definition, owner, sensitivity level
- Operational metadata: refresh frequency, source system, last updated
- Example entry:
Column: customer_id
Table: Dim_Customer
Type: VARCHAR(36) — UUID
Business Definition: Unique identifier for a customer record
Owner: CRM Team ([email protected])
Sensitivity: Internal
Source: Salesforce → MDM → Data Warehouse
Refresh: Daily at 02:00 UTC
Sample Values: "550e8400-e29b-41d4-a716-446655440000"
3. Data Quality Framework:
- Completeness: % of non-null values in required columns (target: > 99%)
- Accuracy: values match source of truth (target: > 99.5%)
- Consistency: same value means same thing across systems (target: > 98%)
- Timeliness: data available within SLA (target: > 95%)
- Uniqueness: no duplicate records (target: > 99.9%)
- Validity: values conform to defined rules/format (target: > 99%)
Quality monitoring:
- Automated checks run with each pipeline
- Quality score per table: average of all dimension scores
- Alert on quality score drop below threshold
- Quality dashboard for data stewards
4. Data Classification:
- Public: no restriction; can be shared externally
- Internal: company use only; no external sharing
- Confidential: restricted access; need business justification
- Restricted: sensitive data (PII, PHI, financial); role-based access; audit logging
- Regulated: compliance-critical data; encryption required; retention policies
Classification process:
- Auto-classification: scan data for PII patterns (SSN, email, phone)
- Manual classification: data owners assign sensitivity levels
- Review: quarterly classification audit
- Enforcement: access control based on classification
5. Data Ownership:
- Business data owner: accountable for data quality and usage (business leader)
- Technical data owner: responsible for data pipeline and platform (engineer)
- Data steward: day-to-day data quality and governance (analyst)
- Data consumer: uses data for analysis/reporting (any employee)
RACI matrix example:
Activity Owner Steward Engineer Consumer
──────────────────── ──────── ───────── ────────── ─────────
Define business rules A R C I
Monitor data quality A R C I
Fix data issues I C R I
Approve access A C I R
Use data for reports I I I R
Integration Points
- Snowflake / BigQuery / Redshift: Cloud data warehouses; SQL analytics; data sharing; native ML
- Databricks: Lakehouse platform; Spark processing; Delta Lake; ML workflows; collaborative notebooks
- Fivetran / Airbyte / Stitch: Automated data ingestion; 200+ connectors; CDC support; schema change handling
- dbt (data build tool): SQL-based transformation; testing; documentation; CI/CD for data models
- Apache Airflow / Prefect / Dagster: Pipeline orchestration; scheduling; monitoring; dependency management
- Apache Kafka / Pulsar: Streaming data platform; event streaming; real-time analytics
- Monte Carlo / Datafold / Great Expectations: Data quality monitoring; anomaly detection; data testing
- Alation / Collibra / Purview: Data catalog; governance; business glossary; data lineage
- Tableau / Power BI / Looker: BI and visualization; self-service analytics; dashboard publishing
Edge Cases
- Real-time analytics at scale (millions of events per second): Apache Kafka for ingestion; Flink/Spark Streaming for processing; Materialize/kSQL for real-time materialized views; ClickHouse/Druid for real-time OLAP; sub-second latency requirement
- Architecture: Kafka → Flink (processing) → ClickHouse (serving) → Dashboard (real-time)
- Infrastructure: Kafka cluster (5–9 brokers); Flink cluster (auto-scaling); ClickHouse (distributed)
- Cost: $10K–$100K/month depending on throughput and retention
- Complexity: high; requires specialized streaming engineering skills
- Data mesh implementation (organizational transformation): Domain-driven data ownership; self-serve data platform; data as product; federated governance; 12–24 month transformation; change management critical; start with 2–3 pilot domains
- Platform requirements: data catalog, quality framework, lineage, access management, CI/CD for data
- Domain team structure: data product owner, data engineer, data analyst per domain
- Challenges: cultural shift from central to decentralized; platform readiness; skill gaps
- Timeline: Phase 1 (6 months): platform + 2 pilot domains; Phase 2 (6–12 months): expand to all domains
- Legacy data warehouse migration (Teradata/Oracle → Snowflake/BigQuery): Assess current schema and ETL; redesign for target platform; parallel run for validation; data reconciliation; cutover during maintenance window; decommission legacy after 90-day validation
- Assessment: inventory all tables, views, stored procedures, ETL jobs, reports (200–500+ objects)
- Migration: automated conversion tools (80% automated, 20% manual) + custom rewrite
- Validation: row counts, aggregation checks, sample query comparison
- Timeline: 3–9 months depending on complexity
- Cost: $100K–$1M (tools + labor); cloud cost may be 30–50% less than legacy
- Multi-tenant data architecture (SaaS with data isolation per tenant): Schema-per-tenant vs. shared-schema with tenant_id; row-level security; data isolation testing; cross-tenant aggregation (analytics); tenant-specific data retention; compliance per tenant jurisdiction
- Schema-per-tenant: full isolation; higher operational cost; simpler security; best for enterprise tenants
- Shared schema: cost-efficient; complex security; row-level security enforcement; best for SMB tenants
- Hybrid: schema-per-tenant for enterprise, shared for SMB
- Data residency and sovereignty (data must stay in specific regions): Region-specific data storage; cross-region replication restrictions; data transfer logging; compliance reporting per region; architecture design with geographic boundaries
- EU data (GDPR): store in EU region; limited transfer outside EU; SCC or adequacy decision required
- China (PIPL): data stored in China; CAC security assessment for cross-border transfer
- Healthcare (HIPAA): BAA with cloud provider; encryption at rest and in transit; access logging
- Architecture: regional data hubs with global analytics (aggregate, anonymized data only for cross-region)
- Data cost management (cloud data platform costs spiraling): Implement FinOps for data; track cost per pipeline, per team, per query; query optimization (avoid full table scans, use clustering); storage tiering (hot/warm/cold); auto-suspend compute; budget alerts
- Snowflake: auto-suspend warehouses after 5–10 min idle; right-size warehouse size; monitor credit consumption
- BigQuery: use slot-based pricing for predictable costs; partition tables; cluster on frequently filtered columns
- Databricks: auto-termination for clusters; photon engine for performance; spot instances for fault-tolerant workloads
- Alert: daily cost > $X; monthly budget > $Y; single query cost > $Z