Manage AI/ML model lifecycle including model deployment orchestration, feature store management, model monitoring and drift detection, automated retraining pipelines, model governance and compliance, A/B testing frameworks, and ML cost optimization. Use whe...

AI/ML Operations

Deploy, monitor, govern, and optimize machine learning models at enterprise scale.

Workflow

  1. Establish MLOps foundation: model registry, CI/CD pipeline, monitoring infrastructure, governance framework.
  2. Design training pipeline: data ingestion, feature engineering, model training, validation, registration.
  3. Implement deployment strategy: canary, A/B, shadow, or blue-green deployment with automated rollback.
  4. Set up continuous monitoring: performance metrics, data drift, concept drift, infrastructure health.
  5. Configure automated retraining: trigger conditions, pipeline orchestration, validation gates, approval workflow.
  6. Establish governance: model cards, bias testing, explainability, audit trails, regulatory compliance.
  7. Optimize costs: inference compute, feature storage, training resources, pipeline efficiency.
  8. Scale operations: multi-model management, team enablement, self-service ML platform.

Model Registry & Versioning

Enterprise Model Registry Architecture

MODEL REGISTRY — ENTERPRISE ARCHITECTURE
==========================================

Registry Platform: MLflow + Custom Extensions (or SageMaker Model Registry, W&B)
Total Registered Models: 47 (across all teams)

MODEL INVENTORY:
  ┌───────────────────────────┬──────────┬────────────┬────────────┬──────────────────────┬──────────────┐
  │ Model Name                │ Version  │ Framework  │ Status     │ Deployed To          │ Owner        │
  ├───────────────────────────┼──────────┼────────────┼────────────┼──────────────────────┼──────────────┤
  │ Fraud Detection Engine    │ 3.2.1    │ XGBoost    │ Production │ API Gateway (us-east)│ Risk Team    │
  │ Customer Churn Predictor  │ 4.1.0    │ PyTorch    │ Production │ Batch Scheduler      │ Marketing    │
  │ Product Recommender       │ 2.8.3    │ TensorFlow │ Production │ Real-time API        │ Product Team │
  │ Sentiment Analyzer        │ 1.5.2    │ HuggingFace│ Production │ Streaming Pipeline   │ CX Team      │
  │ Demand Forecaster         │ 5.0.1    │ Prophet    │ Production │ Daily Batch Job      │ Supply Chain │
  │ Credit Scoring Model      │ 3.1.0    │ XGBoost    │ Staging    │ N/A (audit pending)  │ Risk Team    │
  │ Image Classifier v2       │ 2.0.0    │ PyTorch    │ Dev        │ N/A (testing)        │ Eng Team     │
  │ NLP Intent Classifier     │ 1.3.4    │ HuggingFace│ Production │ Real-time API        │ Support Team │
  │ Dynamic Pricing Engine    │ 2.2.1    │ LightGBM   │ Production │ API Gateway (global) │ Finance Team │
  │ Anomaly Detector          │ 1.1.0    │ Isolation  │ Production │ Streaming Pipeline   │ Security     │
  │ Document Classifier       │ 3.0.2    │ spaCy      │ Production │ Batch + API          │ Legal Team   │
  │ Customer Segmentation     │ 2.5.0    │ K-Means    │ Production │ Weekly Batch Job     │ Analytics    │
  └───────────────────────────┴──────────┴────────────┴────────────┴──────────────────────┴──────────────┘

VERSION CONTROL & LINEAGE:
  Each model version captures:
    - Training dataset snapshot (with data hash: SHA-256)
    - Feature store version reference
    - Full hyperparameter configuration (JSON)
    - Training environment (Docker image hash, GPU type, Python version)
    - Training duration and resource consumption
    - Evaluation metrics (accuracy, precision, recall, F1, AUC-ROC, calibration)
    - Test set performance breakdown (by segment, by feature group)
    - Model artifact (serialized model + dependencies)
    - SBOM (Software Bill of Materials for model dependencies)
    - Approval chain: data scientist → ML engineer → MLOps → governance review
    
  Version naming convention: MAJOR.MINOR.PATCH
    MAJOR: Architectural change (new model type, new features) — requires full revalidation
    MINOR: Performance improvement (hyperparameter tuning, more data) — requires regression testing
    PATCH: Bug fix (data quality fix, edge case handling) — requires smoke testing
    
  Model lifecycle states:
    Development → Staging → Approved → Production → Deprecated → Archived
    (Each transition requires specific approvals and validations)
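
The naming convention and lifecycle above can be sketched as two small checks. This is a minimal illustration, not the registry's actual implementation: the names `Stage`, `can_transition`, and `required_validation` are hypothetical, and the sketch assumes transitions only ever move one step forward (rollback is not modeled).

```python
from enum import Enum


class Stage(Enum):
    DEVELOPMENT = "Development"
    STAGING = "Staging"
    APPROVED = "Approved"
    PRODUCTION = "Production"
    DEPRECATED = "Deprecated"
    ARCHIVED = "Archived"


# Enum members iterate in definition order, which matches the lifecycle order.
_ORDER = list(Stage)


def can_transition(current: Stage, target: Stage) -> bool:
    """Return True only for a single forward step (assumption: no skips, no rollback)."""
    return _ORDER.index(target) - _ORDER.index(current) == 1


def required_validation(old_version: str, new_version: str) -> str:
    """Map a MAJOR.MINOR.PATCH bump to the validation level listed above."""
    old = tuple(int(part) for part in old_version.split("."))
    new = tuple(int(part) for part in new_version.split("."))
    if len(old) != 3 or len(new) != 3:
        raise ValueError("versions must be MAJOR.MINOR.PATCH")
    if new <= old:
        raise ValueError("new version must be greater than the old version")
    if new[0] != old[0]:
        return "full revalidation"
    if new[1] != old[1]:
        return "regression testing"
    return "smoke testing"


print(required_validation("3.2.1", "3.3.0"))
print(can_transition(Stage.STAGING, Stage.APPROVED))
```

In a real registry the transition check would also verify the approvals attached to each step; that part is left out here.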

MODEL DEPRECATION POLICY:
  - Minimum retention: 2 versions in production (current + previous for rollback)
  - Archive after: 12 months in deprecated state (or 3 versions superseding)
  - Archive storage: S3 Glacier Deep Archive ($0.00099/GB/month)
  - Total archived models: 14 (840 GB → $0.83/month)
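
The archive cost figure above is a straight multiplication of stored volume by the per-GB rate. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the calculation covers storage only (retrieval and request charges are not included).

```python
DEEP_ARCHIVE_USD_PER_GB_MONTH = 0.00099  # rate quoted in the deprecation policy


def monthly_archive_cost(total_gb: float,
                         rate: float = DEEP_ARCHIVE_USD_PER_GB_MONTH) -> float:
    """Monthly storage cost in USD, rounded to cents."""
    return round(total_gb * rate, 2)


print(monthly_archive_cost(840))  # 840 GB across the 14 archived models
```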

Model Promotion Approval Workflow

MODEL PROMOTION APPROVAL MATRIX
================================

  ┌───────────────────────────┬──────────────────┬──────────────────────┬────────────────────────┐
  │ Model Risk Tier           │ Staging → Prod   │ Required Tests       │ Approval Authority     │
  ├───────────────────────────┼──────────────────┼──────────────────────┼────────────────────────┤
  │ Tier 1 (Critical)         │ ML Review Board  │ Full validation +    │ Board + Compliance +   │
  │ Direct financial impact,  │ + Compliance     │ bias audit +         │ Legal + Business Owner │
  │ fraud, credit, insurance  │ sign-off         │ explainability test  │                        │
  │ Examples: Credit Scoring, │ + Legal review   │ + security scan      │                        │
  │ Fraud Detection           │ (5-7 business    │ + penetration test   │                        │
  │                           │ days avg.)       │ + performance bench. │                        │
  ├───────────────────────────┼──────────────────┼──────────────────────┼────────────────────────┤
  │ Tier 2 (High)             │ ML Engineer +    │ Regression test +    │ ML Lead + Business     │
  │ Customer-facing, business │ Data Scientist   │ drift analysis +     │ Stakeholder            │
  │ impact                    │ peer review +    │ shadow testing       │ (2-3 business days)    │
  │ Examples: Recommender,    │ Business Owner   │ (48-72 hours)        │                        │
  │ Churn, Pricing            │ approval         │                      │                        │
  ├───────────────────────────┼──────────────────┼──────────────────────┼────────────────────────┤
  │ Tier 3 (Standard)         │ Data Scientist + │ Basic validation +   │ Peer reviewer + ML     │
  │ Internal tools, lower     │ peer reviewer    │ smoke test +         │ Engineer               │
  │ impact                    │                  │ documentation        │ (1-2 business days)    │
  │ Examples: Internal search,│                  │                      │                        │
  │ document classification   │                  │                      │                        │
  └───────────────────────────┴──────────────────┴──────────────────────┴────────────────────────┘

STAGING VALIDATION TEST SUITE:
  1. Performance Validation:
     - Model metrics meet baseline thresholds (accuracy, precision, recall)
     - No regression vs. current production model (> 1.0% degradation blocked)
     - Statistical significance test (p-value < 0.05 for improvement claims)
     - Performance by segment (no group-specific degradation)

  2. Data Quality Validation:
     - Input data schema match (column names, types, nullability)
     - Feature value range validation (min/max within expected bounds)
     - Missing value rate < threshold (per feature, total)
     - Data distribution within drift tolerance (PSI < 0.10)

  3. Infrastructure Validation:
     - Inference latency within SLA (p95 < configured threshold)
     - Memory usage within limits (OOM check)
     - GPU utilization (if applicable, no resource contention)
     - Concurrency test (sustained throughput under load)

  4. Security Validation:
     - Model artifact integrity check (signature verification)
     - Dependency vulnerability scan (Snyk, Trivy on model Docker image)
     - No sensitive data in model artifacts (PII scan)
     - Input/output sanitization (no injection, no data leakage)

  5. Bias & Fairness Validation (Tier 1 only):
     - Disparate impact analysis across protected groups
     - Equal opportunity difference < threshold
     - Demographic parity within acceptable range
     - Model card updated with fairness metrics
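
The fairness checks in item 5 can be illustrated with a small, dependency-free sketch. It assumes binary labels and predictions (0/1) and a single privileged group; the function names and the toy data are hypothetical. Disparate impact is computed as the ratio of selection rates (a value below 0.8 is a commonly used warning level), and equal opportunity difference as the gap in true positive rates.

```python
from typing import Dict, Hashable, List, Sequence


def _rate(values: List[int]) -> float:
    """Share of 1s in a non-empty list of 0/1 values."""
    if not values:
        raise ValueError("group has no samples for this metric")
    return sum(values) / len(values)


def fairness_report(y_true: Sequence[int], y_pred: Sequence[int],
                    group: Sequence[Hashable], privileged: Hashable) -> Dict[str, float]:
    priv_pred: List[int] = []    # all predictions, privileged group
    unpriv_pred: List[int] = []  # all predictions, everyone else
    priv_pos: List[int] = []     # predictions for actual positives, privileged
    unpriv_pos: List[int] = []   # predictions for actual positives, everyone else
    for truth, pred, g in zip(y_true, y_pred, group):
        preds, pos = (priv_pred, priv_pos) if g == privileged else (unpriv_pred, unpriv_pos)
        preds.append(pred)
        if truth == 1:
            pos.append(pred)
    sr_priv, sr_unpriv = _rate(priv_pred), _rate(unpriv_pred)
    if sr_priv == 0:
        raise ValueError("privileged group has no positive predictions")
    return {
        "disparate_impact": sr_unpriv / sr_priv,
        "demographic_parity_difference": sr_unpriv - sr_priv,
        "equal_opportunity_difference": _rate(unpriv_pos) - _rate(priv_pos),
    }


# Toy data: group A is treated as privileged for the comparison.
report = fairness_report(
    y_true=[1, 1, 0, 0, 1, 1, 0, 0],
    y_pred=[1, 1, 1, 0, 1, 0, 0, 0],
    group=["A", "A", "A", "A", "B", "B", "B", "B"],
    privileged="A",
)
print(report)
```

A validation gate would compare the absolute value of each difference against the configured threshold and block promotion when it is exceeded.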

Model Monitoring & Drift Detection

Comprehensive Monitoring Dashboard

MODEL MONITORING DASHBOARD — Real-Time View
=============================================

Fraud Detection Engine v3.2.1 (TIER 1 — Critical)
  Deployment: Production (API Gateway, us-east-1 + eu-west-1)
  Uptime: 99.89% (47 minutes downtime in last 30 days)
  Last deployment: 2025-01-15 (28 days ago)

INFRASTRUCTURE METRICS:
  ┌────────────────────────┬──────────┬──────────┬──────────┬────────────┐
  │ Metric                 │ Current  │ p95      │ p99      │ SLA Target │
  ├────────────────────────┼──────────┼──────────┼──────────┼────────────┤
  │ Inference latency      │ 12 ms    │ 18 ms    │ 25 ms    │ < 50 ms    │
  │ Throughput             │ 12,400   │ 14,200   │ 16,800   │ > 10,000   │
  │                        │ req/min  │ req/min  │ req/min  │ req/min    │
  │ CPU utilization        │ 62%      │ 74%      │ 85%      │ < 80%      │
  │ Memory utilization     │ 58%      │ 68%      │ 78%      │ < 85%      │
  │ GPU utilization        │ 45%      │ 52%      │ 61%      │ < 70%      │
  │ Request error rate     │ 0.02%    │ 0.05%    │ 0.12%    │ < 0.1%     │
  └────────────────────────┴──────────┴──────────┴──────────┴────────────┘
  Status: GREEN ✓ for current and p95 values; p99 CPU utilization (85%) and p99 error rate (0.12%) exceed their SLA targets

MODEL PERFORMANCE METRICS:
  Current (last 24 hours):
    Accuracy: 97.8% | Precision: 96.2% | Recall: 98.1% | F1: 97.1% | AUC-ROC: 0.994
    
  Baseline (at deployment, 28 days ago):
    Accuracy: 97.5% | Precision: 96.0% | Recall: 97.8% | F1: 96.9% | AUC-ROC: 0.992
    
  Change: +0.3% accuracy, +0.2% precision, +0.3% recall — IMPROVING (data quality enhancement)
  Status: NO PERFORMANCE DEGRADATION ✓

  Performance by Fraud Type:
    ┌───────────────────────┬──────────┬──────────┬──────────┬────────────┐
    │ Fraud Type            │ Precision│ Recall   │ F1       │ Volume/Day │
    ├───────────────────────┼──────────┼──────────┼──────────┼────────────┤
    │ Card-not-present      │ 97.1%    │ 98.4%    │ 97.7%    │ 84,200     │
    │ Account takeover      │ 95.8%    │ 96.2%    │ 96.0%    │ 12,600     │
    │ Friendly fraud        │ 94.2%    │ 97.5%    │ 95.8%    │ 6,800      │
    │ Synthetic identity    │ 96.5%    │ 94.8%    │ 95.6%    │ 3,200      │
    │ Merchant collusion    │ 95.1%    │ 95.9%    │ 95.5%    │ 1,800      │
    └───────────────────────┴──────────┴──────────┴──────────┴────────────┘
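
The F1 column follows from precision and recall as their harmonic mean. The short sketch below recomputes it from the precision and recall values in the table; the helper name `f1_score` is chosen for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) per fraud type, copied from the table above
rows = {
    "Card-not-present": (97.1, 98.4),
    "Account takeover": (95.8, 96.2),
    "Friendly fraud": (94.2, 97.5),
    "Synthetic identity": (96.5, 94.8),
    "Merchant collusion": (95.1, 95.9),
}

for name, (precision, recall) in rows.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.1f}%")
```

Each result rounds to the F1 value shown in the table, so the three columns are mutually consistent.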

DATA DRIFT ANALYSIS:
  Population Stability Index (PSI) by feature group:
    Customer features: PSI = 0.02 (threshold: 0.10) — STABLE ✓
    Transaction features: PSI = 0.04 (threshold: 0.10) — STABLE ✓
    Temporal features: PSI = 0.01 (threshold: 0.10) — STABLE ✓
    Behavioral features: PSI = 0.03 (threshold: 0.10) — STABLE ✓
    Device features: PSI = 0.06 (threshold: 0.10) — APPROACHING ✓
    
  Input distribution shift:
    Kolmogorov-Smirnov test p-value: 0.38 (threshold: 0.05) — NO SIGNIFICANT SHIFT ✓
    
  Missing value rate:
    Current: 0.3% (baseline: 0.2%) — ACCEPTABLE (within 0.5% tolerance)
    
  New categories detected: 2 (new device types from Android 15) — LOW RISK
    Action: Logged, will be included in next training cycle
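
For reference, the PSI values above compare the binned distribution of a feature in current traffic against its training-time baseline. The following is one common way to compute it, written with the standard library only. The quantile binning, the bin count of 10, and the `eps` floor for empty bins are assumptions of this sketch, not details taken from the dashboard.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def psi(baseline: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` relative to `baseline`."""
    ordered = sorted(baseline)
    # Interior bin edges at baseline quantiles, so each bin holds about
    # 1/bins of the baseline sample.
    edges = [ordered[(len(ordered) * i) // bins] for i in range(1, bins)]

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # eps keeps empty bins from causing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    expected = fractions(baseline)
    actual = fractions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


baseline = [float(x) for x in range(1000)]
shifted = [x + 300.0 for x in baseline]
print(psi(baseline, baseline))        # identical samples give 0.0
print(psi(baseline, shifted) > 0.10)  # a large shift exceeds the 0.10 threshold
```

Categorical features need a different binning step (one bin per category), which is also where previously unseen categories such as new device types would show up.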

CONCEPT DRIFT:
  Prediction distribution PSI: 0.01 — STABLE ✓
  Label distribution shift: < 0.5% — NO SIGNIFICANT CHANGE ✓
  Performance decay: 0.0% — NO DEGRADATION ✓
  
  Calibration check (Brier score):
    Current: 0.021 (baseline: 0.022) — WELL-CALIBRATED ✓
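
The Brier score used for the calibration check is the mean squared difference between predicted probabilities and the observed 0/1 outcomes, so lower is better. A minimal sketch follows; the `calibration_drifted` helper and its default of 0.005 (the Tier 1 value from the trigger table in this document) are illustrative, and it assumes only a worsening score counts as drift.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_drifted(current: float, baseline: float,
                        threshold: float = 0.005) -> bool:
    """True when the score has worsened (increased) by more than the threshold."""
    return current - baseline > threshold


print(brier_score([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2]))
print(calibration_drifted(current=0.021, baseline=0.022))
```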

ALERTS (Last 7 Days):
  0 critical alerts
  2 warning alerts (both auto-resolved):
    - GPU memory spike to 78% (transient, resolved in 4 minutes)
    - Input latency increase to 22ms (network blip, resolved in 2 minutes)
  0 drift alerts
  0 performance degradation alerts

Drift Detection & Auto-Retraining Configuration

DRIFT DETECTION & RETRAINING TRIGGERS
=======================================

CONFIGURATION BY MODEL RISK TIER:

  ┌────────────────────────────┬────────────────┬────────────────┬─────────────────────┐
  │ Trigger Type               │ Tier 1         │ Tier 2         │ Tier 3              │
  ├────────────────────────────┼────────────────┼────────────────┼─────────────────────┤
  │ Accuracy decay             │ > 0.5%         │ > 1.0%         │ > 2.0%              │
  │ Precision decay            │ > 0.5%         │ > 1.0%         │ > 2.0%              │
  │ Recall decay               │ > 1.0%         │ > 2.0%         │ > 3.0%              │
  │ Data drift PSI             │ > 0.05         │ > 0.10         │ > 0.15              │
  │ Concept drift PSI          │ > 0.03         │ > 0.05         │ > 0.10              │
  │ Missing value increase     │ > 0.3%         │ > 0.5%         │ > 1.0%              │
  │ Input schema change        │ Immediate      │ Immediate      │ Next business day   │
  │ Scheduled retraining       │ Weekly         │ Bi-weekly      │ Monthly             │
  │ New data volume trigger    │ 50K records    │ 100K records   │ 250K records        │
  │ Calibration drift (Brier)  │ > 0.005        │ > 0.010        │ > 0.020             │
  └────────────────────────────┴────────────────┴────────────────┴─────────────────────┘
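
The numeric rows of the trigger table can be expressed as configuration plus a simple comparison. The sketch below covers only the threshold-style triggers (not schema changes, schedules, or data volume); the class and function names are hypothetical, and decay values are assumed to be in percentage points.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TriggerThresholds:
    accuracy_decay: float          # percentage points
    precision_decay: float         # percentage points
    recall_decay: float            # percentage points
    data_drift_psi: float
    concept_drift_psi: float
    missing_value_increase: float  # percentage points
    calibration_drift: float       # Brier score increase


# Values copied from the table above, keyed by risk tier.
THRESHOLDS: Dict[int, TriggerThresholds] = {
    1: TriggerThresholds(0.5, 0.5, 1.0, 0.05, 0.03, 0.3, 0.005),
    2: TriggerThresholds(1.0, 1.0, 2.0, 0.10, 0.05, 0.5, 0.010),
    3: TriggerThresholds(2.0, 2.0, 3.0, 0.15, 0.10, 1.0, 0.020),
}


def fired_triggers(tier: int, observed: Dict[str, float]) -> List[str]:
    """Names of the observed metrics that exceed the tier's threshold."""
    limits = THRESHOLDS[tier]
    return [name for name, value in observed.items()
            if value > getattr(limits, name)]


observed = {"accuracy_decay": 0.2, "data_drift_psi": 0.07}
print(fired_triggers(1, observed))  # the PSI value exceeds the Tier 1 limit
print(fired_triggers(2, observed))  # nothing fires at Tier 2
```

The same observation can therefore start a retraining run for a Tier 1 model while leaving a Tier 2 model untouched.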

RETRAINING PIPELINE — FRAUD DETECTION (Tier 1 Example):

  Phase 1: Data Collection & Validation (10-15 minutes)
    - Pull training data from data warehouse (last 12 months, 2.4M records)
    - Validate data quality:
      * Schema validation (all required columns present, correct types)
      * Null rate check (< 1% per feature)
      * Outlier detection (IQR method, flag > 3x IQR)
      * Temporal validation (no future data leakage)
    - Generate data quality report (stored with model version)
    - Compare data distribution vs. last training (PSI per feature)
    
  Phase 2: Feature Engineering (15-30 minutes)
    - Extract features from feature store (version: features-v4.2.1)
    - Apply transformations:
      * Numerical: log transform (skewed), min-max scaling (bounded)
      * Categorical: target encoding (high cardinality), one-hot (low cardinality)
      * Temporal: rolling windows (7-day, 30-day, 90-day aggregates)
      * Behavioral: session-based features (login frequency, transaction patterns)
    - Feature selection: mutual information ranking (top 45 of 120 features)
    - Feature store write: new features version (v4.2.2)
    
  Phase 3: Model Training (45 minutes - 2 hours)
    - Training framework: XGBoost (v2.0.3)
    - Hyperparameters (auto-tuned via Optuna, 50 trials):
      * max_depth: 8
      * learning_rate: 0.05
      * n_estimators: 500
      * subsample: 0.8
      * colsample_bytree: 0.8
      * reg_alpha: 0.1
      * reg_lambda: 1.0
    - Cross-validation: 5-fold time-series CV (no random shuffle)
    - Training duration: 1.2 hours on p3.2xlarge (8 vCPU, 61 GB, 1x V100 GPU)
    - Resource cost: $3.20 (on-demand) or $0.64 (spot, 80% savings)
    
  Phase 4: Validation & Comparison (10-15 minutes)
    - Holdout test set: 20% of data (last 8 weeks, temporally split)
    - Performance vs. current production model:
      * Must improve OR meet baseline (no regression allowed)
      * Statistical significance test (paired t-test, p < 0.05)
      * Segment-level validation (no group-specific regression)
    - Bias testing (Tier 1 requirement):
      * Disparate impact ratio across protected groups
      * Equal opportunity difference < 0.05
    - Model card update: new metrics, new features, known limitations
    
  Phase 5: Staging Deployment & Shadow Testing (1-2 hours)
    - Deploy to staging environment (isolated, mirrors production)
    - Shadow testing: process real traffic alongside production model
    - Compare predictions: agreement rate, disagreement analysis
    - Performance benchmarks: latency, throughput, resource usage
    - Security scan: dependency vulnerabilities, artifact integrity
    
  Phase 6: Approval & Production Promotion (1-24 hours)
    - Auto-approve if: improvement > 1.0% AND all tests pass AND no drift
    - Manual approval if: marginal improvement OR new features introduced
    - Promotion strategy: canary deployment (10% → 25% → 50% → 100%)
    - Rollback criteria: error rate > 0.5% OR accuracy drop > 0.5%
    
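  Total Pipeline Duration: about 3.3-29 hours when the phase ranges above are summed (2.3-5 hours for Phases 1-5 plus 1-24 hours for approval); typical: 3.5 hours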
  Pipeline frequency: Weekly (Sunday 2:00 AM UTC)
  Pipeline success rate: 96.2% (last 52 runs: 50 successful, 2 failed; 1 run involved a manual override)
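
The Phase 6 promotion strategy (staged canary with rollback criteria) can be sketched as a loop over traffic shares. This is an illustration under stated assumptions: `observe` is a hypothetical callback standing in for whatever monitoring query returns the canary's error rate and accuracy drop at a given traffic share, and a real rollout would also wait for a soak period at each step.

```python
from typing import Callable, Tuple

CANARY_STEPS = (10, 25, 50, 100)  # percent of traffic, from Phase 6
MAX_ERROR_RATE = 0.5              # percent
MAX_ACCURACY_DROP = 0.5           # percentage points


def run_canary(observe: Callable[[int], Tuple[float, float]]) -> Tuple[str, int]:
    """Walk through the canary steps and stop at the first rollback condition.

    `observe(percent)` returns (error_rate_pct, accuracy_drop_pts) measured
    while the candidate serves `percent` of traffic.
    """
    for percent in CANARY_STEPS:
        error_rate, accuracy_drop = observe(percent)
        if error_rate > MAX_ERROR_RATE or accuracy_drop > MAX_ACCURACY_DROP:
            return ("rolled_back", percent)
    return ("promoted", 100)


# Healthy candidate: metrics stay within limits at every step.
print(run_canary(lambda percent: (0.02, 0.0)))

# Candidate whose error rate climbs once it serves half of the traffic.
print(run_canary(lambda percent: (0.9, 0.0) if percent >= 50 else (0.02, 0.0)))
```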

RETRAINING DECISION LOG (Last 8 Weeks):
  Week 1: Retrained — accuracy +0.2%, precision +0.1% (promoted to production)
  Week 2: Retrained — no improvement (0.05% accuracy change, < 0.5% threshold) — kept existing
  Week 3: Retrained — accuracy +0.4%, new feature added (device fingerprint v2) — promoted
  Week 4: Retrained — data quality issue (null spike in geo field) — pipeline failed, retried Week 5
  Week 5: Retrained (redo) — accuracy +0.3% — promoted
  Week 6: Retrained — no improvement — kept existing (saves $3.20 training cost)
  Week 7: Retrained — accuracy +0.1%, recall +0.3% — promoted (recall improvement valuable)
  Week 8: Retrained — accuracy +0.5%, precision +0.3% — promoted (significant improvement)
  
  Models promoted: 5/8 weeks (62.5%)
  Models kept (no improvement): 2/8 weeks (25%)
  Pipeline failures: 1/8 weeks (12.5%, resolved on retry)
  Cost optimization: $6.40 saved by skipping unnecessary promotions

Feature Store Management

Enterprise Feature Store Architecture

FEATURE STORE — ENTERPRISE ARCHITECTURE
=========================================

Platform: Feast (self-managed on Kubernetes) + Redis (online) + S3 + Glue (offline)
Total Features: 342 (across all model domains)

FEATURE INVENTORY BY DOMAIN:
  ┌──────────────────────┬────────┬────────────────────────┬─────────────────────┬──────────────────┐
  │ Domain               │ Count  │ Source Systems         │ Refresh Frequency   │ Storage Type     │
  ├──────────────────────┼────────┼────────────────────────┼─────────────────────┼──────────────────┤
  │ Customer profile     │ 48     │ CRM (Salesforce), Auth │ Real-time (< 5 min) │ Online (Redis)   │
  │ Transaction history  │ 85     │ Payment processor,     │ Real-time (< 1 min) │ Online (Redis)   │
  │                      │        │ banking APIs           │                     │                  │
  │ Product catalog      │ 62     │ Catalog DB, Inventory  │ Hourly              │ Hybrid           │
  │ Behavioral signals   │ 73     │ Clickstream,           │ 5-minute window     │ Online (Redis)   │
  │                      │        │ session tracking       │                     │                  │
  │ Temporal aggregates  │ 45     │ Data warehouse         │ Daily (batch)       │ Offline (S3)     │
  │                      │        │ (Snowflake)            │                     │                  │
  │ External data        │ 29     │ Third-party APIs       │ Hourly-Daily        │ Hybrid           │
  │                      │        │ (credit bureaus,       │                     │                  │
  │                      │        │ weather, geo)          │                     │                  │
  └──────────────────────┴────────┴────────────────────────┴─────────────────────┴──────────────────┘

ONLINE FEATURE STORE (Redis Cluster):
  Nodes: 6 (3 primary + 3 replica, 3 AZs)
  Memory: 256 GB total (148 GB used, 58% utilization)
  Features stored: 288 (real-time features)
  Average read latency: 2.1 ms (p95: 4.8 ms)
  Throughput: 45,000 reads/second (peak: 78,000 reads/second)
  Eviction policy: LRU (14-day TTL for low-access features)
  Cost: $2,840/month (6x r6g.xlarge Redis Enterprise)

OFFLINE FEATURE STORE (S3 + Glue):
  Storage: 4.2 TB (342 features × historical data)
  Format: Parquet (partitioned by date, compressed with ZSTD)
  Query engine: AWS Athena / Presto on EMR
  Use cases: Model training, backtesting, feature exploration
  Cost: about $97/month (S3 Standard, assuming the $0.023/GB-month list rate for 4.2 TB) + $12/month (Athena queries)

FEATURE GOVERNANCE:
  Feature ownership:
    - Each feature has a designated owner (data engineer or data scientist)
    - Owner responsible for: data quality, freshness, documentation, deprecation
    - 342 features → 28 feature owners (avg. 12 features per owner)
    
  Feature documentation:
    - Description: What the feature represents, business meaning
    - Data type: int, float, string, boolean, enum (with allowed values)
    - Source: Original system, extraction method, transformation logic
    - Refresh: Frequency, latency, last update timestamp
    - Quality: Freshness SLA, completeness %, null rate %, accuracy %
    - Dependencies: Upstream features, source systems
    - Consumers: Models using this feature (bidirectional mapping)
    
  Feature versioning:
    - Semantic versioning: MAJOR.MINOR.PATCH
    - MAJOR: Breaking change (type change, logic change)
    - MINOR: Additive change (new derived feature)
    - PATCH: Fix (data quality correction)
    - Current: 342 active features, 88 deprecated (3-month grace period)
    
  Feature quality SLAs:
    Freshness: Real-time features < 5 min, hourly < 65 min, daily < 25 hours
    Completeness: > 99% (null rate < 1%)
    Accuracy: Validated against source system (daily spot check)
    Availability: 99.95% SLA (automated alert on outage)
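
The freshness and completeness SLAs above lend themselves to simple automated checks. The sketch below is one possible shape for them; the refresh-class keys, function names, and the use of `None` to represent a missing value are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence

# Maximum allowed age per refresh class, from the freshness SLA above.
FRESHNESS_SLA = {
    "real-time": timedelta(minutes=5),
    "hourly": timedelta(minutes=65),
    "daily": timedelta(hours=25),
}

MAX_NULL_RATE = 0.01  # completeness SLA: null rate below 1%


def is_fresh(refresh_class: str, last_updated: datetime, now: datetime) -> bool:
    """True when the feature's age is within the SLA for its refresh class."""
    return now - last_updated < FRESHNESS_SLA[refresh_class]


def is_complete(values: Sequence[Optional[float]]) -> bool:
    """True when the share of missing values is below the completeness SLA."""
    nulls = sum(1 for v in values if v is None)
    return nulls / len(values) < MAX_NULL_RATE


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_fresh("hourly", now - timedelta(minutes=64), now))
print(is_fresh("real-time", now - timedelta(minutes=6), now))
print(is_complete([1.0] * 995 + [None] * 5))
```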

FEATURE USAGE ANALYTICS:
  Most-used features (by model consumption):
    1. customer_age_days (24 models) — Customer domain
    2. transaction_count_7d (18 models) — Transaction domain
    3. device_fingerprint_hash (16 models) — Behavioral domain
    4. account_opening_date (15 models) — Customer domain
    5. avg_transaction_amount_30d (14 models) — Transaction domain
    
  Feature adoption rate: 87% (298 of 342 features used by at least one model)
  Unused features (> 90 days): 12 (flagged for deprecation review)
  Feature creation rate: 8-12 new features per month (steady growth)

Model Governance & Compliance

Model Risk Management Framework

MODEL GOVERNANCE — ENTERPRISE FRAMEWORK
=========================================

MODEL RISK CLASSIFICATION:
  ┌──────────────────────────┬────────────────────────────────────┬────────────────────┬──────────────────────┐
  │ Tier                     │ Criteria                           │ Models (Count)     │ Examples             │
  ├──────────────────────────┼────────────────────────────────────┼────────────────────┼──────────────────────┤
  │ Tier 1 — Critical        │ Direct financial impact, regulatory│ 8 (17% of total)   │ Credit Scoring,      │
  │                          │ requirement, fraud detection       │                    │ Fraud Detection,     │
  │                          │                                    │                    │ Insurance            │
  │                          │                                    │                    │ Underwriting         │
  ├──────────────────────────┼────────────────────────────────────┼────────────────────┼──────────────────────┤
  │ Tier 2 — High            │ Customer-facing decisions,         │ 16 (34% of total)  │ Product Recommender, │
  │                          │ significant business impact,       │                    │ Churn Prediction,    │
  │                          │ revenue-driving                    │                    │ Dynamic Pricing      │
  ├──────────────────────────┼────────────────────────────────────┼────────────────────┼──────────────────────┤
  │ Tier 3 — Standard        │ Internal tools, lower impact,      │ 18 (38% of total)  │ Document Classifier, │
  │                          │ no direct customer/financial impact│                    │ Internal Search,     │
  │                          │                                    │                    │ Code Recommender     │
  ├──────────────────────────┼────────────────────────────────────┼────────────────────┼──────────────────────┤
  │ Tier 4 — Experimental    │ Research, prototyping, A/B testing │ 5 (11% of total)   │ New model experiments│
  │                          │                                    │                    │                      │
  └──────────────────────────┴────────────────────────────────────┴────────────────────┴──────────────────────┘

GOVERNANCE REQUIREMENTS BY TIER:

  TIER 1 (CRITICAL):
    Model Card: Required (comprehensive, reviewed quarterly)
      - Purpose and intended use
      - Training data description and limitations
      - Performance metrics (overall + by demographic segment)
      - Known limitations and failure modes
      - Ethical considerations and risk assessment
      - Environmental impact (training compute carbon footprint)
    
    Bias Testing: Mandatory (before deployment + quarterly retest)
      - Disparate impact analysis (protected classes: race, gender, age)
      - Equal opportunity difference < 0.05
      - Calibration by group (Brier score per segment)
      - Counterfactual fairness testing
    
    Explainability: Required (SHAP + LIME + counterfactual explanations)
      - Global feature importance (top 10 features driving predictions)
      - Local explainability (per-prediction feature attribution)
      - Counterfactual explanations ("what would need to change for different outcome")
      - Explanation quality score (stability, fidelity, sparsity)
    
    Monitoring: Real-time with < 1 hour alert SLA
      - Performance metrics (accuracy, precision, recall) — 15-minute windows
      - Data drift (PSI) — hourly calculation
      - Concept drift — daily calculation
      - Fairness metrics — weekly calculation
      - Alert escalation: ML Engineer → MLOps Lead → VP Engineering → CISO
    
    Retraining: Mandatory quarterly review + event-driven
      - Quarterly: Full retraining with governance review
      - Event-driven: Drift triggers (see retraining triggers section)
      - Annual: Full model audit (independent review by risk team)
    
    Approval: ML Review Board (5 members) + Compliance Officer + Legal Counsel
      - Board: CTO, Head of Data Science, Head of ML Engineering, Head of Risk, External Advisor
      - Quorum: 4 of 5 members required for decision
      - Documentation: Board minutes, voting record, dissenting opinions
      - Timeline: 5-7 business days for review
    
    Audit Trail: Full lineage from data → feature → model → prediction
      - Data provenance: Source systems, extraction timestamp, data hash
      - Feature lineage: Transformation logic, feature store version
      - Model lineage: Training run ID, hyperparameters, evaluation results
      - Prediction lineage: Input data hash, model version, output, timestamp
      - Retention: 7 years (regulatory requirement)
    
    Regulatory Compliance:
      - ECOA (Equal Credit Opportunity Act): Adverse action notice generation
      - FCRA (Fair Credit Reporting Act): Consumer dispute handling
      - GDPR Article 22: Right not to be subject to solely automated decisions, with safeguards such as human review and an explanation of the decision
      - CCPA: Right to opt-out of automated decision-making
      - Sector-specific: GLBA (banking), HIPAA (healthcare), ERISA (employee benefit plans)

  TIER 2 (HIGH):
    Model Card: Required (standard template, reviewed annually)
    Bias Testing: Recommended (before deployment, annual retest)
    Explainability: Feature importance scores (global + local)
    Monitoring: Hourly performance checks + daily drift analysis
    Retraining: Monthly evaluation + event-driven
    Approval: ML Engineering Lead + Business Stakeholder
    Audit Trail: Standard lineage (3-year retention)

  TIER 3 (STANDARD):
    Model Card: Documentation (lightweight, updated at deployment)
    Bias Testing: Not required (unless sensitive data involved)
    Explainability: Model documentation sufficient
    Monitoring: Daily performance checks + weekly drift analysis
    Retraining: Quarterly evaluation
    Approval: Peer review + ML Engineer sign-off
    Audit Trail: Basic lineage (1-year retention)
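
The prediction lineage entry described under the Tier 1 audit trail (input data hash, model version, output, timestamp) could be recorded along the following lines. This is a sketch: the record layout and the function name are hypothetical, and the input is assumed to be JSON-serializable so that it can be hashed in a canonical form.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def lineage_record(features: Dict[str, Any], model_version: str, output: float,
                   now: Optional[datetime] = None) -> Dict[str, Any]:
    """Build one audit-trail entry for a single prediction."""
    # Canonical JSON (sorted keys, fixed separators) so that equal inputs
    # always produce the same hash regardless of key order.
    payload = json.dumps(features, sort_keys=True, separators=(",", ":"))
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "input_hash": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "model_version": model_version,
        "output": output,
        "timestamp": timestamp,
    }


record = lineage_record({"amount": 129.5, "country": "DE"}, "3.2.1", 0.93)
print(record["input_hash"][:16], record["model_version"])
```

Storing the hash rather than the raw input keeps sensitive values out of the audit log while still allowing a stored input to be verified against the record later.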

ANNUAL MODEL AUDIT PROGRAM:
  Scope: All Tier 1 models + random sample of Tier 2 models (20%)
  Auditor: Independent internal audit team (not MLOps or data science)
  Frequency: Annual (full audit) + quarterly (sampling)
  Audit areas:
    1. Model performance (actual vs. stated metrics)
    2. Data quality and provenance
    3. Bias and fairness (retest with audit methodology)
    4. Monitoring effectiveness (alert response times, resolution)
    5. Governance compliance (approvals, documentation, audit trail)
    6. Security (model artifact integrity, access controls)
    7. Business value (ROI, cost vs. benefit analysis)
  Last audit: Q4 2024
  Findings: 3 observations (all informational, no exceptions)
    1. Model card update cadence inconsistent (2 models) — remediation: automated reminder
    2. Feature deprecation timeline exceeded (1 feature) — remediation: enforced 30-day policy
    3. Shadow testing duration too short (1 model) — remediation: extended to 72 hours

Integration Points

Edge Cases