IT AI Skill

AI/ML Operations

Manage AI/ML model lifecycle including model deployment orchestration, feature store management, model monitoring and drift detection, automated retraining pipelines, model governance and compliance, A/B testing frameworks, and ML cost optimization. Use when deploying ML models to production, monitoring model performance degradation, managing feature stores, automating retraining cycles, establishing ML governance frameworks, conducting model audits, or optimizing ML infrastructure costs. Triggers on phrases like "ML operations", "MLOps", "model deployment", "model monitoring", "feature store", "model drift", "automated retraining", "ML governance", "model registry", "inference optimization", "model versioning", "ML pipeline", "model audit", "canary deployment", "model rollback".

AI/ML Operations

Deploy, monitor, govern, and optimize machine learning models at enterprise scale.

Workflow

1. Establish MLOps foundation: model registry, CI/CD pipeline, monitoring infrastructure, governance framework.
2. Design training pipeline: data ingestion, feature engineering, model training, validation, registration.
3. Implement deployment strategy: canary, A/B, shadow, or blue-green deployment with automated rollback.
4. Set up continuous monitoring: performance metrics, data drift, concept drift, infrastructure health.
5. Configure automated retraining: trigger conditions, pipeline orchestration, validation gates, approval workflow.
6. Establish governance: model cards, bias testing, explainability, audit trails, regulatory compliance.
7. Optimize costs: inference compute, feature storage, training resources, pipeline efficiency.
8. Scale operations: multi-model management, team enablement, self-service ML platform.

Model Registry & Versioning

Enterprise Model Registry Architecture

MODEL REGISTRY — ENTERPRISE ARCHITECTURE
==========================================

Registry Platform: MLflow + Custom Extensions (or SageMaker Model Registry, W&B)
Total Registered Models: 47 (across all teams)

MODEL INVENTORY:
  ┌───────────────────────────┬──────────┬────────────┬────────────┬──────────────────────┬──────────────┐
  │ Model Name                │ Version  │ Framework  │ Status     │ Deployed To          │ Owner        │
  ├───────────────────────────┼──────────┼────────────┼────────────┼──────────────────────┼──────────────┤
  │ Fraud Detection Engine    │ 3.2.1    │ XGBoost    │ Production │ API Gateway (us-east)│ Risk Team    │
  │ Customer Churn Predictor  │ 4.1.0    │ PyTorch    │ Production │ Batch Scheduler      │ Marketing    │
  │ Product Recommender       │ 2.8.3    │ TensorFlow │ Production │ Real-time API        │ Product Team │
  │ Sentiment Analyzer        │ 1.5.2    │ HuggingFace│ Production │ Streaming Pipeline   │ CX Team      │
  │ Demand Forecaster         │ 5.0.1    │ Prophet    │ Production │ Daily Batch Job      │ Supply Chain │
  │ Credit Scoring Model      │ 3.1.0    │ XGBoost    │ Staging    │ N/A (audit pending)  │ Risk Team    │
  │ Image Classifier v2       │ 2.0.0    │ PyTorch    │ Dev        │ N/A (testing)        │ Eng Team     │
  │ NLP Intent Classifier     │ 1.3.4    │ HuggingFace│ Production │ Real-time API        │ Support Team │
  │ Dynamic Pricing Engine    │ 2.2.1    │ LightGBM   │ Production │ API Gateway (global) │ Finance Team │
  │ Anomaly Detector          │ 1.1.0    │ Isolation  │ Production │ Streaming Pipeline   │ Security     │
  │ Document Classifier       │ 3.0.2    │ spaCy      │ Production │ Batch + API          │ Legal Team   │
  │ Customer Segmentation     │ 2.5.0    │ K-Means    │ Production │ Weekly Batch Job     │ Analytics    │
  └───────────────────────────┴──────────┴────────────┴────────────┴──────────────────────┴──────────────┘

VERSION CONTROL & LINEAGE:
  Each model version captures:
    - Training dataset snapshot (with data hash: SHA-256)
    - Feature store version reference
    - Full hyperparameter configuration (JSON)
    - Training environment (Docker image hash, GPU type, Python version)
    - Training duration and resource consumption
    - Evaluation metrics (accuracy, precision, recall, F1, AUC-ROC, calibration)
    - Test set performance breakdown (by segment, by feature group)
    - Model artifact (serialized model + dependencies)
    - SBOM (Software Bill of Materials for model dependencies)
    - Approval chain: data scientist → ML engineer → MLOps → governance review
    
  Version naming convention: MAJOR.MINOR.PATCH
    MAJOR: Architectural change (new model type, new features) — requires full revalidation
    MINOR: Performance improvement (hyperparameter tuning, more data) — requires regression testing
    PATCH: Bug fix (data quality fix, edge case handling) — requires smoke testing
    
  Model lifecycle states:
    Development → Staging → Approved → Production → Deprecated → Archived
    (Each transition requires specific approvals and validations)

MODEL DEPRECATION POLICY:
  - Minimum retention: 2 versions in production (current + previous for rollback)
  - Archive after: 12 months in deprecated state (or 3 versions superseding)
  - Archive storage: S3 Glacier Deep Archive ($0.00099/GB/month)
  - Total archived models: 14 (840 GB → $0.83/month)
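
The versioning convention and lifecycle states described above can be sketched in a few lines of Python. This is a minimal illustration, not the API of MLflow or any other registry: the names `REQUIRED_VALIDATION`, `ALLOWED_TRANSITIONS`, `required_validation`, and `can_transition` are hypothetical, and the sketch assumes a model may only advance one lifecycle state at a time, which the state chain suggests but does not state outright.

```python
# Hypothetical helpers; names are illustrative, not part of any registry API.
REQUIRED_VALIDATION = {
    "MAJOR": "full revalidation",
    "MINOR": "regression testing",
    "PATCH": "smoke testing",
}

LIFECYCLE = ["Development", "Staging", "Approved", "Production", "Deprecated", "Archived"]

# Assumption: each state may only advance to the next one in the chain.
ALLOWED_TRANSITIONS = dict(zip(LIFECYCLE, LIFECYCLE[1:]))


def parse_version(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def required_validation(old: str, new: str) -> str:
    """Return the validation level implied by the highest-order component that changed."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        raise ValueError(f"{new} does not supersede {old}")
    for level, before, after in zip(("MAJOR", "MINOR", "PATCH"), old_v, new_v):
        if after != before:
            return REQUIRED_VALIDATION[level]
    raise AssertionError("unreachable: versions differ")


def can_transition(current: str, target: str) -> bool:
    """True if target is the state immediately after current."""
    return ALLOWED_TRANSITIONS.get(current) == target
```

Under these assumptions, moving the Fraud Detection Engine from 3.2.1 to 3.3.0 would call for regression testing, and a Staging model would need to pass through Approved before reaching Production.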

Model Promotion Approval Workflow

MODEL PROMOTION APPROVAL MATRIX
================================

  ┌───────────────────────────┬──────────────────┬──────────────────────┬────────────────────────┐
  │ Model Risk Tier           │ Staging → Prod   │ Required Tests       │ Approval Authority     │
  ├───────────────────────────┼──────────────────┼──────────────────────┼────────────────────────┤
  │ Tier 1 (Critical)         │ ML Review Board  │ Full validation +    │ Board + Compliance +   │
  │ Direct financial impact,  │ + Compliance     │ bias audit +         │ Legal + Business Owner │
  │ fraud, credit, insurance  │ sign-off         │ explainability test  │                        │
  │ Examples: Credit Scoring, │ + Legal review   │ + security scan      │                        │
  │ Fraud Detection           │ (5-7 business    │ + penetration test   │                        │
  │                           │ days avg.)       │ + performance bench. │                        │
  ├───────────────────────────┼──────────────────┼──────────────────────┼────────────────────────┤
  │ Tier 2 (High)             │ ML Engineer +    │ Regression test +    │ ML Lead + Business     │
  │ Customer-facing, business │ Data Scientist   │ drift analysis +     │ Stakeholder            │
  │ impact                    │ peer review +    │ shadow testing       │ (2-3 business days)    │
  │ Examples: Recommender,    │ Business Owner   │ (48-72 hours)        │                        │
  │ Churn, Pricing            │ approval         │                      │                        │
  ├───────────────────────────┼──────────────────┼──────────────────────┼────────────────────────┤
  │ Tier 3 (Standard)         │ Data Scientist + │ Basic validation +   │ Peer reviewer + ML     │
  │ Internal tools, lower     │ peer reviewer    │ smoke test +         │ Engineer               │
  │ impact                    │                  │ documentation        │ (1-2 business days)    │
  │ Examples: Internal search,│                  │                      │                        │
  │ document classification   │                  │                      │                        │
  └───────────────────────────┴──────────────────┴──────────────────────┴────────────────────────┘

STAGING VALIDATION TEST SUITE:
  1. Performance Validation:
     - Model metrics meet baseline thresholds (accuracy, precision, recall)
     - No regression vs. current production model (> 1.0% degradation blocked)
     - Statistical significance test (p-value < 0.05 for improvement claims)
     - Performance by segment (no group-specific degradation)

  2. Data Quality Validation:
     - Input data schema match (column names, types, nullability)
     - Feature value range validation (min/max within expected bounds)
     - Missing value rate < threshold (per feature, total)
     - Data distribution within drift tolerance (PSI < 0.10)

  3. Infrastructure Validation:
     - Inference latency within SLA (p95 < configured threshold)
     - Memory usage within limits (OOM check)
     - GPU utilization (if applicable, no resource contention)
     - Concurrency test (sustained throughput under load)

  4. Security Validation:
     - Model artifact integrity check (signature verification)
     - Dependency vulnerability scan (Snyk, Trivy on model Docker image)
     - No sensitive data in model artifacts (PII scan)
     - Input/output sanitization (no injection, no data leakage)

  5. Bias & Fairness Validation (Tier 1 only):
     - Disparate impact analysis across protected groups
     - Equal opportunity difference < threshold
     - Demographic parity within acceptable range
     - Model card updated with fairness metrics
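
The two fairness metrics named in the Bias & Fairness step can be computed directly from predictions and labels. The sketch below is one common formulation, assuming binary labels and predictions (1 = positive) and two groups; the function names are hypothetical, and libraries such as Fairlearn or AIF360 offer maintained implementations that may differ in conventions.

```python
from typing import Sequence


def selection_rate(y_pred: Sequence[int]) -> float:
    """Share of cases that received a positive prediction."""
    return sum(y_pred) / len(y_pred)


def true_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of actual positives that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(preds_for_positives) / len(preds_for_positives)


def disparate_impact_ratio(pred_group_a: Sequence[int], pred_group_b: Sequence[int]) -> float:
    """Selection rate of group A divided by selection rate of group B."""
    return selection_rate(pred_group_a) / selection_rate(pred_group_b)


def equal_opportunity_difference(
    true_a: Sequence[int], pred_a: Sequence[int],
    true_b: Sequence[int], pred_b: Sequence[int],
) -> float:
    """True positive rate of group A minus true positive rate of group B."""
    return true_positive_rate(true_a, pred_a) - true_positive_rate(true_b, pred_b)


# Toy data, invented for illustration only.
true_a, pred_a = [1, 1, 0, 0, 1], [1, 0, 0, 0, 1]
true_b, pred_b = [1, 1, 0, 0, 1], [1, 1, 0, 1, 1]
di = disparate_impact_ratio(pred_a, pred_b)
eod = equal_opportunity_difference(true_a, pred_a, true_b, pred_b)
```

A gate would then compare `abs(eod)` against the configured threshold (0.05 is the value used for Tier 1 elsewhere in this document).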

Model Monitoring & Drift Detection

Comprehensive Monitoring Dashboard

MODEL MONITORING DASHBOARD — Real-Time View
=============================================

Fraud Detection Engine v3.2.1 (TIER 1 — Critical)
  Deployment: Production (API Gateway, us-east-1 + eu-west-1)
  Uptime: 99.89% (47 minutes downtime in last 30 days)
  Last deployment: 2025-01-15 (28 days ago)

INFRASTRUCTURE METRICS:
  ┌────────────────────────┬──────────┬──────────┬──────────┬────────────┐
  │ Metric                 │ Current  │ p95      │ p99      │ SLA Target │
  ├────────────────────────┼──────────┼──────────┼──────────┼────────────┤
  │ Inference latency      │ 12 ms    │ 18 ms    │ 25 ms    │ < 50 ms    │
  │ Throughput             │ 12,400   │ 14,200   │ 16,800   │ > 10,000   │
  │                        │ req/min  │ req/min  │ req/min  │ req/min    │
  │ CPU utilization        │ 62%      │ 74%      │ 85%      │ < 80%      │
  │ Memory utilization     │ 58%      │ 68%      │ 78%      │ < 85%      │
  │ GPU utilization        │ 45%      │ 52%      │ 61%      │ < 70%      │
  │ Request error rate     │ 0.02%    │ 0.05%    │ 0.12%    │ < 0.1%     │
  └────────────────────────┴──────────┴──────────┴──────────┴────────────┘
  Status: GREEN ✓ (current and p95 values within SLA; p99 CPU utilization at 85% and p99 error rate at 0.12% exceed their targets)

MODEL PERFORMANCE METRICS:
  Current (last 24 hours):
    Accuracy: 97.8% | Precision: 96.2% | Recall: 98.1% | F1: 97.1% | AUC-ROC: 0.994
    
  Baseline (at deployment, 28 days ago):
    Accuracy: 97.5% | Precision: 96.0% | Recall: 97.8% | F1: 96.9% | AUC-ROC: 0.992
    
  Change: +0.3% accuracy, +0.2% precision, +0.3% recall — IMPROVING (data quality enhancement)
  Status: NO PERFORMANCE DEGRADATION ✓

  Performance by Fraud Type:
    ┌───────────────────────┬──────────┬──────────┬──────────┬────────────┐
    │ Fraud Type            │ Precision│ Recall   │ F1       │ Volume/Day │
    ├───────────────────────┼──────────┼──────────┼──────────┼────────────┤
    │ Card-not-present      │ 97.1%    │ 98.4%    │ 97.7%    │ 84,200     │
    │ Account takeover      │ 95.8%    │ 96.2%    │ 96.0%    │ 12,600     │
    │ Friendly fraud        │ 94.2%    │ 97.5%    │ 95.8%    │ 6,800      │
    │ Synthetic identity    │ 96.5%    │ 94.8%    │ 95.6%    │ 3,200      │
    │ Merchant collusion    │ 95.1%    │ 95.9%    │ 95.5%    │ 1,800      │
    └───────────────────────┴──────────┴──────────┴──────────┴────────────┘
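
The F1 column in the table above follows from the precision and recall columns, since F1 is their harmonic mean. A short check, with the table values copied in as percentages:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in, same units out)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) per fraud type, taken from the table above.
BY_FRAUD_TYPE = {
    "Card-not-present": (97.1, 98.4),
    "Account takeover": (95.8, 96.2),
    "Friendly fraud": (94.2, 97.5),
    "Synthetic identity": (96.5, 94.8),
    "Merchant collusion": (95.1, 95.9),
}

f1_by_type = {name: round(f1_score(p, r), 1) for name, (p, r) in BY_FRAUD_TYPE.items()}
```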

DATA DRIFT ANALYSIS:
  Population Stability Index (PSI) by feature group:
    Customer features: PSI = 0.02 (threshold: 0.10) — STABLE ✓
    Transaction features: PSI = 0.04 (threshold: 0.10) — STABLE ✓
    Temporal features: PSI = 0.01 (threshold: 0.10) — STABLE ✓
    Behavioral features: PSI = 0.03 (threshold: 0.10) — STABLE ✓
    Device features: PSI = 0.06 (threshold: 0.10) — APPROACHING ✓
    
  Input distribution shift:
    Kolmogorov-Smirnov test p-value: 0.38 (threshold: 0.05) — NO SIGNIFICANT SHIFT ✓
    
  Missing value rate:
    Current: 0.3% (baseline: 0.2%) — ACCEPTABLE (within 0.5% tolerance)
    
  New categories detected: 2 (new device types from Android 15) — LOW RISK
    Action: Logged, will be included in next training cycle
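
For reference, the Population Stability Index reported above is commonly computed by binning a baseline sample and a current sample with the same bin edges, then summing (actual − expected) × ln(actual / expected) over the bin proportions. The sketch below is a standard-library illustration under stated assumptions: the function name `psi`, the fixed `edges`, and the `eps` floor for empty bins are choices made here, not details from the monitoring stack described in this document.

```python
import math
from typing import Sequence


def psi(
    expected: Sequence[float],
    actual: Sequence[float],
    edges: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """Population Stability Index between two samples of one numeric feature.

    `edges` must be sorted ascending; they define len(edges) + 1 bins.
    Empty bins are floored at `eps` to avoid log(0) and division by zero.
    """

    def proportions(values: Sequence[float]) -> list[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            bin_index = sum(v > e for e in edges)  # number of edges below v
            counts[bin_index] += 1
        return [max(c / len(values), eps) for c in counts]

    exp_p, act_p = proportions(expected), proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_p, act_p))


# Invented samples for illustration.
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]
quartile_edges = [0.25, 0.5, 0.75]
```

Comparing `psi(baseline, shifted, quartile_edges)` against the 0.10 threshold shown above would flag the shifted sample, while an unchanged sample yields a PSI of zero.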

CONCEPT DRIFT:
  Prediction distribution PSI: 0.01 — STABLE ✓
  Label distribution shift: < 0.5% — NO SIGNIFICANT CHANGE ✓
  Performance decay: 0.0% — NO DEGRADATION ✓
  
  Calibration check (Brier score):
    Current: 0.021 (baseline: 0.022) — WELL-CALIBRATED ✓
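
The Brier score used in the calibration check is the mean squared difference between predicted probabilities and observed binary outcomes. The sketch below also shows one way to apply a calibration-drift trigger; interpreting the trigger as an increase over the deployment baseline is an assumption, and both function names are hypothetical.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(probabilities)


def calibration_drifted(current: float, baseline: float, max_increase: float) -> bool:
    """Assumption: drift means the score rose above baseline by more than max_increase."""
    return current - baseline > max_increase
```

With the dashboard values (current 0.021, baseline 0.022) and the Tier 1 limit of 0.005, `calibration_drifted(0.021, 0.022, 0.005)` is false, which matches the WELL-CALIBRATED status.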

ALERTS (Last 7 Days):
  0 critical alerts
  2 warning alerts (both auto-resolved):
    - GPU memory spike to 78% (transient, resolved in 4 minutes)
    - Inference latency increase to 22ms (network blip, resolved in 2 minutes)
  0 drift alerts
  0 performance degradation alerts

Drift Detection & Auto-Retraining Configuration

DRIFT DETECTION & RETRAINING TRIGGERS
=======================================

CONFIGURATION BY MODEL RISK TIER:

  ┌────────────────────────────┬────────────────┬────────────────┬─────────────────────┐
  │ Trigger Type               │ Tier 1         │ Tier 2         │ Tier 3              │
  ├────────────────────────────┼────────────────┼────────────────┼─────────────────────┤
  │ Accuracy decay             │ > 0.5%         │ > 1.0%         │ > 2.0%              │
  │ Precision decay            │ > 0.5%         │ > 1.0%         │ > 2.0%              │
  │ Recall decay               │ > 1.0%         │ > 2.0%         │ > 3.0%              │
  │ Data drift PSI             │ > 0.05         │ > 0.10         │ > 0.15              │
  │ Concept drift PSI          │ > 0.03         │ > 0.05         │ > 0.10              │
  │ Missing value increase     │ > 0.3%         │ > 0.5%         │ > 1.0%              │
  │ Input schema change        │ Immediate      │ Immediate      │ Next business day   │
  │ Scheduled retraining       │ Weekly         │ Bi-weekly      │ Monthly             │
  │ New data volume trigger    │ 50K records    │ 100K records   │ 250K records        │
  │ Calibration drift (Brier)  │ > 0.005        │ > 0.010        │ > 0.020             │
  └────────────────────────────┴────────────────┴────────────────┴─────────────────────┘
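
The numeric rows of the trigger table lend themselves to a simple lookup. The sketch below encodes them and reports which triggers fire for a set of observed values; `TierThresholds`, `THRESHOLDS`, and `fired_triggers` are hypothetical names, and the schedule, schema-change, and data-volume rows are left out because they are not plain numeric comparisons.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierThresholds:
    accuracy_decay: float          # percentage points
    precision_decay: float         # percentage points
    recall_decay: float            # percentage points
    data_drift_psi: float
    concept_drift_psi: float
    missing_value_increase: float  # percentage points
    brier_drift: float


# Values copied from the trigger table above.
THRESHOLDS = {
    1: TierThresholds(0.5, 0.5, 1.0, 0.05, 0.03, 0.3, 0.005),
    2: TierThresholds(1.0, 1.0, 2.0, 0.10, 0.05, 0.5, 0.010),
    3: TierThresholds(2.0, 2.0, 3.0, 0.15, 0.10, 1.0, 0.020),
}


def fired_triggers(tier: int, observed: dict[str, float]) -> list[str]:
    """Return the names of observed metrics that exceed the tier's limit."""
    limits = THRESHOLDS[tier]
    return [name for name, value in observed.items() if value > getattr(limits, name)]


# Example reading: device-feature PSI of 0.06, as shown on the dashboard earlier.
observed = {"accuracy_decay": 0.0, "data_drift_psi": 0.06, "concept_drift_psi": 0.01}
```

With these inputs the sketch returns `["data_drift_psi"]` for Tier 1 and an empty list for Tier 2, so a PSI of 0.06 sits below the dashboard's 0.10 display threshold but above the Tier 1 retraining trigger.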

RETRAINING PIPELINE — FRAUD DETECTION (Tier 1 Example):

  Phase 1: Data Collection & Validation (10-15 minutes)
    - Pull training data from data warehouse (last 12 months, 2.4M records)
    - Validate data quality:
      * Schema validation (all required columns present, correct types)
      * Null rate check (< 1% per feature)
      * Outlier detection (IQR method, flag > 3x IQR)
      * Temporal validation (no future data leakage)
    - Generate data quality report (stored with model version)
    - Compare data distribution vs. last training (PSI per feature)
    
  Phase 2: Feature Engineering (15-30 minutes)
    - Extract features from feature store (version: features-v4.2.1)
    - Apply transformations:
      * Numerical: log transform (skewed), min-max scaling (bounded)
      * Categorical: target encoding (high cardinality), one-hot (low cardinality)
      * Temporal: rolling windows (7-day, 30-day, 90-day aggregates)
      * Behavioral: session-based features (login frequency, transaction patterns)
    - Feature selection: mutual information ranking (top 45 of 120 features)
    - Feature store write: new features version (v4.2.2)
    
  Phase 3: Model Training (45 minutes - 2 hours)
    - Training framework: XGBoost (v2.0.3)
    - Hyperparameters (auto-tuned via Optuna, 50 trials):
      * max_depth: 8
      * learning_rate: 0.05
      * n_estimators: 500
      * subsample: 0.8
      * colsample_bytree: 0.8
      * reg_alpha: 0.1
      * reg_lambda: 1.0
    - Cross-validation: 5-fold time-series CV (no random shuffle)
    - Training duration: 1.2 hours on p3.2xlarge (8 vCPU, 61 GB, 1x V100 GPU)
    - Resource cost: $3.20 (on-demand) or $0.64 (spot, 80% savings)
    
  Phase 4: Validation & Comparison (10-15 minutes)
    - Holdout test set: 20% of data (last 8 weeks, temporally split)
    - Performance vs. current production model:
      * Must improve OR meet baseline (no regression allowed)
      * Statistical significance test (paired t-test, p < 0.05)
      * Segment-level validation (no group-specific regression)
    - Bias testing (Tier 1 requirement):
      * Disparate impact ratio across protected groups
      * Equal opportunity difference < 0.05
    - Model card update: new metrics, new features, known limitations
    
  Phase 5: Staging Deployment & Shadow Testing (1-2 hours)
    - Deploy to staging environment (isolated, mirrors production)
    - Shadow testing: process real traffic alongside production model
    - Compare predictions: agreement rate, disagreement analysis
    - Performance benchmarks: latency, throughput, resource usage
    - Security scan: dependency vulnerabilities, artifact integrity
    
  Phase 6: Approval & Production Promotion (1-24 hours)
    - Auto-approve if: improvement > 1.0% AND all tests pass AND no drift
    - Manual approval if: marginal improvement OR new features introduced
    - Promotion strategy: canary deployment (10% → 25% → 50% → 100%)
    - Rollback criteria: error rate > 0.5% OR accuracy drop > 0.5%
    
  Total Pipeline Duration: 2.3-5 hours for Phases 1-5 (typical: 3.5 hours), plus 1-24 hours for Phase 6 approval and promotion
  Pipeline frequency: Weekly (Sunday 2:00 AM UTC)
  Pipeline success rate: 96.2% (last 52 runs: 50 successful, 2 failed; 1 run required a manual override)
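
The Phase 6 promotion strategy and rollback criteria can be expressed as a small control loop. This is a sketch under stated assumptions: `observe` is a hypothetical callback that returns the measured error rate (percent) and accuracy drop (percentage points) at a given traffic share, and a real rollout would also wait for a soak period at each step.

```python
from typing import Callable

CANARY_STEPS = (10, 25, 50, 100)  # percent of traffic, from the promotion strategy above
MAX_ERROR_RATE = 0.5              # percent
MAX_ACCURACY_DROP = 0.5           # percentage points


def should_rollback(error_rate: float, accuracy_drop: float) -> bool:
    """Rollback criteria: either limit exceeded."""
    return error_rate > MAX_ERROR_RATE or accuracy_drop > MAX_ACCURACY_DROP


def run_canary(observe: Callable[[int], tuple[float, float]]) -> tuple[str, int]:
    """Walk the canary steps; stop and roll back at the first step that breaches a limit.

    Returns the outcome and the traffic percentage at which it was decided.
    """
    for traffic_pct in CANARY_STEPS:
        error_rate, accuracy_drop = observe(traffic_pct)
        if should_rollback(error_rate, accuracy_drop):
            return "rolled_back", traffic_pct
    return "promoted", CANARY_STEPS[-1]
```

For example, `run_canary(lambda pct: (0.02, 0.0))` would walk all four steps and promote, while an error rate of 0.6% first seen at 25% traffic would stop the rollout there.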

RETRAINING DECISION LOG (Last 8 Weeks):
  Week 1: Retrained — accuracy +0.2%, precision +0.1% (promoted to production)
  Week 2: Retrained — no improvement (0.05% accuracy change, < 0.5% threshold) — kept existing
  Week 3: Retrained — accuracy +0.4%, new feature added (device fingerprint v2) — promoted
  Week 4: Retrained — data quality issue (null spike in geo field) — pipeline failed, retried Week 5
  Week 5: Retrained (redo) — accuracy +0.3% — promoted
  Week 6: Retrained — no improvement — kept existing
  Week 7: Retrained — accuracy +0.1%, recall +0.3% — promoted (recall improvement valuable)
  Week 8: Retrained — accuracy +0.5%, precision +0.3% — promoted (significant improvement)
  
  Models promoted: 5/8 weeks (62.5%)
  Models kept (no improvement): 2/8 weeks (25%)
  Pipeline failures: 1/8 weeks (12.5%, resolved on retry)
  Training spend on runs that were not promoted: $6.40 (2 × $3.20); this cost is incurred whether or not the candidate is promoted

Feature Store Management

Enterprise Feature Store Architecture

FEATURE STORE — ENTERPRISE ARCHITECTURE
=========================================

Platform: Feast (self-managed on Kubernetes) + Redis (online) + S3 + Glue (offline)
Total Features: 342 (across all model domains)

FEATURE INVENTORY BY DOMAIN:
  ┌──────────────────────┬────────┬────────────────────────┬─────────────────────┬──────────────────┐
  │ Domain               │ Count  │ Source Systems         │ Refresh Frequency   │ Storage Type     │
  ├──────────────────────┼────────┼────────────────────────┼─────────────────────┼──────────────────┤
  │ Customer profile     │ 48     │ CRM (Salesforce), Auth │ Real-time (< 5 min) │ Online (Redis)   │
  │ Transaction history  │ 85     │ Payment processor,     │ Real-time (< 1 min) │ Online (Redis)   │
  │                      │        │ banking APIs           │                     │                  │
  │ Product catalog      │ 62     │ Catalog DB, Inventory  │ Hourly              │ Hybrid           │
  │ Behavioral signals   │ 73     │ Clickstream,           │ 5-minute window     │ Online (Redis)   │
  │                      │        │ session tracking       │                     │                  │
  │ Temporal aggregates  │ 45     │ Data warehouse         │ Daily (batch)       │ Offline (S3)     │
  │                      │        │ (Snowflake)            │                     │                  │
  │ External data        │ 29     │ Third-party APIs       │ Hourly-Daily        │ Hybrid           │
  │                      │        │ (credit bureaus,       │                     │                  │
  │                      │        │ weather, geo)          │                     │                  │
  └──────────────────────┴────────┴────────────────────────┴─────────────────────┴──────────────────┘

ONLINE FEATURE STORE (Redis Cluster):
  Nodes: 6 (3 primary + 3 replica, 3 AZs)
  Memory: 256 GB total (148 GB used, 58% utilization)
  Features stored: 288 (real-time features)
  Average read latency: 2.1 ms (p95: 4.8 ms)
  Throughput: 45,000 reads/second (peak: 78,000 reads/second)
  Eviction policy: LRU (14-day TTL for low-access features)
  Cost: $2,840/month (6x r6g.xlarge Redis Enterprise)

OFFLINE FEATURE STORE (S3 + Glue):
  Storage: 4.2 TB (342 features × historical data)
  Format: Parquet (partitioned by date, compressed with ZSTD)
  Query engine: AWS Athena / Presto on EMR
  Use cases: Model training, backtesting, feature exploration
  Cost: $28/month (S3 Standard) + $12/month (Athena queries)

FEATURE GOVERNANCE:
  Feature ownership:
    - Each feature has a designated owner (data engineer or data scientist)
    - Owner responsible for: data quality, freshness, documentation, deprecation
    - 342 features → 28 feature owners (avg. 12 features per owner)
    
  Feature documentation:
    - Description: What the feature represents, business meaning
    - Data type: int, float, string, boolean, enum (with allowed values)
    - Source: Original system, extraction method, transformation logic
    - Refresh: Frequency, latency, last update timestamp
    - Quality: Freshness SLA, completeness %, null rate %, accuracy %
    - Dependencies: Upstream features, source systems
    - Consumers: Models using this feature (bidirectional mapping)
    
  Feature versioning:
    - Semantic versioning: MAJOR.MINOR.PATCH
    - MAJOR: Breaking change (type change, logic change)
    - MINOR: Additive change (new derived feature)
    - PATCH: Fix (data quality correction)
    - Current: 342 active features, 88 deprecated (3-month grace period)
    
  Feature quality SLAs:
    Freshness: Real-time features < 5 min, hourly < 65 min, daily < 25 hours
    Completeness: > 99% (null rate < 1%)
    Accuracy: Validated against source system (daily spot check)
    Availability: 99.95% SLA (automated alert on outage)
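
The freshness and completeness SLAs above translate into straightforward checks. The sketch below assumes each feature is tagged with a refresh class and carries a timezone-aware last-update timestamp; `FRESHNESS_SLA`, `is_fresh`, and `meets_completeness` are hypothetical names rather than Feast APIs.

```python
from datetime import datetime, timedelta, timezone

# Limits copied from the feature quality SLAs above.
FRESHNESS_SLA = {
    "real-time": timedelta(minutes=5),
    "hourly": timedelta(minutes=65),
    "daily": timedelta(hours=25),
}


def is_fresh(refresh_class: str, last_updated: datetime, now: datetime) -> bool:
    """True if the feature was updated within its freshness SLA."""
    return now - last_updated < FRESHNESS_SLA[refresh_class]


def meets_completeness(null_rate_pct: float) -> bool:
    """Completeness SLA: null rate below 1%."""
    return null_rate_pct < 1.0


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
hourly_ok = is_fresh("hourly", now - timedelta(minutes=60), now)
realtime_ok = is_fresh("real-time", now - timedelta(minutes=6), now)
```

Here the hourly feature passes (60 minutes old against a 65-minute limit) and the real-time feature fails (6 minutes old against a 5-minute limit).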

FEATURE USAGE ANALYTICS:
  Most-used features (by model consumption):
    1. customer_age_days (24 models) — Customer domain
    2. transaction_count_7d (18 models) — Transaction domain
    3. device_fingerprint_hash (16 models) — Behavioral domain
    4. account_opening_date (15 models) — Customer domain
    5. avg_transaction_amount_30d (14 models) — Transaction domain
    
  Feature adoption rate: 87% (298 of 342 features used by at least one model)
  Unused features (> 90 days): 12 (flagged for deprecation review)
  Feature creation rate: 8-12 new features per month (steady growth)

Model Governance & Compliance

Model Risk Management Framework

MODEL GOVERNANCE — ENTERPRISE FRAMEWORK
=========================================

MODEL RISK CLASSIFICATION:
  ┌──────────────────────────┬────────────────────────────────────┬────────────────────┬──────────────────────┐
  │ Tier                     │ Criteria                           │ Models (Count)     │ Examples             │
  ├──────────────────────────┼────────────────────────────────────┼────────────────────┼──────────────────────┤
  │ Tier 1 — Critical        │ Direct financial impact, regulatory│ 8 (17% of total)   │ Credit Scoring,      │
  │                          │ requirement, fraud detection       │                    │ Fraud Detection,     │
  │                          │                                    │                    │ Insurance            │
  │                          │                                    │                    │ Underwriting         │
  ├──────────────────────────┼────────────────────────────────────┼────────────────────┼──────────────────────┤
  │ Tier 2 — High            │ Customer-facing decisions,         │ 16 (34% of total)  │ Product Recommender, │
  │                          │ significant business impact,       │                    │ Churn Prediction,    │
  │                          │ revenue-driving                    │                    │ Dynamic Pricing      │
  ├──────────────────────────┼────────────────────────────────────┼────────────────────┼──────────────────────┤
  │ Tier 3 — Standard        │ Internal tools, lower impact,      │ 18 (38% of total)  │ Document Classifier, │
  │                          │ no direct customer/financial impact│                    │ Internal Search,     │
  │                          │                                    │                    │ Code Recommender     │
  ├──────────────────────────┼────────────────────────────────────┼────────────────────┼──────────────────────┤
  │ Tier 4 — Experimental    │ Research, prototyping, A/B testing │ 5 (11% of total)   │ New model experiments│
  │                          │                                    │                    │                      │
  └──────────────────────────┴────────────────────────────────────┴────────────────────┴──────────────────────┘

GOVERNANCE REQUIREMENTS BY TIER:

  TIER 1 (CRITICAL):
    Model Card: Required (comprehensive, reviewed quarterly)
      - Purpose and intended use
      - Training data description and limitations
      - Performance metrics (overall + by demographic segment)
      - Known limitations and failure modes
      - Ethical considerations and risk assessment
      - Environmental impact (training compute carbon footprint)
    
    Bias Testing: Mandatory (before deployment + quarterly retest)
      - Disparate impact analysis (protected classes: race, gender, age)
      - Equal opportunity difference < 0.05
      - Calibration by group (Brier score per segment)
      - Counterfactual fairness testing
    
    Explainability: Required (SHAP + LIME + counterfactual explanations)
      - Global feature importance (top 10 features driving predictions)
      - Local explainability (per-prediction feature attribution)
      - Counterfactual explanations ("what would need to change for different outcome")
      - Explanation quality score (stability, fidelity, sparsity)
    
    Monitoring: Real-time with < 1 hour alert SLA
      - Performance metrics (accuracy, precision, recall) — 15-minute windows
      - Data drift (PSI) — hourly calculation
      - Concept drift — daily calculation
      - Fairness metrics — weekly calculation
      - Alert escalation: ML Engineer → MLOps Lead → VP Engineering → CISO
    
    Retraining: Mandatory quarterly review + event-driven
      - Quarterly: Full retraining with governance review
      - Event-driven: Drift triggers (see retraining triggers section)
      - Annual: Full model audit (independent review by risk team)
    
    Approval: ML Review Board (5 members) + Compliance Officer + Legal Counsel
      - Board: CTO, Head of Data Science, Head of ML Engineering, Head of Risk, External Advisor
      - Quorum: 4 of 5 members required for decision
      - Documentation: Board minutes, voting record, dissenting opinions
      - Timeline: 5-7 business days for review
    
    Audit Trail: Full lineage from data → feature → model → prediction
      - Data provenance: Source systems, extraction timestamp, data hash
      - Feature lineage: Transformation logic, feature store version
      - Model lineage: Training run ID, hyperparameters, evaluation results
      - Prediction lineage: Input data hash, model version, output, timestamp
      - Retention: 7 years (regulatory requirement)
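
A prediction lineage record with the four fields listed above could look like the following. This is a minimal sketch: using SHA-256 over canonical JSON for the input hash is an assumption (the document specifies SHA-256 only for training dataset snapshots), and the function names and record layout are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Optional


def input_hash(features: dict[str, Any]) -> str:
    """SHA-256 of the features serialized as canonical JSON (sorted keys, no whitespace)."""
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def lineage_record(
    features: dict[str, Any],
    model_version: str,
    output: Any,
    timestamp: Optional[datetime] = None,
) -> dict[str, Any]:
    """Build one audit-trail entry: input data hash, model version, output, timestamp."""
    ts = timestamp or datetime.now(timezone.utc)
    return {
        "input_hash": input_hash(features),
        "model_version": model_version,
        "output": output,
        "timestamp": ts.isoformat(),
    }


record = lineage_record(
    {"amount": 120.5, "device": "android"},
    model_version="3.2.1",
    output=0.07,
    timestamp=datetime(2025, 1, 15, 9, 30, tzinfo=timezone.utc),
)
```

Sorting keys before hashing means the same feature values always produce the same hash regardless of dictionary order, which is what makes the record usable for later reconciliation.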
    
    Regulatory Compliance:
      - ECOA (Equal Credit Opportunity Act): Adverse action notice generation
      - FCRA (Fair Credit Reporting Act): Consumer dispute handling
      - GDPR Article 22: Safeguards for solely automated decisions (human intervention, right to contest; explanation per Recital 71)
      - CCPA: Right to opt-out of automated decision-making
      - Sector-specific: GLBA (financial institutions), HIPAA (healthcare), ERISA (employee benefit plans)

  TIER 2 (HIGH):
    Model Card: Required (standard template, reviewed annually)
    Bias Testing: Recommended (before deployment, annual retest)
    Explainability: Feature importance scores (global + local)
    Monitoring: Hourly performance checks + daily drift analysis
    Retraining: Monthly evaluation + event-driven
    Approval: ML Engineering Lead + Business Stakeholder
    Audit Trail: Standard lineage (3-year retention)

  TIER 3 (STANDARD):
    Model Card: Documentation (lightweight, updated at deployment)
    Bias Testing: Not required (unless sensitive data involved)
    Explainability: Model documentation sufficient
    Monitoring: Daily performance checks + weekly drift analysis
    Retraining: Quarterly evaluation
    Approval: Peer review + ML Engineer sign-off
    Audit Trail: Basic lineage (1-year retention)

ANNUAL MODEL AUDIT PROGRAM:
  Scope: All Tier 1 models + random sample of Tier 2 models (20%)
  Auditor: Independent internal audit team (not MLOps or data science)
  Frequency: Annual (full audit) + quarterly (sampling)
  Audit areas:
    1. Model performance (actual vs. stated metrics)
    2. Data quality and provenance
    3. Bias and fairness (retest with audit methodology)
    4. Monitoring effectiveness (alert response times, resolution)
    5. Governance compliance (approvals, documentation, audit trail)
    6. Security (model artifact integrity, access controls)
    7. Business value (ROI, cost vs. benefit analysis)
  Last audit: Q4 2024
  Findings: 3 observations (all informational, no exceptions)
    1. Model card update cadence inconsistent (2 models) — remediation: automated reminder
    2. Feature deprecation timeline exceeded (1 feature) — remediation: enforced 30-day policy
    3. Shadow testing duration too short (1 model) — remediation: extended to 72 hours

Integration Points

Model registries: MLflow, Weights & Biases, SageMaker Model Registry, Azure ML Registry, GCP Vertex AI Model Registry
Feature stores: Feast, Tecton, Hopsworks, AWS SageMaker Feature Store, Databricks Feature Store
Orchestration: Kubeflow Pipelines, Airflow (ML DAGs), Prefect, Dagster, Metaflow
Monitoring: Evidently AI, WhyLabs, Arize AI, Fiddler AI, Prometheus + Grafana (custom)
Explainability: SHAP, LIME, Anchor, Captum (PyTorch), AIX360 (IBM)
Bias testing: Fairlearn, AIF360 (IBM), What-If Tool (Google), Fairness Indicators (Google)
CI/CD for ML: GitHub Actions (ML pipelines), GitLab CI, Jenkins, Argo CD (model deployment)
Container: Docker, Kubernetes, EKS, GKE, AKS, SageMaker Endpoints
Data platforms: Databricks, Snowflake, BigQuery, Redshift, Delta Lake
Testing: Great Expectations (data), Deepchecks (ML), TensorBoard (training), pytest (unit)
Governance: Model cards (Google template), MLflow model signatures, custom audit trail service
Cost management: AWS Cost Explorer, Kubecost (cluster), WhyLabs cost tracking

Edge Cases

Cold start (new model domain): No historical performance baseline; deploy to shadow mode for 2-4 weeks with human-in-the-loop validation before full routing. Track human override rate as proxy for model quality.

Catastrophic data drift (black swan event): A pandemic, regulation change, or market crash leaves the training data no longer representative. Immediate actions: (1) switch to rule-based fallback, (2) alert governance team, (3) retrain with post-event data within 72 hours, (4) extend shadow testing to 2 weeks.

Multi-model cascade failure: Model A feeds predictions to Model B; Model A degrades silently. Mitigation: circuit breakers between dependent models, end-to-end validation of composite output, joint monitoring of model chains, fallback to independent model path.
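
A circuit breaker between two chained models might look like the sketch below. It assumes Model A emits a probability in [0, 1], treats anything else as unhealthy, and keeps the breaker open until an operator calls `reset()`; the class, the health rule, and the threshold of consecutive failures are illustrative choices, not details from this document.

```python
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive unhealthy upstream outputs; stays open until reset()."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.is_open = False

    def record(self, healthy: bool) -> None:
        if self.is_open:
            return  # once open, only reset() closes the breaker
        self.consecutive_failures = 0 if healthy else self.consecutive_failures + 1
        if self.consecutive_failures >= self.failure_threshold:
            self.is_open = True

    def reset(self) -> None:
        self.consecutive_failures = 0
        self.is_open = False


def predict_with_breaker(
    upstream_score: Optional[float],
    breaker: CircuitBreaker,
    dependent_model: Callable[[float], str],
    fallback_model: Callable[[], str],
) -> str:
    """Route to the independent fallback path when the upstream output is unusable."""
    healthy = upstream_score is not None and 0.0 <= upstream_score <= 1.0
    breaker.record(healthy)
    if breaker.is_open or not healthy:
        return fallback_model()
    return dependent_model(upstream_score)
```

A range check only catches outputs that are visibly wrong; silent degradation of Model A would still need the joint monitoring and end-to-end validation listed above.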

GPU shortage during training: Training pipeline queued for 48+ hours. Mitigation: spot instance bidding with checkpoint/restart, training job priority queue (Tier 1 models first), multi-cloud GPU failover (AWS → GCP → Azure), model distillation to reduce training time.

Model explanation conflicts: SHAP attributions contradict business logic. Investigation: check for data leakage (feature correlates with label spuriously), validate feature engineering (derived feature encodes answer), review model architecture (overfitting to noise). Resolution: remove problematic feature, retrain with domain constraints.

Regulatory change mid-cycle: New fairness requirement introduced. Impact assessment: (1) identify affected models (Tier 1 handling sensitive decisions), (2) gap analysis (current metrics vs. new requirement), (3) remediation plan (retrain with fairness constraint, add post-processing equalization), (4) timeline (critical path: 4-8 weeks for full compliance).

Data pipeline failure (feature store outage): Real-time features unavailable for 2 hours. Fallback strategy: (1) use cached features (stale but available), (2) switch to model without real-time features (previously trained), (3) degrade to rule-based system, (4) queue predictions for batch processing when features restore.
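
The ordered fallback strategy above maps naturally onto a chain of callables tried in sequence. In this sketch, `FeatureUnavailable` is a hypothetical exception that a strategy raises when it cannot run, and the strategy names are placeholders for whatever the serving layer actually provides.

```python
from typing import Any, Callable, Optional, Sequence


class FeatureUnavailable(Exception):
    """Raised by a strategy that cannot run with the features currently available."""


Strategy = tuple[str, Callable[[dict], Any]]


def predict_with_fallback(
    request: dict, strategies: Sequence[Strategy]
) -> tuple[str, Optional[Any]]:
    """Try each strategy in order; queue the request for batch if all of them fail."""
    for name, strategy in strategies:
        try:
            return name, strategy(request)
        except FeatureUnavailable:
            continue  # fall through to the next, more degraded strategy
    return "queued_for_batch", None
```

Returning the strategy name alongside the prediction lets downstream consumers and the audit trail see that a degraded path produced the result.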

Model poisoning (adversarial attack): Training data contaminated with malicious samples. Detection: (1) data provenance verification (source system integrity), (2) training data anomaly detection (distribution shift vs. historical), (3) canary dataset validation (known-good test set), (4) model behavior sanity check (extreme predictions on benign inputs). Prevention: (1) data validation pipeline, (2) differential privacy in training, (3) adversarial training examples, (4) model watermarking.

Inference latency spike (sudden traffic): p95 latency exceeds SLA during flash event. Response: (1) auto-scale inference replicas (Kubernetes HPA), (2) enable request queuing with priority (Tier 1 models first), (3) activate model cache for repeated inputs, (4) fallback to lighter model version (distilled) if heavy model cannot scale fast enough.

Feature store data inconsistency: Online and offline feature values diverge. Root causes: (1) different transformation logic (online uses simplified version), (2) timing mismatch (offline batch processed earlier data), (3) bug in online feature computation. Resolution: (1) point-in-time correct feature retrieval, (2) daily reconciliation job (compare online vs. offline, alert on drift), (3) unified feature computation code (same code path for online and offline).
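
A daily reconciliation job of the kind described above reduces to comparing the two stores for the same entity and timestamp. The sketch below assumes numeric feature values already fetched into dictionaries keyed by feature name; `reconcile` and its tolerance are hypothetical, and fetching point-in-time values from each store is out of scope here.

```python
import math


def reconcile(
    online: dict[str, float], offline: dict[str, float], rel_tol: float = 1e-6
) -> list[str]:
    """Return feature names that are missing from one store or whose values differ."""
    mismatched = []
    for name in sorted(online.keys() | offline.keys()):
        if name not in online or name not in offline:
            mismatched.append(name)
        elif not math.isclose(online[name], offline[name], rel_tol=rel_tol):
            mismatched.append(name)
    return mismatched


# Invented values for illustration.
online_values = {"transaction_count_7d": 14.0, "avg_transaction_amount_30d": 82.5}
offline_values = {"transaction_count_7d": 14.0, "avg_transaction_amount_30d": 80.0, "customer_age_days": 412.0}
diverged = reconcile(online_values, offline_values)
```

An alert would fire whenever the returned list is non-empty; here it contains the feature whose values differ and the one present only offline.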

Disclaimer: All rights reserved by Circulos AI. These skills are specifically designed for Claude Code, Claude Cowork, Codex, and OpenClaw. When using or referencing any skill, please provide proper attribution to Circulos AI.