IT AI Skill

MLOps Model Ops

Implement MLOps practices for machine learning lifecycle management including model training, versioning, deployment, monitoring, and retraining. Use when building ML pipelines, managing model versions, deploying models to production, monitoring model drift, or automating retraining workflows. Triggers on phrases like "MLOps", "model deployment", "model monitoring", "model drift", "feature store", "model registry", "ML pipeline", "retraining", "A/B testing models", "model versioning", "experiment tracking", "model serving", "inference", "training pipeline".

MLOps & Model Operations

Implement ML lifecycle management including training, deployment, monitoring, and automated retraining pipelines.

Workflow

1. ML Pipeline Architecture

ML END-TO-END PRODUCTION ARCHITECTURE
═══════════════════════════════════════

EXPERIMENTATION:
  → Notebooks: Jupyter, Databricks, VS Code
  → Experiment tracking: MLflow, Weights & Biases, TensorBoard
  → Hyperparameter tuning: Optuna, Ray Tune, Hyperopt

FEATURE ENGINEERING:
  → Feature store: Feast, Tecton, Hopsworks
  → Transformations: Spark, Pandas, dbt
  → Online features: Redis, DynamoDB (low-latency)
  → Offline features: Data warehouse/lake (batch)

MODEL TRAINING:
  → Framework: PyTorch, TensorFlow, XGBoost, scikit-learn
  → Orchestration: Airflow, Kubeflow, SageMaker Pipelines
  → Compute: GPU instances (A100, V100), spot for cost savings
  → Distributed training: Horovod, PyTorch DDP

MODEL REGISTRY:
  → Registry: MLflow Model Registry, SageMaker Model Registry
  → Stages: Development → Staging → Production → Archive
  → Versioning: Semantic versioning + artifact hashes
  → Approval: Manual sign-off for production promotion
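A minimal sketch of the registry conventions above, assuming a version string that joins the semantic version to a truncated SHA-256 of the artifact bytes. The function names, the stage list constant, and the 12-character truncation are illustrative choices, not part of MLflow or SageMaker.

```python
import hashlib

# Stage order taken from the registry outline above.
STAGES = ["Development", "Staging", "Production", "Archive"]


def artifact_version(artifact_bytes: bytes, semver: str) -> str:
    """Combine a semantic version with a short content hash of the artifact."""
    digest = hashlib.sha256(artifact_bytes).hexdigest()
    return f"{semver}+sha256.{digest[:12]}"


def promote(current_stage: str, approved: bool = False) -> str:
    """Move one stage forward; entering Production needs manual sign-off."""
    idx = STAGES.index(current_stage)
    if idx == len(STAGES) - 1:
        raise ValueError("model is already archived")
    next_stage = STAGES[idx + 1]
    if next_stage == "Production" and not approved:
        raise PermissionError("manual sign-off required for Production")
    return next_stage
```

Because the hash is derived from the artifact bytes, two builds that share a semantic version but differ in content get different version strings.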

MODEL SERVING:
  → Real-time: FastAPI + TorchServe / TensorFlow Serving
  → Batch: Spark ML, SageMaker Batch Transform
  → Edge: TensorFlow Lite, ONNX Runtime
  → API gateway / managed ingress: Kong, Amazon API Gateway, Cloud Run

MONITORING:
  → Data drift: Evidently AI, Alibi Detect, NannyML
  → Model performance: Custom metrics dashboard
  → Infrastructure: Prometheus + Grafana
  → Logging: Structured logs + MLflow

2. Model Training Pipeline

TRAINING PIPELINE — Customer Churn Prediction
═══════════════════════════════════════

DATA PREPARATION:
═══════════════════════════════════════

  → Training data: 18 months of customer data (n=500,000)
  → Target: churned (binary, within next 90 days)
  → Features: 45 (from feature store)
    · Account features: tenure, plan type, payment method
    · Usage features: calls/month, data usage, support tickets
    · Billing features: invoice amount, late payments, discounts
    · Engagement features: login frequency, feature adoption
    · Cohort features: acquisition channel, signup date

  → Train/validation/test split: time-based, not random (roughly 70/15/15)
  → Train on months 1-12, validate on 13-15, test on 16-18 (12/3/3 months)
  → Handle imbalance: SMOTE / class weights, applied to the training split only (churn rate: 8%)
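The time-based split and class weighting above can be sketched in plain Python. The row layout (a dict with a `month` field numbered 1-18) and the helper names are assumptions for illustration; the weight formula is the common "balanced" heuristic, n_samples / (n_classes × class_count).

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]


def time_based_split(rows: List[Row], month_key: str = "month") -> Tuple[List[Row], List[Row], List[Row]]:
    """Split by calendar month so validation and test data are strictly later than training data."""
    train = [r for r in rows if 1 <= r[month_key] <= 12]
    val = [r for r in rows if 13 <= r[month_key] <= 15]
    test = [r for r in rows if 16 <= r[month_key] <= 18]
    return train, val, test


def balanced_class_weights(labels: List[int]) -> Dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}


# Hypothetical labels matching the 8% churn rate quoted above.
weights = balanced_class_weights([1] * 8 + [0] * 92)
```

At an 8% churn rate this gives the churn class a weight of 6.25. Weights (or SMOTE) are computed from the training split only, so the validation and test sets keep the real class balance.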

EXPERIMENT TRACKING:
═══════════════════════════════════════

Run ID    Model        Params              AUC     Recall    F1      Duration
───────────────────────────────────────────────────────────────────────────
001       XGBoost      lr=0.1, depth=6     0.78    0.65      0.62    45 min
002       XGBoost      lr=0.05, depth=8    0.82    0.72      0.68    62 min
003       LightGBM     lr=0.05, n=500      0.84    0.75      0.71    55 min ← BEST
004       RandomForest n=200               0.76    0.68      0.64    120 min
005       NeuralNet    hidden=[128,64]     0.83    0.73      0.70    90 min

BEST MODEL: LightGBM (Run 003)
  → AUC-ROC: 0.84
  → Recall (churners): 0.75
  → Precision: 0.68
  → F1: 0.71
  → Calibration: Brier score 0.18 (compare with the base-rate baseline of about 0.074 at 8% churn before treating the model as well calibrated)
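A quick consistency check on the reported metrics, as a sketch with hypothetical helper names. F1 is the harmonic mean of precision and recall, and a constant forecast equal to the positive rate p has a Brier score of p × (1 − p), which is a useful baseline when reading a calibration figure.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def base_rate_brier(positive_rate: float) -> float:
    """Brier score of always predicting the positive rate: p * (1 - p)."""
    return positive_rate * (1 - positive_rate)


f1 = f1_score(precision=0.68, recall=0.75)   # rounds to 0.71, matching Run 003
baseline = base_rate_brier(0.08)             # about 0.0736 at 8% churn
```

A Brier score of 0.18 is higher than this baseline, which could happen if it was computed on resampled (for example SMOTE-balanced) data; that is an assumption worth confirming on data with the real class balance.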

MODEL VALIDATION:
═══════════════════════════════════════

  → Train/test AUC gap: 0.04 (0.88 train vs 0.84 test) — acceptable
  → Feature importance: Top 5 features account for 60% of importance
  → SHAP values: Consistent with domain knowledge
  → Fairness check: No significant disparity across customer segments
  → Stress test: Performance on edge cases (new customers, high-value)

3. Model Deployment

MODEL DEPLOYMENT STRATEGY
═══════════════════════════════════════

DEPLOYMENT APPROACH: Canary (Phased Rollout)
═══════════════════════════════════════

Phase 1: Shadow Mode (1 week)
  → New model runs alongside existing
  → Predictions logged but NOT used for decisions
  → Compare: new model predictions vs current model
  → Metrics: latency, error rate, prediction distribution

Phase 2: Canary (1 week)
  → New model serves 5% of traffic
  → Monitor: prediction accuracy, business impact
  → A/B test: retention rate for canary group vs control
  → Rollback trigger: retention improvement below 2%, or any regression

Phase 3: Gradual Rollout (2 weeks)
  → 5% → 25% → 50% → 100%
  → Checkpoint review before each traffic increase
  → Business stakeholder review at 50%

Phase 4: Full Production
  → New model at 100%
  → Old model retained for 2 weeks (rollback)
  → Monitoring: Data drift, performance decay
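One way to implement the shadow and canary phases is deterministic hash bucketing, sketched here. The bucketing scheme, the function names, and the customer ID format are assumptions; the percentages come from the rollout steps above.

```python
import hashlib
from typing import Callable, Dict, List

ROLLOUT_STEPS = [5, 25, 50, 100]  # percent of traffic on the new model


def bucket(customer_id: str) -> int:
    """Map a customer to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(customer_id: str, canary_percent: int) -> str:
    """Send a customer to the new model if their bucket is below the canary share."""
    return "new" if bucket(customer_id) < canary_percent else "current"


def shadow_predict(features: Dict, current_model: Callable, new_model: Callable, log: List[Dict]):
    """Shadow mode: serve the current model, log the new model's prediction only."""
    served = current_model(features)
    log.append({"served": served, "shadow": new_model(features)})
    return served
```

Hashing keeps each customer in the same group for the whole test, and a customer who is in the 5% canary stays on the new model at 25%, 50%, and 100%, which keeps the A/B comparison clean.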

INFRASTRUCTURE:
═══════════════════════════════════════

Container: Docker (Python 3.11 + dependencies)
Registry: ECR / GCR with image signing
Orchestrator: Kubernetes with autoscaling
API: FastAPI with async endpoints
Load balancer: ALB / Cloud Run built-in load balancing
Caching: Redis for feature caching (TTL: 1 hour)

SCALING:
═══════════════════════════════════════

  → Min instances: 2 (high availability)
  → Max instances: 20
  → Target CPU: 60%
  → Target memory: 75%
  → Request timeout: 5 seconds
  → Rate limit: 1,000 requests/minute per customer
  → Circuit breaker: Trip at 5% error rate, reset after 30s
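The circuit breaker rule above (trip at 5% error rate, reset after 30s) could look roughly like this. The sliding window of recent requests, the window size, and the rule of waiting for a full window before evaluating are assumptions added so a single early failure does not trip the breaker.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, error_threshold=0.05, reset_after_s=30.0, window=100, clock=time.monotonic):
        self.error_threshold = error_threshold
        self.reset_after_s = reset_after_s
        self.results = deque(maxlen=window)  # True marks a failed request
        self.opened_at = None                # time the breaker tripped, if open
        self.clock = clock

    def allow_request(self) -> bool:
        """Reject requests while open; close again once the reset period has passed."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after_s:
            self.opened_at = None
            self.results.clear()  # start a fresh window after reset
            return True
        return False

    def record(self, error: bool) -> None:
        """Record a request outcome and trip if the windowed error rate hits the threshold."""
        self.results.append(error)
        window_full = len(self.results) == self.results.maxlen
        if window_full and sum(self.results) / len(self.results) >= self.error_threshold:
            self.opened_at = self.clock()
```

The injectable `clock` parameter is there only to make the reset behavior testable without sleeping.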

4. Model Monitoring & Drift Detection

MODEL MONITORING DASHBOARD
═══════════════════════════════════════

INFRASTRUCTURE METRICS:
═══════════════════════════════════════

  → Request latency (P50, P95, P99): 15ms / 45ms / 120ms
  → Throughput: 500 requests/second
  → Error rate: 0.02% (target: <0.1%)
  → Instance count: 3 (autoscaling range: 2-20)
  → Uptime: 99.97% (last 30 days)

DATA DRIFT DETECTION:
═══════════════════════════════════════

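  → Feature distribution comparison (Kolmogorov-Smirnov test for numeric features; a chi-squared test suits categorical features such as plan_type)
  → Training data vs production data (rolling 7-day window)
  → Alert threshold: drift score > 0.01 on any feature (higher score = more drift)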
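For reference, a dependency-free sketch of the two-sample Kolmogorov-Smirnov comparison between a training sample and a production window. The function names are hypothetical, and the p-value uses the standard asymptotic approximation; in practice `scipy.stats.ks_2samp` would normally be used instead. Whether the statistic or the p-value is thresholded depends on how the drift score is defined in a given setup.

```python
import math
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], production: Sequence[float]) -> float:
    """Largest vertical distance between the two empirical CDFs."""
    ref, prod = sorted(reference), sorted(production)
    d = 0.0
    for x in sorted(set(ref) | set(prod)):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_prod = bisect_right(prod, x) / len(prod)
        d = max(d, abs(cdf_ref - cdf_prod))
    return d


def ks_p_value(d: float, n: int, m: int, terms: int = 100) -> float:
    """Approximate two-sided p-value from the asymptotic Kolmogorov distribution."""
    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.2:
        return 1.0  # series converges slowly here; the true value is ~1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam) for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))
```

The KS test assumes continuous or at least ordered numeric values, so categorical features need a different comparison.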

Drift detected (Week 12):
═══════════════════════════════════════

Feature              Drift Score    Status    Impact Assessment
──────────────────────────────────────────────────────────────────
tenure_months        0.008          ⚠ MEDIUM  Distribution shifted
plan_type            0.001          ✓ OK      Stable
support_tickets_30d  0.003          ⚠ LOW     Minor shift
invoice_amount       0.012          🔴 HIGH   Significant drift — investigate
login_frequency      0.002          ✓ OK      Stable

Root cause: New customer acquisition campaign brought different customer profile
Action: Schedule retraining with updated data

MODEL PERFORMANCE DECAY:
═══════════════════════════════════════

Week    AUC     Recall    Precision    F1      Trend
─────────────────────────────────────────────────────
1       0.84    0.75      0.68         0.71    Baseline
4       0.83    0.74      0.67         0.70    Stable
8       0.82    0.72      0.66         0.69    Slight decay
12      0.79    0.69      0.63         0.66    Decay accelerating ⚠
16      0.76    0.65      0.60         0.62    Below threshold 🔴

Threshold: AUC < 0.78 triggers retraining evaluation
Current: 0.76 — RETRAINING RECOMMENDED

AUTOMATED RETRAINING TRIGGER:
═══════════════════════════════════════

Conditions (any one triggers retraining):
  1. AUC drops below 0.78 (performance threshold)
  2. Data drift score > 0.01 on 3+ features
  3. Scheduled: Monthly retraining (conservative)
  4. New training data available: >10,000 new labeled records

Retraining workflow:
  → Pull latest data from feature store
  → Train candidate model with same hyperparameters
  → Validate on holdout set
  → Compare to current production model (champion vs challenger)
  → If challenger AUC exceeds champion AUC by more than 2%: promote automatically
  → Otherwise (worse, or within 2%): keep current model, log for review
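The trigger conditions and the champion vs challenger comparison can be expressed as two small functions. This is a sketch under assumptions: "monthly" is treated as 30 days, the 2% promotion margin is read as an absolute AUC difference of 0.02, and the snapshot structure and names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MonitoringSnapshot:
    auc: float
    drift_scores: Dict[str, float]
    days_since_last_training: int
    new_labeled_records: int


def retraining_reasons(s: MonitoringSnapshot, auc_threshold: float = 0.78,
                       drift_threshold: float = 0.01, min_drifted_features: int = 3,
                       schedule_days: int = 30, min_new_records: int = 10_000) -> List[str]:
    """Return every condition that currently calls for retraining (empty list = none)."""
    reasons = []
    if s.auc < auc_threshold:
        reasons.append("performance")
    drifted = [name for name, score in s.drift_scores.items() if score > drift_threshold]
    if len(drifted) >= min_drifted_features:
        reasons.append("drift")
    if s.days_since_last_training >= schedule_days:
        reasons.append("schedule")
    if s.new_labeled_records > min_new_records:
        reasons.append("new_data")
    return reasons


def promotion_decision(champion_auc: float, challenger_auc: float, min_gain: float = 0.02) -> str:
    """Promote only when the challenger clears the margin; otherwise keep the champion."""
    return "promote" if challenger_auc - champion_auc > min_gain else "keep_champion"
```

With the Week 12 drift scores and the Week 16 AUC of 0.76, only the performance condition fires: a single feature exceeds the 0.01 drift threshold, which is below the 3-feature requirement.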

5. Feature Store Management

FEATURE STORE — Feast Architecture
═══════════════════════════════════════

FEATURE DEFINITIONS:
═══════════════════════════════════════

Feature Group: customer_account
  Entity: customer_id
  Features:
    - tenure_days (int64)
    - plan_type (string)
    - payment_method (string)
    - monthly_revenue (float64)
    - late_payment_count_30d (int64)
  Update frequency: Daily
  Source: PostgreSQL (CDC via Debezium)

Feature Group: customer_usage
  Entity: customer_id
  Features:
    - calls_last_30d (float64)
    - data_usage_gb_last_30d (float64)
    - support_tickets_last_30d (int64)
    - login_count_last_7d (int64)
    - features_used_count (int64)
  Update frequency: Hourly
  Source: ClickHouse (aggregated)
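The two feature groups above can be captured in a small registry structure with a freshness check. This is a plain-Python illustration with hypothetical class and field names, not the Feast API; a real Feast repository would declare these as entities and feature views.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class FeatureGroup:
    name: str
    entity: str
    features: Dict[str, str]      # feature name -> dtype
    update_interval: timedelta    # expected refresh cadence
    source: str

    def is_stale(self, last_updated: datetime, now: datetime) -> bool:
        """A group is stale when its last update is older than its refresh cadence."""
        return now - last_updated > self.update_interval


customer_account = FeatureGroup(
    name="customer_account", entity="customer_id",
    features={"tenure_days": "int64", "plan_type": "string", "payment_method": "string",
              "monthly_revenue": "float64", "late_payment_count_30d": "int64"},
    update_interval=timedelta(days=1), source="PostgreSQL (CDC via Debezium)",
)

customer_usage = FeatureGroup(
    name="customer_usage", entity="customer_id",
    features={"calls_last_30d": "float64", "data_usage_gb_last_30d": "float64",
              "support_tickets_last_30d": "int64", "login_count_last_7d": "int64",
              "features_used_count": "int64"},
    update_interval=timedelta(hours=1), source="ClickHouse (aggregated)",
)
```

A staleness check like this gives serving code a simple way to flag hourly features that have stopped refreshing.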

ONLINE vs OFFLINE STORE:
═══════════════════════════════════════

Online Store (low-latency, serving):
  → Backend: Redis / DynamoDB
  → Latency: <10ms per feature retrieval
  → Size: ~500K customers × 45 features × 8 bytes = ~180MB (raw values, before key and store overhead)
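The online store estimate is simple arithmetic, shown here as a small helper (a hypothetical name) so the inputs can be varied. It counts raw 8-byte values only; key names, string features, and backend overhead are not included and would increase the real footprint.

```python
def raw_online_store_bytes(customers: int, features: int, bytes_per_value: int = 8) -> int:
    """Raw payload size: one fixed-width value per customer per feature."""
    return customers * features * bytes_per_value


size_bytes = raw_online_store_bytes(customers=500_000, features=45)
size_mb = size_bytes / 1_000_000  # 180.0 (decimal megabytes)
```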

Offline Store (batch, training):
  → Backend: Snowflake / BigQuery / Parquet on S3
  → Size: Historical features (18 months) = ~2TB
  → Query: Feature engineering + training data prep

FEATURE GOVERNANCE:
═══════════════════════════════════════

  → Feature registry: Name, type, description, owner, freshness
  → Dependency tracking: Feature → Model mapping
  → Point-in-time correctness: Avoid data leakage
  → Feature sharing: Reuse across models
  → Deprecation: Retire unused features after 90 days
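Point-in-time correctness means each training row only sees feature values that were known at the time of the labeled event. A minimal sketch, assuming feature history is stored per customer as a timestamp-sorted list of (timestamp, value) pairs; the function name and data layout are illustrative.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def point_in_time_value(history: List[Tuple[int, float]], event_ts: int) -> Optional[float]:
    """Return the latest feature value recorded at or before event_ts, or None if none exists."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_ts)  # count of records with ts <= event_ts
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical history for one customer: (day number, invoice_amount).
invoice_history = [(1, 10.0), (5, 20.0), (9, 30.0)]
value_at_day_4 = point_in_time_value(invoice_history, event_ts=4)  # 10.0
```

Joining on the most recent value instead (30.0 here) would leak information from after the event into training, which is the data leakage the governance rule guards against.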

Edge Cases

Multi-model pipelines: Ensemble models requiring multiple inference calls
Real-time features: Feature computation at inference time
Regulated industries: Model explainability, audit trails, regulatory approval (e.g., FDA for medical devices)
Edge deployment: Resource-constrained environments (mobile, IoT)
Multi-region: Model replication, latency optimization

Integration Points

ML frameworks: PyTorch, TensorFlow, XGBoost, scikit-learn
Orchestration: Airflow, Kubeflow, SageMaker
Feature stores: Feast, Tecton, Hopsworks
Model registries: MLflow, SageMaker Model Registry
Monitoring: Prometheus, Grafana, Evidently AI
Cloud ML: SageMaker, Vertex AI, Azure ML, Databricks MLflow

Output

MLOps Status Report

MODEL STATUS — Customer Churn Prediction
═══════════════════════════════════════

Production model: LightGBM v3.2.1 (deployed Week 8)
Current performance: AUC 0.76 (below threshold of 0.78)
Data drift: Detected on invoice_amount (HIGH)
Requests/sec: 500, Latency P95: 45ms

Action: Automated retraining pipeline triggered
  → Expected completion: 2 hours
  → Validation: Holdout set (last 6 months)
  → Promotion: Auto if AUC > 0.80

Disclaimer: All rights reserved by Circulos AI. These skills are specifically designed for Claude Code, Claude Cowork, Codex, and OpenClaw. When using or referencing any skill, please provide proper attribution to Circulos AI.