---
name: mlops-model-ops
description: Implement MLOps practices for machine learning lifecycle management including model training, versioning, deployment, monitoring, and retraining. Use when building ML pipelines, managing model versions, deploying models to production, monitoring model drift, or automating retraining workflows. Triggers on phrases like "MLOps", "model deployment", "model monitoring", "model drift", "feature store", "model registry", "ML pipeline", "retraining", "A/B testing models", "model versioning", "experiment tracking", "model serving", "inference", "training pipeline".
---

# MLOps & Model Operations

Implement ML lifecycle management including training, deployment, monitoring, and automated retraining pipelines.

## Workflow

### 1. ML Pipeline Architecture

```
END-TO-END ML PRODUCTION ARCHITECTURE
═══════════════════════════════════════

EXPERIMENTATION:
  → Notebooks: Jupyter, Databricks, VS Code
  → Experiment tracking: MLflow, Weights & Biases, TensorBoard
  → Hyperparameter tuning: Optuna, Ray Tune, Hyperopt

FEATURE ENGINEERING:
  → Feature store: Feast, Tecton, Hopsworks
  → Transformations: Spark, Pandas, dbt
  → Online features: Redis, DynamoDB (low-latency)
  → Offline features: Data warehouse/lake (batch)

MODEL TRAINING:
  → Framework: PyTorch, TensorFlow, XGBoost, scikit-learn
  → Orchestration: Airflow, Kubeflow, SageMaker Pipelines
  → Compute: GPU instances (A100, V100), spot for cost savings
  → Distributed training: Horovod, PyTorch DDP

MODEL REGISTRY:
  → Registry: MLflow Model Registry, SageMaker Model Registry
  → Stages: Development → Staging → Production → Archive
  → Versioning: Semantic versioning + artifact hashes
  → Approval: Manual sign-off for production promotion

MODEL SERVING:
  → Real-time: FastAPI + TorchServe / TensorFlow Serving
  → Batch: Spark ML, SageMaker Batch Transform
  → Edge: TensorFlow Lite, ONNX Runtime
  → API gateway / hosting: Kong, Amazon API Gateway, Cloud Run

MONITORING:
  → Data drift: Evidently AI, Alibi Detect, NannyML
  → Model performance: Custom metrics dashboard
  → Infrastructure: Prometheus + Grafana
  → Logging: Structured logs + MLflow
```
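
The registry stages and the manual sign-off rule in the outline above can be sketched as a small state machine. This is a minimal illustration, not the API of MLflow or SageMaker: `ModelVersion`, `promote`, and the strictly linear stage order are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

# Linear stage flow taken from the registry outline (assumed to be strictly linear).
NEXT_STAGE = {
    "Development": "Staging",
    "Staging": "Production",
    "Production": "Archive",
}


@dataclass
class ModelVersion:
    name: str
    version: str            # semantic version string
    artifact_sha256: str    # hash of the serialized model artifact
    stage: str = "Development"
    history: list = field(default_factory=list)


def promote(model: ModelVersion, approved_by: Optional[str] = None) -> ModelVersion:
    """Move a model to the next stage; Production requires a named approver."""
    target = NEXT_STAGE.get(model.stage)
    if target is None:
        raise ValueError(f"{model.stage} is a terminal stage")
    if target == "Production" and not approved_by:
        raise PermissionError("promotion to Production needs a manual sign-off")
    model.history.append((model.stage, target, approved_by))
    model.stage = target
    return model


# Hypothetical model name, hash, and approver, for illustration only.
candidate = ModelVersion("churn-model", "1.0.0", artifact_sha256="deadbeef")
promote(candidate)                          # Development -> Staging
promote(candidate, approved_by="ml-lead")   # Staging -> Production
print(candidate.stage, candidate.history)
```

Keeping the transition history next to the artifact hash gives a simple audit trail of who approved which artifact.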

### 2. Model Training Pipeline

```
TRAINING PIPELINE — Customer Churn Prediction
═══════════════════════════════════════

DATA PREPARATION:
═══════════════════════════════════════

  → Training data: 18 months of customer data (n=500,000)
  → Target: churned (binary, within next 90 days)
  → Features: 45 (from feature store)
    · Account features: tenure, plan type, payment method
    · Usage features: calls/month, data usage, support tickets
    · Billing features: invoice amount, late payments, discounts
    · Engagement features: login frequency, feature adoption
    · Cohort features: acquisition channel, signup date

  → Train/validation/test split: approximately 70/15/15
  → Time-based split (not random): train on months 1-12, validate on 13-15, test on 16-18
  → Handle imbalance: SMOTE / class weights (churn rate: 8%)

EXPERIMENT TRACKING:
═══════════════════════════════════════

Run ID    Model        Params              AUC     Recall    F1      Duration
───────────────────────────────────────────────────────────────────────────
001       XGBoost      lr=0.1, depth=6     0.78    0.65      0.62    45 min
002       XGBoost      lr=0.05, depth=8    0.82    0.72      0.68    62 min
003       LightGBM     lr=0.05, n=500      0.84    0.75      0.71    55 min ← BEST
004       RandomForest n=200               0.76    0.68      0.64    120 min
005       NeuralNet    hidden=[128,64]     0.83    0.73      0.70    90 min

BEST MODEL: LightGBM (Run 003)
  → AUC-ROC: 0.84
  → Recall (churners): 0.75
  → Precision: 0.68
  → F1: 0.71
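  → Calibration: Brier score 0.18; compare against the base-rate baseline (0.08 × 0.92 ≈ 0.074 at 8% churn) and recalibrate if SMOTE or class weights were used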

MODEL VALIDATION:
═══════════════════════════════════════

  → Train/test AUC gap: 0.04 (0.88 train vs 0.84 test) — acceptable
  → Feature importance: Top 5 features account for 60% of importance
  → SHAP values: Consistent with domain knowledge
  → Fairness check: No significant disparity across customer segments
  → Stress test: Performance on edge cases (new customers, high-value)
```
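
The time-based split and the class-weight option from the data preparation step can be sketched as follows. This is a minimal sketch under assumptions: each row is taken to carry a hypothetical `month` field (1 to 18), and the weights use the common "balanced" heuristic `n_samples / (n_classes * class_count)`, which may differ from what a given training framework applies.

```python
from collections import Counter


def time_based_split(rows, train_end=12, val_end=15):
    """Split rows by month so validation and test data are strictly later than training data."""
    train, val, test = [], [], []
    for row in rows:
        if row["month"] <= train_end:
            train.append(row)
        elif row["month"] <= val_end:
            val.append(row)
        else:
            test.append(row)
    return train, val, test


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


# One illustrative row per month; real data would have many rows per month.
rows = [{"customer_id": i, "month": month} for i, month in enumerate(range(1, 19))]
train, val, test = time_based_split(rows)
print(len(train), len(val), len(test))  # 12 3 3

# 8% churn rate, as in the example pipeline.
labels = [1] * 8 + [0] * 92
weights = balanced_class_weights(labels)
print(weights[1], round(weights[0], 3))  # 6.25 0.543
```

Splitting by time keeps future information out of training, which a random split would not guarantee.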

### 3. Model Deployment

```
MODEL DEPLOYMENT STRATEGY
═══════════════════════════════════════

DEPLOYMENT APPROACH: Canary (Phased Rollout)
═══════════════════════════════════════

Phase 1: Shadow Mode (1 week)
  → New model runs alongside existing
  → Predictions logged but NOT used for decisions
  → Compare: new model predictions vs current model
  → Metrics: latency, error rate, prediction distribution

Phase 2: Canary (1 week)
  → New model serves 5% of traffic
  → Monitor: prediction accuracy, business impact
  → A/B test: retention rate for canary group vs control
  → Rollback trigger: <2% improvement or any regression

Phase 3: Gradual Rollout (2 weeks)
  → 5% → 25% → 50% → 100%
  → Checkpoint review before each traffic increase
  → Business stakeholder review at 50%

Phase 4: Full Production
  → New model at 100%
  → Old model retained for 2 weeks (rollback)
  → Monitoring: Data drift, performance decay

INFRASTRUCTURE:
═══════════════════════════════════════

Container: Docker (Python 3.11 + dependencies)
Registry: ECR / GCR with image signing
Orchestrator: Kubernetes with autoscaling
API: FastAPI with async endpoints
Load balancer: ALB / Google Cloud Load Balancing
Caching: Redis for feature caching (TTL: 1 hour)

SCALING:
═══════════════════════════════════════

  → Min instances: 2 (high availability)
  → Max instances: 20
  → Target CPU: 60%
  → Target memory: 75%
  → Request timeout: 5 seconds
  → Rate limit: 1,000 requests/minute per customer
  → Circuit breaker: Trip at 5% error rate, reset after 30s
```
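
One way to implement the 5% → 25% → 50% → 100% rollout is deterministic hash bucketing, so that each customer stays on the same variant as the percentage grows. The sketch below is an assumption about how routing could work, not a description of any specific gateway; `bucket`, `route`, and the salt value are hypothetical names.

```python
import hashlib

ROLLOUT_STEPS = (5, 25, 50, 100)  # percent of traffic, from Phases 2-3


def bucket(customer_id: str, salt: str = "churn-canary") -> int:
    """Map a customer to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(customer_id: str, canary_percent: int) -> str:
    """Return which model variant should serve this customer."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return "canary" if bucket(customer_id) < canary_percent else "stable"


# Hypothetical customer ids; the canary share grows with each rollout step.
customers = [f"cust-{i}" for i in range(10_000)]
for percent in ROLLOUT_STEPS:
    share = sum(route(c, percent) == "canary" for c in customers) / len(customers)
    print(f"{percent:>3}% target -> {share:.1%} of customers on the new model")
```

Because the bucket depends only on the customer id and the salt, a customer who sees the new model at 5% keeps seeing it at 25% and beyond, which keeps the A/B comparison groups stable. Rolling back is a matter of setting the percentage to 0.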

### 4. Model Monitoring & Drift Detection

```
MODEL MONITORING DASHBOARD
═══════════════════════════════════════

INFRASTRUCTURE METRICS:
═══════════════════════════════════════

  → Request latency (P50, P95, P99): 15ms / 45ms / 120ms
  → Throughput: 500 requests/second
  → Error rate: 0.02% (target: <0.1%)
  → Instance utilization: 3 instances (of 2-20 range)
  → Uptime: 99.97% (last 30 days)

DATA DRIFT DETECTION:
═══════════════════════════════════════

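  → Feature distribution comparison (Kolmogorov-Smirnov test for numeric features; chi-squared test for categorical features such as plan_type)
  → Training data vs production data (rolling 7-day window)
  → Alert threshold: p-value < 0.01 on any feature
  → Drift score: distance between the two distributions (higher = more drift); distinct from the p-value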

Drift detected (Week 12):
═══════════════════════════════════════

Feature              Drift Score    Status    Impact Assessment
──────────────────────────────────────────────────────────────────
tenure_months        0.008          ⚠ MEDIUM  Distribution shifted
plan_type            0.001          ✓ OK      Stable
support_tickets_30d  0.003          ⚠ LOW     Minor shift
invoice_amount       0.012          🔴 HIGH   Significant drift — investigate
login_frequency      0.002          ✓ OK      Stable

Root cause: New customer acquisition campaign brought different customer profile
Action: Schedule retraining with updated data

MODEL PERFORMANCE DECAY:
═══════════════════════════════════════

Week    AUC     Recall    Precision    F1      Trend
─────────────────────────────────────────────────────
1       0.84    0.75      0.68         0.71    Baseline
4       0.83    0.74      0.67         0.70    Stable
8       0.82    0.72      0.66         0.69    Slight decay
12      0.79    0.69      0.63         0.66    Decay accelerating ⚠
16      0.76    0.65      0.60         0.62    Below threshold 🔴

Threshold: AUC < 0.78 triggers retraining evaluation
Current: 0.76 — RETRAINING RECOMMENDED

AUTOMATED RETRAINING TRIGGER:
═══════════════════════════════════════

Conditions (any triggers retraining):
  1. AUC drops below 0.78 (performance threshold)
  2. Data drift score > 0.01 on 3+ features
  3. Scheduled: Monthly retraining (conservative)
  4. New training data available: >10,000 new labeled records

Retraining workflow:
  → Pull latest data from feature store
  → Train candidate model with same hyperparameters
  → Validate on holdout set
  → Compare to current production model (champion vs challenger)
  → If challenger better by >2% AUC: promote automatically
  → Otherwise (challenger worse or gain ≤2% AUC): keep current model, log for review
```
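
The Kolmogorov-Smirnov comparison used for drift detection can be sketched in plain Python. This is a minimal sketch for numeric features only: the p-value uses the asymptotic Kolmogorov series, which is an approximation suited to large samples, and `drifted_features` is a hypothetical helper name. In practice a library routine such as SciPy's `ks_2samp` would normally be used instead.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    ref, cur = sorted(reference), sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    for x in set(ref) | set(cur):  # the gap can only change at observed values
        f_ref = bisect_right(ref, x) / len(ref)
        f_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(f_ref - f_cur))
    return gap


def ks_p_value(d, n_ref, n_cur, terms=100):
    """Asymptotic two-sided p-value (approximation for large samples)."""
    lam = math.sqrt(n_ref * n_cur / (n_ref + n_cur)) * d
    if lam < 0.3:  # the series converges slowly here and the p-value is ~1 anyway
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * total))


def drifted_features(reference, current, alpha=0.01):
    """Return {feature: (statistic, p_value)} for features with p-value below alpha."""
    flagged = {}
    for name, ref_values in reference.items():
        d = ks_statistic(ref_values, current[name])
        p = ks_p_value(d, len(ref_values), len(current[name]))
        if p < alpha:
            flagged[name] = (d, p)
    return flagged


# Synthetic values for illustration: one stable feature, one shifted feature.
training = {"login_frequency": list(range(1000)), "invoice_amount": list(range(1000))}
last_7_days = {"login_frequency": list(range(1000)),
               "invoice_amount": [v + 500 for v in range(1000)]}
print(sorted(drifted_features(training, last_7_days)))  # ['invoice_amount']
```

With very large production samples even tiny shifts produce small p-values, so it can help to alert on the statistic (effect size) together with the p-value.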

### 5. Feature Store Management

```
FEATURE STORE — Feast Architecture
═══════════════════════════════════════

FEATURE DEFINITIONS:
═══════════════════════════════════════

Feature Group: customer_account
  Entity: customer_id
  Features:
    - tenure_days (int64)
    - plan_type (string)
    - payment_method (string)
    - monthly_revenue (float64)
    - late_payment_count_30d (int64)
  Update frequency: Daily
  Source: PostgreSQL (CDC via Debezium)

Feature Group: customer_usage
  Entity: customer_id
  Features:
    - calls_last_30d (float64)
    - data_usage_gb_last_30d (float64)
    - support_tickets_last_30d (int64)
    - login_count_last_7d (int64)
    - features_used_count (int64)
  Update frequency: Hourly
  Source: ClickHouse (aggregated)

ONLINE vs OFFLINE STORE:
═══════════════════════════════════════

Online Store (low-latency, serving):
  → Backend: Redis / DynamoDB
  → Latency: <10ms per feature retrieval
  → Size: ~500K customers × 45 features × 8 bytes = ~180MB

Offline Store (batch, training):
  → Backend: Snowflake / BigQuery / Parquet on S3
  → Size: Historical features (18 months) = ~2TB
  → Query: Feature engineering + training data prep

FEATURE GOVERNANCE:
═══════════════════════════════════════

  → Feature registry: Name, type, description, owner, freshness
  → Dependency tracking: Feature → Model mapping
  → Point-in-time correctness: Avoid data leakage
  → Feature sharing: Reuse across models
  → Deprecation: Retire unused features after 90 days
```
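
Point-in-time correctness means each training row only sees feature values that were known at the time of its label event. The sketch below shows the idea with an as-of lookup; feature stores such as Feast perform this kind of join when building training sets, so this is an illustration of the concept under assumed data shapes, not Feast's implementation. The function names and sample values are hypothetical.

```python
from bisect import bisect_right
from datetime import date


def as_of_value(history, event_date):
    """Latest feature value recorded on or before event_date, else None.

    history: list of (date, value) pairs sorted by date ascending.
    """
    dates = [recorded for recorded, _ in history]
    idx = bisect_right(dates, event_date)
    return history[idx - 1][1] if idx else None


def build_training_rows(label_events, feature_history):
    """Join each (customer_id, event_date, label) to the feature value as of event_date."""
    rows = []
    for customer_id, event_date, label in label_events:
        value = as_of_value(feature_history.get(customer_id, []), event_date)
        rows.append({
            "customer_id": customer_id,
            "event_date": event_date,
            "support_tickets_last_30d": value,
            "churned": label,
        })
    return rows


# Hypothetical history for one customer (values are illustrative only).
history = {42: [(date(2024, 1, 1), 1), (date(2024, 2, 1), 4), (date(2024, 3, 1), 9)]}
rows = build_training_rows([(42, date(2024, 2, 15), 1)], history)
print(rows[0]["support_tickets_last_30d"])  # 4, not the later value 9
```

Using the March value for a February label would leak future information into training and inflate offline metrics.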

## Edge Cases

- **Multi-model pipelines**: Ensemble models requiring multiple inference calls
- **Real-time features**: Feature computation at inference time
- **Regulated industries**: Model explainability, audit trails, regulatory approval (e.g., FDA for medical devices)
- **Edge deployment**: Resource-constrained environments (mobile, IoT)
- **Multi-region**: Model replication, latency optimization

## Integration Points

- **ML frameworks**: PyTorch, TensorFlow, XGBoost, scikit-learn
- **Orchestration**: Airflow, Kubeflow, SageMaker
- **Feature stores**: Feast, Tecton, Hopsworks
- **Model registries**: MLflow, SageMaker Model Registry
- **Monitoring**: Prometheus, Grafana, Evidently AI
- **Cloud ML**: SageMaker, Vertex AI, Azure ML, Databricks

## Output

### MLOps Status Report

```
MODEL STATUS — Customer Churn Prediction
═══════════════════════════════════════

Production model: LightGBM v3.2.1 (deployed Week 8)
Current performance: AUC 0.76 (below threshold of 0.78)
Data drift: Detected on invoice_amount (HIGH)
Requests/sec: 500, Latency P95: 45ms

Action: Automated retraining pipeline triggered
  → Expected completion: 2 hours
  → Validation: Holdout set (last 6 months)
  → Promotion: Auto if AUC > 0.80
```
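
The action in this report follows from the retraining conditions listed under monitoring. A minimal sketch of that decision logic, assuming a hypothetical `ModelHealth` record and treating "monthly" as 30 days; the drift scores are the ones from the Week 12 table, while the model age and new-record count are made-up inputs:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelHealth:
    auc: float
    drift_scores: dict          # feature name -> drift score
    days_since_training: int
    new_labeled_records: int


def retraining_reasons(health, auc_floor=0.78, drift_threshold=0.01,
                       min_drifted=3, max_age_days=30, min_new_records=10_000):
    """Return the list of triggered conditions; any non-empty result means retrain."""
    reasons = []
    if health.auc < auc_floor:
        reasons.append("performance")
    drifted = [f for f, score in health.drift_scores.items() if score > drift_threshold]
    if len(drifted) >= min_drifted:
        reasons.append("drift")
    if health.days_since_training >= max_age_days:
        reasons.append("schedule")
    if health.new_labeled_records > min_new_records:
        reasons.append("new_data")
    return reasons


def should_promote(challenger_auc, min_auc=0.80):
    """Promotion rule from the status report: auto-promote only above the AUC bar."""
    return challenger_auc > min_auc


# AUC and drift scores from the example tables; the other inputs are hypothetical.
current = ModelHealth(
    auc=0.76,
    drift_scores={"tenure_months": 0.008, "plan_type": 0.001,
                  "support_tickets_30d": 0.003, "invoice_amount": 0.012,
                  "login_frequency": 0.002},
    days_since_training=12,
    new_labeled_records=4_000,
)
reasons = retraining_reasons(current)
print(reasons)  # ['performance']
```

Returning the reasons, and not only a boolean, makes it easier to log why a retraining run started.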
