MLOps Model Ops
Implement MLOps practices for machine learning lifecycle management including model training, versioning, deployment, monitoring, and retraining. Use when building ML pipelines, managing model versions, deploying models to production, monitoring model drift...
MLOps & Model Operations
Implement ML lifecycle management including training, deployment, monitoring, and automated retraining pipelines.
Workflow
1. ML Pipeline Architecture
ML END-TO-END PRODUCTION ARCHITECTURE
═══════════════════════════════════════
EXPERIMENTATION:
→ Notebooks: Jupyter, Databricks, VS Code
→ Experiment tracking: MLflow, Weights & Biases, TensorBoard
→ Hyperparameter tuning: Optuna, Ray Tune, Hyperopt
FEATURE ENGINEERING:
→ Feature store: Feast, Tecton, Hopsworks
→ Transformations: Spark, Pandas, dbt
→ Online features: Redis, DynamoDB (low-latency)
→ Offline features: Data warehouse/lake (batch)
MODEL TRAINING:
→ Framework: PyTorch, TensorFlow, XGBoost, scikit-learn
→ Orchestration: Airflow, Kubeflow, SageMaker Pipelines
→ Compute: GPU instances (A100, V100), spot for cost savings
→ Distributed training: Horovod, PyTorch DDP
MODEL REGISTRY:
→ Registry: MLflow Model Registry, SageMaker Model Registry
→ Stages: Development → Staging → Production → Archive
→ Versioning: Semantic versioning + artifact hashes
→ Approval: Manual sign-off for production promotion
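The registry rules above (ordered stages, a semantic version plus an artifact hash, manual sign-off before Production) can be sketched in a few lines. This is a minimal illustration, not the MLflow or SageMaker API: `ModelVersion`, `register`, and `promote` are hypothetical names, and the one-step-at-a-time transition rule is an assumption.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

# Stage order taken from the text; moving one stage at a time is an assumption.
STAGES = ["Development", "Staging", "Production", "Archive"]


@dataclass
class ModelVersion:
    name: str
    version: str          # semantic version, e.g. "3.2.1"
    artifact_sha256: str  # hash of the serialized model bytes
    stage: str = "Development"


def register(name: str, version: str, artifact: bytes) -> ModelVersion:
    """Create a registry record identified by version plus artifact hash."""
    digest = hashlib.sha256(artifact).hexdigest()
    return ModelVersion(name, version, digest)


def promote(model: ModelVersion, approved_by: Optional[str] = None) -> ModelVersion:
    """Move a model one stage forward; Production needs a named approver."""
    idx = STAGES.index(model.stage)
    if idx == len(STAGES) - 1:
        raise ValueError("model is already archived")
    target = STAGES[idx + 1]
    if target == "Production" and not approved_by:
        raise PermissionError("manual sign-off required for Production")
    model.stage = target
    return model
```

Storing the hash next to the version lets a deployment verify that the artifact it pulled is the one that was approved.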
MODEL SERVING:
→ Real-time: FastAPI + TorchServe / TensorFlow Serving
→ Batch: Spark ML, SageMaker Batch Transform
→ Edge: TensorFlow Lite, ONNX Runtime
→ API gateway: Kong, API Gateway, Cloud Run
MONITORING:
→ Data drift: Evidently AI, Alibi Detect, NannyML
→ Model performance: Custom metrics dashboard
→ Infrastructure: Prometheus + Grafana
→ Logging: Structured logs + MLflow
2. Model Training Pipeline
TRAINING PIPELINE — Customer Churn Prediction
═══════════════════════════════════════
DATA PREPARATION:
═══════════════════════════════════════
→ Training data: 18 months of customer data (n=500,000)
→ Target: churned (binary, within next 90 days)
→ Features: 45 (from feature store)
· Account features: tenure, plan type, payment method
· Usage features: calls/month, data usage, support tickets
· Billing features: invoice amount, late payments, discounts
· Engagement features: login frequency, feature adoption
· Cohort features: acquisition channel, signup date
→ Train/validation/test split: approximately 70/15/15 (12/3/3 months)
→ Time-based split (not random): Train on months 1-12, val on 13-15, test on 16-18
→ Handle imbalance: SMOTE / class weights (churn rate: 8%)
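A minimal sketch of the time-based split and class weighting described above, using only the standard library. The `month` and `churned` field names and the synthetic rows are assumptions for illustration; the weight formula is the common "balanced" heuristic, n_samples / (n_classes * class_count).

```python
from collections import Counter


def time_based_split(rows, month_key="month"):
    """Split by month index: 1-12 train, 13-15 validation, 16-18 test."""
    train = [r for r in rows if 1 <= r[month_key] <= 12]
    val = [r for r in rows if 13 <= r[month_key] <= 15]
    test = [r for r in rows if 16 <= r[month_key] <= 18]
    return train, val, test


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


# Synthetic data: 100 customers per month, 8 of them churners (8% churn rate).
rows = [{"month": m, "churned": int(i < 8)} for m in range(1, 19) for i in range(100)]
train, val, test = time_based_split(rows)
weights = balanced_class_weights([r["churned"] for r in train])
```

Splitting by time keeps every validation and test row later than every training row, which is what prevents future information from leaking into training.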
EXPERIMENT TRACKING:
═══════════════════════════════════════
Run ID  Model         Params            AUC   Recall  F1    Duration
───────────────────────────────────────────────────────────────────────────
001     XGBoost       lr=0.1, depth=6   0.78  0.65    0.62  45 min
002     XGBoost       lr=0.05, depth=8  0.82  0.72    0.68  62 min
003     LightGBM      lr=0.05, n=500    0.84  0.75    0.71  55 min   ← BEST
004     RandomForest  n=200             0.76  0.68    0.64  120 min
005     NeuralNet     hidden=[128,64]   0.83  0.73    0.70  90 min
BEST MODEL: LightGBM (Run 003)
→ AUC-ROC: 0.84
→ Recall (churners): 0.75
→ Precision: 0.68
→ F1: 0.71
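→ Calibration: Brier score 0.18. Always predicting the 8% base rate would score about 0.074, so check calibration (for example after SMOTE or class weighting) before using the outputs as probabilities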
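The reported metrics can be sanity-checked with two small formulas: F1 is the harmonic mean of precision and recall, and the Brier score of a constant prediction equal to the base rate p is p × (1 − p). The sketch below only reuses numbers from the text; the function names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def constant_brier(base_rate: float) -> float:
    """Brier score of always predicting the base rate: p * (1 - p)."""
    return base_rate * (1 - base_rate)


f1 = f1_score(0.68, 0.75)        # precision and recall from Run 003
baseline = constant_brier(0.08)  # 8% churn rate from the data preparation step
```

The constant-prediction score is a useful reference point when reading any reported Brier score for an imbalanced target.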
MODEL VALIDATION:
═══════════════════════════════════════
→ Train/test AUC gap: 0.04 (0.88 train vs 0.84 test) — acceptable
→ Feature importance: Top 5 features account for 60% of importance
→ SHAP values: Consistent with domain knowledge
→ Fairness check: No significant disparity across customer segments
→ Stress test: Performance on edge cases (new customers, high-value)
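Two of the validation checks above are simple enough to automate. This is a sketch under stated assumptions: the `max_gap=0.05` limit is hypothetical (the text only says a 0.04 gap is acceptable), and the importance values are made-up numbers chosen so that the top 5 sum to 60%.

```python
def auc_gap_ok(train_auc: float, test_auc: float, max_gap: float = 0.05) -> bool:
    """Flag overfitting when train AUC exceeds test AUC by more than max_gap."""
    return (train_auc - test_auc) <= max_gap


def top_k_share(importances: dict, k: int = 5) -> float:
    """Fraction of total importance carried by the k most important features."""
    total = sum(importances.values())
    top = sorted(importances.values(), reverse=True)[:k]
    return sum(top) / total


# Hypothetical importances: 5 strong features plus 8 weak ones.
importances = {f"strong_{i}": v for i, v in enumerate([20, 15, 10, 10, 5])}
importances.update({f"weak_{i}": 5 for i in range(8)})
share = top_k_share(importances)
```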
3. Model Deployment
MODEL DEPLOYMENT STRATEGY
═══════════════════════════════════════
DEPLOYMENT APPROACH: Canary (Phased Rollout)
═══════════════════════════════════════
Phase 1: Shadow Mode (1 week)
→ New model runs alongside existing
→ Predictions logged but NOT used for decisions
→ Compare: new model predictions vs current model
→ Metrics: latency, error rate, prediction distribution
Phase 2: Canary (1 week)
→ New model serves 5% of traffic
→ Monitor: prediction accuracy, business impact
→ A/B test: retention rate for canary group vs control
→ Rollback trigger: <2% improvement or any regression
Phase 3: Gradual Rollout (2 weeks)
→ 5% → 25% → 50% → 100%
→ Weekly checkpoint at each step
→ Business stakeholder review at 50%
Phase 4: Full Production
→ New model at 100%
→ Old model retained for 2 weeks (rollback)
→ Monitoring: Data drift, performance decay
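The shadow and canary phases above can be sketched as follows. Hash-based bucketing is one common way to split traffic and is an assumption here, as are the function names; it keeps each customer on the same model across requests and keeps the 5% group inside the 25% group as the rollout widens.

```python
import hashlib

ROLLOUT_STEPS = [5, 25, 50, 100]  # percent of traffic, from the phases above


def bucket(customer_id: str) -> int:
    """Map an ID to a stable bucket in [0, 100) using a hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_model(customer_id: str, canary_percent: int) -> str:
    """Route a customer to the canary when its bucket falls under the cutoff."""
    return "canary" if bucket(customer_id) < canary_percent else "stable"


def shadow_predict(features, stable_model, shadow_model, log):
    """Shadow mode: serve the stable prediction, only log the new one."""
    served = stable_model(features)
    log.append({"served": served, "shadow": shadow_model(features)})
    return served
```

Rolling back is then a configuration change: setting `canary_percent` to 0 sends all traffic to the stable model.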
INFRASTRUCTURE:
═══════════════════════════════════════
Container: Docker (Python 3.11 + dependencies)
Registry: ECR / GCR with image signing
Orchestrator: Kubernetes with autoscaling
API: FastAPI with async endpoints
Load balancer: ALB / Cloud Load Balancing
Caching: Redis for feature caching (TTL: 1 hour)
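The feature cache with a 1-hour TTL follows the cache-aside pattern. The sketch below uses an in-memory class as a stand-in for Redis so it runs without a server; `TTLCache`, `get_features`, and `load_from_store` are hypothetical names.

```python
import time


class TTLCache:
    """In-memory stand-in for a Redis cache with per-key expiry."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._data[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl)


def get_features(customer_id, cache, load_from_store):
    """Return cached features, falling back to the feature store on a miss."""
    features = cache.get(customer_id)
    if features is None:
        features = load_from_store(customer_id)
        cache.set(customer_id, features)
    return features
```

A 1-hour TTL matches the hourly update frequency of the usage features, so a cached value is never more than one refresh cycle stale.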
SCALING:
═══════════════════════════════════════
→ Min instances: 2 (high availability)
→ Max instances: 20
→ Target CPU: 60%
→ Target memory: 75%
→ Request timeout: 5 seconds
→ Rate limit: 1,000 requests/minute per customer
→ Circuit breaker: Trip at 5% error rate, reset after 30s
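A minimal circuit breaker matching the last setting above (trip at a 5% error rate, reset after 30 seconds). The 100-request sliding window is an assumption, since the text does not say over what period the error rate is measured, and the class is a sketch rather than any particular library's API.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, error_threshold=0.05, reset_seconds=30.0,
                 window=100, clock=time.monotonic):
        self.error_threshold = error_threshold
        self.reset_seconds = reset_seconds
        self.results = deque(maxlen=window)  # True marks a failed request
        self.clock = clock
        self.opened_at = None

    def allow_request(self) -> bool:
        """Reject requests while open; close again after the cool-down."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_seconds:
            self.opened_at = None
            self.results.clear()  # start measuring afresh
            return True
        return False

    def record(self, failed: bool) -> None:
        """Record one outcome and trip once the full window hits the threshold."""
        self.results.append(failed)
        full = len(self.results) == self.results.maxlen
        if full and sum(self.results) / len(self.results) >= self.error_threshold:
            self.opened_at = self.clock()
```

Waiting for a full window before evaluating avoids tripping on a single early failure.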
4. Model Monitoring & Drift Detection
MODEL MONITORING DASHBOARD
═══════════════════════════════════════
INFRASTRUCTURE METRICS:
═══════════════════════════════════════
→ Request latency (P50, P95, P99): 15ms / 45ms / 120ms
→ Throughput: 500 requests/second
→ Error rate: 0.02% (target: <0.1%)
→ Running instances: 3 (autoscaling range: 2-20)
→ Uptime: 99.97% (last 30 days)
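Latency percentiles such as P50, P95, and P99 can be computed from raw request timings. The sketch assumes the nearest-rank definition; monitoring systems often use interpolated or histogram-based estimates instead, so values may differ slightly.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds.
latencies_ms = list(range(1, 101))
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
```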
DATA DRIFT DETECTION:
═══════════════════════════════════════
→ Feature distribution comparison (Kolmogorov-Smirnov test for numeric features; categorical features such as plan_type need a test such as chi-squared)
→ Training data vs production data (rolling 7-day window)
→ Alert threshold: p-value < 0.01 on any feature
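The drift check above can be sketched without external libraries. The statistic is the standard two-sample Kolmogorov-Smirnov distance; the p-value uses the asymptotic Kolmogorov series with a small-sample correction, which is an approximation. In practice a library routine such as SciPy's `ks_2samp` would normally be used; the function names here are illustrative.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    d = 0.0
    for x in ref + cur:
        gap = abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        d = max(d, gap)
    return d


def ks_p_value(d, n, m, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    if d == 0:
        return 1.0
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * total))


def has_drifted(reference, current, alpha=0.01):
    """Alert when the p-value falls below the threshold from the text."""
    d = ks_statistic(reference, current)
    return ks_p_value(d, len(reference), len(current)) < alpha
```

Here `reference` would hold a feature's training values and `current` the values from the rolling 7-day production window.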
Drift detected (Week 12):
═══════════════════════════════════════
Feature              Drift Score  Status     Impact Assessment
──────────────────────────────────────────────────────────────────
tenure_months        0.008        ⚠ MEDIUM   Distribution shifted
plan_type            0.001        ✓ OK       Stable
support_tickets_30d  0.003        ⚠ LOW      Minor shift
invoice_amount       0.012        🔴 HIGH    Significant drift; investigate
login_frequency      0.002        ✓ OK       Stable
Note: scores rise with drift severity (0.012 is HIGH, 0.001 is OK), so this column is a distance-style score, not the K-S p-value used for alerting.
Root cause: New customer acquisition campaign brought different customer profile
Action: Schedule retraining with updated data
MODEL PERFORMANCE DECAY:
═══════════════════════════════════════
Week  AUC   Recall  Precision  F1    Trend
─────────────────────────────────────────────────────
1     0.84  0.75    0.68       0.71  Baseline
4     0.83  0.74    0.67       0.70  Stable
8     0.82  0.72    0.66       0.69  Slight decay
12    0.79  0.69    0.63       0.66  Decay accelerating ⚠
16    0.76  0.65    0.60       0.62  Below threshold 🔴
Threshold: AUC < 0.78 triggers retraining evaluation
Current: 0.76 — RETRAINING RECOMMENDED
AUTOMATED RETRAINING TRIGGER:
═══════════════════════════════════════
Conditions (any triggers retraining):
1. AUC drops below 0.78 (performance threshold)
2. Data drift score > 0.01 on 3+ features
3. Scheduled: Monthly retraining (conservative)
4. New training data available: >10,000 new labeled records
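The four conditions can be expressed as one function that returns every reason that currently applies. The thresholds come from the list above; `MonitoringSnapshot` and its field names are hypothetical, and "monthly" is interpreted as 30 days, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MonitoringSnapshot:
    auc: float
    drift_scores: dict        # feature name -> drift score
    days_since_training: int
    new_labeled_records: int


def retraining_reasons(s: MonitoringSnapshot) -> list:
    """Return the triggered conditions; any non-empty result starts retraining."""
    reasons = []
    if s.auc < 0.78:
        reasons.append("auc_below_threshold")
    if sum(score > 0.01 for score in s.drift_scores.values()) >= 3:
        reasons.append("drift_on_3_or_more_features")
    if s.days_since_training >= 30:  # "monthly" read as 30 days (assumption)
        reasons.append("scheduled_monthly")
    if s.new_labeled_records > 10_000:
        reasons.append("new_labeled_data")
    return reasons


# AUC and drift scores from the tables above; the last two values are made up.
snapshot = MonitoringSnapshot(
    auc=0.76,
    drift_scores={"tenure_months": 0.008, "plan_type": 0.001,
                  "support_tickets_30d": 0.003, "invoice_amount": 0.012,
                  "login_frequency": 0.002},
    days_since_training=10,
    new_labeled_records=2_000,
)
reasons = retraining_reasons(snapshot)
```

Returning the reasons instead of a bare boolean makes the trigger auditable in logs.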
Retraining workflow:
→ Pull latest data from feature store
→ Train candidate model with same hyperparameters
→ Validate on holdout set
→ Compare to current production model (champion vs challenger)
→ If challenger beats the current model by >2% AUC: promote automatically
→ Otherwise (smaller gain, tie, or worse): keep current model, log for review
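The champion vs challenger decision at the end of the workflow reduces to a margin check. This sketch reads "2% AUC" as an absolute gain of 0.02, which is an assumption (it could also mean a relative gain); `select_model` is a hypothetical name.

```python
def select_model(champion_auc: float, challenger_auc: float,
                 min_gain: float = 0.02) -> str:
    """Promote only when the challenger clears the required AUC margin."""
    gain = challenger_auc - champion_auc
    if gain > min_gain:
        return "promote"
    return "keep_champion"  # smaller gain, tie, or worse: log for review


decision = select_model(champion_auc=0.76, challenger_auc=0.81)
```

Both AUC values should be measured on the same holdout set, otherwise the comparison is not meaningful.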
5. Feature Store Management
FEATURE STORE — Feast Architecture
═══════════════════════════════════════
FEATURE DEFINITIONS:
═══════════════════════════════════════
Feature Group: customer_account
Entity: customer_id
Features:
- tenure_days (int64)
- plan_type (string)
- payment_method (string)
- monthly_revenue (float64)
- late_payment_count_30d (int64)
Update frequency: Daily
Source: PostgreSQL (CDC via Debezium)
Feature Group: customer_usage
Entity: customer_id
Features:
- calls_last_30d (float64)
- data_usage_gb_last_30d (float64)
- support_tickets_last_30d (int64)
- login_count_last_7d (int64)
- features_used_count (int64)
Update frequency: Hourly
Source: ClickHouse (aggregated)
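The feature group definitions can be mirrored as typed records and used to validate rows before they reach the store. This is a plain-Python sketch, not the Feast API: `FeatureGroup`, `validate`, and the dtype mapping are hypothetical.

```python
from dataclasses import dataclass

# Declared dtype -> Python type used for a basic check (assumption).
PY_TYPES = {"int64": int, "float64": float, "string": str}


@dataclass(frozen=True)
class FeatureGroup:
    name: str
    entity: str
    features: dict          # feature name -> declared dtype
    update_frequency: str
    source: str

    def validate(self, row: dict) -> list:
        """Return a list of problems found in one feature row."""
        problems = []
        for feat, dtype in self.features.items():
            if feat not in row:
                problems.append(f"missing {feat}")
            elif not isinstance(row[feat], PY_TYPES[dtype]):
                problems.append(f"{feat} is not {dtype}")
        return problems


customer_account = FeatureGroup(
    name="customer_account",
    entity="customer_id",
    features={"tenure_days": "int64", "plan_type": "string",
              "payment_method": "string", "monthly_revenue": "float64",
              "late_payment_count_30d": "int64"},
    update_frequency="daily",
    source="PostgreSQL (CDC via Debezium)",
)
```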
ONLINE vs OFFLINE STORE:
═══════════════════════════════════════
Online Store (low-latency, serving):
→ Backend: Redis / DynamoDB
→ Latency: <10ms per feature retrieval
→ Size: ~500K customers × 45 features × 8 bytes = ~180MB
Offline Store (batch, training):
→ Backend: Snowflake / BigQuery / Parquet on S3
→ Size: Historical features (18 months) = ~2TB
→ Query: Feature engineering + training data prep
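The online store estimate is straightforward arithmetic, shown here so the inputs are easy to change. It counts raw values only; key names, string-valued features, and store overhead would add to the real footprint.

```python
customers = 500_000
features = 45
bytes_per_value = 8  # one 64-bit number per feature value

online_bytes = customers * features * bytes_per_value
online_mb = online_bytes / 1_000_000  # decimal megabytes
```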
FEATURE GOVERNANCE:
═══════════════════════════════════════
→ Feature registry: Name, type, description, owner, freshness
→ Dependency tracking: Feature → Model mapping
→ Point-in-time correctness: Avoid data leakage
→ Feature sharing: Reuse across models
→ Deprecation: Retire unused features after 90 days
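Point-in-time correctness means each training row only sees feature values that were known at the label's timestamp. A minimal sketch of that lookup, assuming feature history is kept as time-sorted `(timestamp, value)` pairs; the function names and data layout are hypothetical.

```python
from bisect import bisect_right


def point_in_time_value(history, as_of):
    """Return the latest value with timestamp <= as_of, or None if none exists.

    history: list of (timestamp, value) tuples sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx else None


def build_training_rows(labels, feature_history):
    """Join each (customer_id, label_time, churned) label to its as-of feature value."""
    rows = []
    for customer_id, label_time, churned in labels:
        value = point_in_time_value(feature_history.get(customer_id, []), label_time)
        rows.append({"customer_id": customer_id, "feature": value, "churned": churned})
    return rows
```

Joining on the latest available value instead would leak information recorded after the label time into training.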
Edge Cases
- Multi-model pipelines: Ensemble models requiring multiple inference calls
- Real-time features: Feature computation at inference time
- Regulated industries: Model explainability, audit trails, FDA approval
- Edge deployment: Resource-constrained environments (mobile, IoT)
- Multi-region: Model replication, latency optimization
Integration Points
- ML frameworks: PyTorch, TensorFlow, XGBoost, scikit-learn
- Orchestration: Airflow, Kubeflow, SageMaker
- Feature stores: Feast, Tecton, Hopsworks
- Model registries: MLflow, SageMaker Model Registry
- Monitoring: Prometheus, Grafana, Evidently AI
- Cloud ML: SageMaker, Vertex AI, Azure ML, Databricks MLflow
Output
MLOps Status Report
MODEL STATUS — Customer Churn Prediction
═══════════════════════════════════════
Production model: LightGBM v3.2.1 (deployed Week 8)
Current performance: AUC 0.76 (below threshold of 0.78)
Data drift: Detected on invoice_amount (HIGH)
Requests/sec: 500, Latency P95: 45ms
Action: Automated retraining pipeline triggered
→ Expected completion: 2 hours
→ Validation: Holdout set (last 6 months)
→ Promotion: Auto if AUC > 0.80