IT AI Skill

Infrastructure Monitoring

Manage infrastructure monitoring and observability including metrics collection, log aggregation, distributed tracing, alerting and escalation, SLA/SLO tracking, capacity planning, and incident correlation. Use when monitoring system health, managing alerts...

Infrastructure Monitoring & Observability

Ensure system reliability through comprehensive monitoring, proactive alerting, and data-driven capacity planning.

Monitoring Stack & Architecture

Observability Framework

OBSERVABILITY FRAMEWORK:
════════════════════════

MONITORING STACK:
  Metrics: Datadog (primary) + Prometheus (K8s native)
  Logs: Datadog Log Management + ELK Stack (backup)
  Tracing: Datadog APM + OpenTelemetry (standard)
  Alerting: Datadog Monitors + PagerDuty (escalation)
  Dashboards: Datadog Dashboards + Grafana (custom)
  Uptime: Datadog Synthetic Monitoring + UptimeRobot (backup)

MONITORING COVERAGE:
  ┌──────────────────────────┬──────────┬─────────────────────┐
  │ Layer                    │ Count    │ Tool                │
  ├──────────────────────────┼──────────┼─────────────────────┤
  │ Cloud infrastructure     │ 115      │ Datadog AWS/Azure   │
  │   (AWS + Azure)          │          │ integration         │
  │ Kubernetes clusters      │ 8        │ Prometheus + DD     │
  │ Kubernetes pods          │ 380      │ Prometheus + DD     │
  │ VMs (Linux + Windows)    │ 165      │ Datadog Agent       │
  │ Containers (non-K8s)     │ 45       │ Datadog Agent       │
  │ Databases                │ 28       │ Datadog DB monitor  │
  │   (PostgreSQL, MySQL,    │          │                     │
  │   MongoDB, Redis)        │          │                     │
  │ Networks (VPC, load      │ 42       │ Datadog Network     │
  │   balancers, DNS)        │          │                     │
  │ Applications             │ 35       │ Datadog APM         │
  │   (microservices, APIs)  │          │                     │
  │ CDNs/WAF                 │ 5        │ Datadog Synthetics  │
  │ SaaS applications        │ 18       │ Datadog SaaS        │
  │   (Salesforce, M365,     │          │ integration         │
  │   Slack, etc.)           │          │                     │
  │ Endpoints (workstations) │ 450      │ CrowdStrike (basic) │
  │ ──────────────────────── │ ──────── │ ─────────────────── │
  │ TOTAL                    │ 1,291    │ Multiple            │
  └──────────────────────────┴──────────┴─────────────────────┘

  Monitoring depth:
    Infrastructure: CPU, memory, disk, network (30-second interval)
    Application: Response time, error rate, throughput (10-second interval)
    Database: Queries/second, connections, latency (15-second interval)
    Network: Bandwidth, packet loss, latency (30-second interval)
    Business: API calls, transactions, user activity (1-minute interval)

METRICS COLLECTION:
  Unique metric series: 85K
  Data points/minute: 2.5M
  Data points/day: 3.6B
  Storage: 90 days (hot) + 1 year (cold)
  Cost optimization:
    - Cardinality management (tag limiting)
    - Metric tiering (critical: 10s, standard: 30s, background: 5min)
    - Auto-retention (low-value metrics → cold after 30 days)
    - Monthly cost: ~$8K (Datadog) + $3K (infrastructure)
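
The daily volume follows directly from the per-minute rate, and the tiering intervals determine how many points each series emits. A minimal arithmetic sketch (the function names are illustrative, not part of any tool listed here):

```python
MINUTES_PER_DAY = 24 * 60  # 1,440

def per_day(rate_per_minute: float) -> float:
    """Scale a per-minute ingest rate to a daily volume."""
    return rate_per_minute * MINUTES_PER_DAY

def samples_per_minute(interval_seconds: float) -> float:
    """How many data points one series emits per minute at a given interval."""
    return 60 / interval_seconds

print(per_day(2.5e6))           # 3600000000.0, i.e. 3.6B data points/day
print(samples_per_minute(10))   # 6.0 for the critical tier
print(samples_per_minute(300))  # 0.2 for the background tier
```

Moving a series from the critical tier to the background tier therefore cuts its data points by a factor of 30.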

LOG MANAGEMENT:
  Total logs/minute: 180K
  Total logs/day: 260M
  Log sources: 95 (servers, containers, applications, databases, networks)
  Retention: 30 days (standard) + 1 year (security logs)
  Parsing: Structured (JSON preferred) + regex parsing (legacy)
  Indexing: Full text + structured fields
  Cost optimization:
    - Log sampling (info/debug: 10%)
    - Log filtering (exclude noisy patterns)
    - Auto-archive (logs >30 days → cold storage)
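
One way to implement the 10% info/debug sampling is a deterministic, hash-based decision, so the same log line is kept or dropped consistently across hosts. This is a sketch under assumptions: `keep_log` and the `message_id` field are hypothetical, and the actual sampling may well be configured in the log pipeline rather than in code.

```python
import hashlib

# Levels not listed here are always kept (warn, error, ...).
SAMPLE_RATES = {"debug": 0.10, "info": 0.10}

def keep_log(level: str, message_id: str) -> bool:
    """Return True if the log line should be indexed."""
    rate = SAMPLE_RATES.get(level.lower(), 1.0)
    if rate >= 1.0:
        return True
    # Map the id to a stable bucket in [0, 10000); the same id always
    # lands in the same bucket, so the decision is repeatable.
    digest = hashlib.sha256(message_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < rate * 10_000
```

Hashing an identifier instead of calling a random generator keeps related pipeline stages in agreement about which lines survive.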

DISTRIBUTED TRACING:
  Trace volume: 2.5M traces/day
  Sample rate: 10% (standard) + 100% (error traces)
  Propagation: W3C Trace Context (standard)
  Coverage: 85% of microservices (instrumented)
  Instrumentation: Datadog APM tracing libraries (Java, Python, Node.js, Go)
  End-to-end visibility: API → Service → Database → Cache
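
The propagation and sampling rules can be sketched as follows. The `traceparent` layout (version, 32-hex trace id, 16-hex parent id, 2-hex flags) is taken from W3C Trace Context version 00; the helper names and the trace-id-based sampling decision are assumptions for illustration. Keeping 100% of error traces also assumes the decision is made after the trace completes, since errors are not known at the first span.

```python
import re

TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)

def parse_traceparent(header: str):
    """Parse a version-00 style traceparent header; return None if invalid."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }

def keep_trace(trace_id: str, has_error: bool, rate: float = 0.10) -> bool:
    """Keep every error trace; keep roughly `rate` of the rest."""
    if has_error:
        return True
    # Treat the low 64 bits of the trace id as a uniform value so that
    # every service reaches the same decision for the same trace.
    return int(trace_id[-16:], 16) < rate * 2**64
```

Deriving the decision from the trace id avoids traces that are kept in one service and dropped in another.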

MONITORING HEALTH:
  Agent coverage: 99.1% (12 assets without agent — decommissioning)
  Agent uptime: 99.8% (avg. per agent)
  Data gap incidents: 0.2/day (avg. — auto-recovered)
  Alert noise: Managed (see alerting section)
  Dashboard freshness: Real-time (<30 second latency)

Alerting & Escalation

Intelligent Alerting Framework

ALERTING FRAMEWORK:
════════════════════

ALERT CLASSIFICATION:
  ┌───────────────────────┬──────────┬───────────┬───────────────────┐
  │ Severity              │ Count    │ Response  │ Escalation        │
  │                       │ (active) │ Time      │ Path              │
  ├───────────────────────┼──────────┼───────────┼───────────────────┤
  │ P1 - Critical         │ 18       │ Immediate │ L1 → L2 → L3      │
  │   (system down, data  │          │ (<5 min)  │ → CTO (30 min)    │
  │   loss)               │          │           │                   │
  │ P2 - High             │ 35       │ <15 min   │ L1 → L2 → L3      │
  │   (degraded perf,     │          │           │ (2 hours)         │
  │   service impact)     │          │           │                   │
  │ P3 - Medium           │ 62       │ <1 hour   │ L1 → L2           │
  │   (warning,           │          │           │ (4 hours)         │
  │   capacity threshold) │          │           │                   │
  │ P4 - Low              │ 45       │ <4 hours  │ L1                │
  │   (informational,     │          │           │ (next business    │
  │   advisory)           │          │           │ day)              │
  │ ───────────────────── │ ──────── │ ───────── │ ───────────────── │
  │ TOTAL                 │ 160      │           │                   │
  └───────────────────────┴──────────┴───────────┴───────────────────┘

ALERT TOPOLOGY:
  Datadog Monitors (source) → PagerDuty (routing) → On-call (response)
  
  Routing rules:
    Infrastructure alerts → Infrastructure team (rotating on-call)
    Application alerts → Application team (service owner)
    Database alerts → Database team (DBA on-call)
    Security alerts → Security team (SOC on-call)
    Multi-service alerts → SRE team (cross-functional)
  
  Escalation policy:
    Level 1: Primary on-call (15 min acknowledgment)
    Level 2: Backup on-call (if L1 unacknowledged after 15 min)
    Level 3: Team lead (if L2 unacknowledged after 15 min)
    Level 4: Manager/CTO (if L3 unacknowledged after 15 min)
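
The escalation ladder above amounts to one step every 15 unacknowledged minutes. A minimal sketch (names are illustrative; in practice the paging tool enforces this policy):

```python
ESCALATION_LEVELS = [
    "Primary on-call",
    "Backup on-call",
    "Team lead",
    "Manager/CTO",
]
ACK_TIMEOUT_MIN = 15

def current_level(minutes_unacknowledged: float) -> str:
    """Who should be paged now if nobody has acknowledged the alert."""
    step = int(minutes_unacknowledged // ACK_TIMEOUT_MIN)
    # Stay at the top level once the ladder is exhausted.
    return ESCALATION_LEVELS[min(step, len(ESCALATION_LEVELS) - 1)]

print(current_level(10))  # Primary on-call
print(current_level(15))  # Backup on-call
print(current_level(50))  # Manager/CTO
```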
  
  On-call rotation:
    Frequency: Weekly rotation
    Duration: 24 hours/day (7 days)
    Max consecutive: 2 weeks (with 4-week break)
    Handoff: 30-minute overlap (call + documentation)
    Compensation: On-call stipend ($50/week) + PTO backup

ALERT QUALITY METRICS (January 2025):
  Total alerts: 1,240 (40.0/day avg.)
  Actionable alerts: 985 (79.4% — good)
  False positives: 255 (20.6% — target: <15%)
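
To make the gap to the target concrete, the sketch below computes the false positive rate and how many false positives would have to be eliminated to get under 15%. It assumes that tuning removes a false positive from the total alert count as well; the function names are illustrative.

```python
def false_positive_rate(total: int, false_positives: int) -> float:
    return false_positives / total

def false_positives_to_remove(total: int, false_positives: int, target: float) -> int:
    """Smallest number of false positives to eliminate so the rate is below target."""
    removed = 0
    while removed < false_positives and (
        (false_positives - removed) / (total - removed) >= target
    ):
        removed += 1
    return removed

print(round(false_positive_rate(1240, 255) * 100, 1))  # 20.6
print(false_positives_to_remove(1240, 255, 0.15))      # 82
```

Because every removed false positive also shrinks the denominator, the required reduction (82) is larger than the naive 255 - 0.15 × 1,240 = 69.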
  
  Alert reduction (ongoing):
    Alert tuning: 12 monitors adjusted (threshold optimization)
    Alert grouping: Correlation rules (reduce noise)
    Dead band: 30 minutes (prevent flapping)
    Maintenance windows: Suppress during known maintenance
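
One possible reading of the 30-minute dead band is a per-monitor suppression window: after a notification is sent, repeats from the same monitor are held back for 30 minutes. The sketch below assumes that interpretation; `FlapSuppressor` is a hypothetical helper, not a Datadog or PagerDuty feature.

```python
from datetime import datetime, timedelta

class FlapSuppressor:
    """Drop repeat notifications from the same monitor inside the dead band."""

    def __init__(self, dead_band: timedelta = timedelta(minutes=30)):
        self.dead_band = dead_band
        self._last_sent = {}  # monitor id -> time of last notification

    def should_notify(self, monitor_id: str, now: datetime) -> bool:
        last = self._last_sent.get(monitor_id)
        if last is not None and now - last < self.dead_band:
            return False
        self._last_sent[monitor_id] = now
        return True
```

A suppressed event does not restart the window, so a monitor that keeps flapping still notifies once per dead band.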
  
  Alert fatigue prevention:
    - No more than 3 concurrent P1/P2 alerts (team capacity)
    - Alert consolidation (related alerts → single incident)
    - Quiet hours: Non-critical alerts suppressed (12 AM - 6 AM)
    - Runbook required: Every alert → documented response

ALERT STATISTICS (by team):
  ┌──────────────────────┬──────────┬──────────┬──────────┐
  │ Team                 │ Alerts   │ FPR      │ Avg Res  │
  ├──────────────────────┼──────────┼──────────┼──────────┤
  │ Infrastructure       │ 320      │ 18%      │ 25 min   │
  │ Applications         │ 285      │ 15%      │ 35 min   │
  │ Database             │ 95       │ 8%       │ 20 min   │
  │ Security             │ 180      │ 22%      │ 15 min   │
  │ SRE                  │ 145      │ 12%      │ 30 min   │
  │ Network              │ 110      │ 10%      │ 20 min   │
  │ DevOps/Platform      │ 105      │ 14%      │ 25 min   │
  │ ──────────────────── │ ──────── │ ──────── │ ──────── │
  │ TOTAL                │ 1,240    │ 20.6%    │ 24 min   │
  └──────────────────────┴──────────┴──────────┴──────────┘

  Notes:
    - FPR = False Positive Rate (lower is better)
    - Target FPR: <15% (tuning in progress)
    - Security alerts have a higher FPR (many require investigation rather than action)
    - Avg. resolution time: 24 minutes (all teams)

AUTOMATED RESPONSE (alert → action):
  Auto-remediation playbooks:
    1. High CPU → Scale out (add pod replicas; auto-approved, K8s HPA)
    2. High memory → Restart pod (safe, K8s)
    3. Disk >90% → Clean temp files (auto, Linux)
    4. TLS cert expiring (within 30 days) → Renew (cert-manager)
    5. Health check fail → Restart service (auto, systemd)
    6. Database connections >80% → Scale read replicas (auto, RDS)
    7. CDN error rate >5% → Failover (auto, Route53)
  
  Auto-remediation rate: 35% (of alerts)
  Success rate: 92% (auto-remediation resolves issue)
  Failed auto-remediation: Escalated to on-call (immediate)
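
The alert-to-action flow can be sketched as a lookup table of playbooks with escalation as the fallback. Everything here is illustrative: the alert-type keys and the stub actions are hypothetical, and real playbooks would call Kubernetes, systemd, or cloud APIs.

```python
from typing import Callable, Dict

def clean_temp_files(alert: dict) -> bool:
    return True  # stub: pretend the cleanup freed enough space

def restart_service(alert: dict) -> bool:
    raise RuntimeError("service did not come back")  # stub: simulate a failure

# Hypothetical alert-type keys; a real registry would mirror the monitor names.
PLAYBOOKS: Dict[str, Callable[[dict], bool]] = {
    "disk_over_90": clean_temp_files,
    "health_check_fail": restart_service,
}

def handle(alert: dict) -> str:
    """Run the matching playbook; escalate when none exists or it fails."""
    action = PLAYBOOKS.get(alert["type"])
    if action is None:
        return "escalated"
    try:
        resolved = action(alert)
    except Exception:
        resolved = False
    return "auto-remediated" if resolved else "escalated"

# Share of all alerts closed without a human, from the figures above.
AUTO_RESOLVED_SHARE = 0.35 * 0.92  # about 0.322
```

With a 35% auto-remediation rate and a 92% success rate, roughly 32% of all alerts are resolved without paging anyone.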

SLI/SLO/SLA Management

Reliability Engineering

SERVICE LEVEL MANAGEMENT:
════════════════════════

SLO FRAMEWORK (Google SRE methodology):
  Service Level Indicators (SLIs): What we measure
  Service Level Objectives (SLOs): Target reliability
  Service Level Agreements (SLAs): Customer commitments
  Error Budget: Allowed failures (1 - SLO)

CRITICAL SERVICES SLOs:
  ┌──────────────────────────┬──────────┬──────────┬──────────┬──────────┐
  │ Service                  │ SLO      │ Budget   │ Actual   │ Status   │
  │                          │          │ (monthly)│ (Jan)    │          │
  ├──────────────────────────┼──────────┼──────────┼──────────┼──────────┤
  │ Customer API             │ 99.95%   │ 21.6 min │ 8.2 min  │ ✓ Green  │
  │   (availability)         │          │ downtime │ downtime │          │
  │ Customer API             │ <200ms   │ 0.05%    │ 0.03%    │ ✓ Green  │
  │   (latency p99)          │ p99      │ requests │ requests │          │
  │ Internal web app         │ 99.9%    │ 43.2 min │ 12.5 min │ ✓ Green  │
  │   (availability)         │          │ downtime │ downtime │          │
  │ CI/CD pipeline           │ 99.5%    │ 216 min  │ 45 min   │ ✓ Green  │
  │   (availability)         │          │ downtime │ downtime │          │
  │ Database                 │ 99.99%   │ 4.3 min  │ 1.2 min  │ ✓ Green  │
  │   (availability)         │          │ downtime │ downtime │          │
  │ Email service            │ 99.9%    │ 43.2 min │ 0 min    │ ✓ Green  │
  │   (delivery)             │          │ failures │ failures │          │
  │ Backup system            │ 99.9%    │ 43.2 min │ 5.8 min  │ ✓ Green  │
  │   (completion)           │          │ failures │ failures │          │
  │ Authentication (Okta)    │ 99.95%   │ 21.6 min │ 3.1 min  │ ✓ Green  │
  │   (availability)         │          │ downtime │ downtime │          │
  │ File storage (S3/Blob)   │ 99.99%   │ 4.3 min  │ 0 min    │ ✓ Green  │
  │   (availability)         │          │ downtime │ downtime │          │
  └──────────────────────────┴──────────┴──────────┴──────────┴──────────┘

  Notes:
    - All SLOs met (January 2025)
    - Error budget burn rate: Healthy (all services)
    - No SLO breaches in past 6 months
    - Quarterly SLO review (adjust as needed)
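
The downtime budgets in the table follow from the error budget definition (1 - SLO) applied to a 30-day month of 43,200 minutes. A minimal sketch, with illustrative function names:

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200

def error_budget_minutes(slo_percent: float,
                         period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Allowed downtime for an availability SLO over the period."""
    return (1 - slo_percent / 100) * period_minutes

def budget_consumed(downtime_minutes: float, slo_percent: float) -> float:
    """Fraction of the monthly error budget already used."""
    return downtime_minutes / error_budget_minutes(slo_percent)

print(round(error_budget_minutes(99.95), 1))   # 21.6
print(round(error_budget_minutes(99.99), 1))   # 4.3
print(round(budget_consumed(8.2, 99.95), 2))   # 0.38
```

The last line shows the Customer API using about 38% of its availability budget in January.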

SLO BURN RATE MONITORING:
  1x burn rate: 1 month budget consumed in 1 month (normal)
  10x burn rate: 1 month budget consumed in 3 days (warning)
  14x burn rate: 1 month budget consumed in ~2.1 days (alert)
  52x burn rate: 1 month budget consumed in ~14 hours (emergency)
  
  Current burn rates (all services): <1x (healthy)
  Alert threshold: 10x burn rate (triggers investigation)
  Emergency threshold: 52x burn rate (triggers incident)
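
Under the usual definition, a burn rate of N means the budget is being spent N times faster than the SLO allows, so a 30-day budget lasts 30/N days. A small sketch of that relationship (function names are illustrative):

```python
def burn_rate(budget_fraction_used: float, window_fraction_elapsed: float) -> float:
    """Budget spent relative to time elapsed; 1.0 means exactly on pace."""
    return budget_fraction_used / window_fraction_elapsed

def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """How long a full budget lasts at a constant burn rate."""
    if rate <= 0:
        raise ValueError("burn rate must be positive")
    return window_days / rate

print(days_to_exhaustion(10))                 # 3.0 days
print(round(days_to_exhaustion(52) * 24, 1))  # 13.8 hours
```

For example, spending 2% of the monthly budget within a single hour (1/720 of the window) is a burn rate of 14.4.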

SLA MANAGEMENT (Customer-facing):
  Enterprise SLAs (contractual):
    API availability: 99.95% (credit: 10% per 0.1% below)
    API latency (p99): <200ms (credit: 5% per 50ms above)
    Support response: <4 hours (credit: 2% per hour above)
    Data durability: 99.999999999% (11 nines — S3 native)
  
  SLA compliance (January 2025):
    API availability: 99.992% (above 99.95% SLA) ✓
    API latency (p99): 145ms (below 200ms SLA) ✓
    Support response: 2.8 hours avg. (below 4h SLA) ✓
    Data durability: 100% (no data loss) ✓
    SLA credits issued: $0 (all SLAs met)
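
The availability credit rule (10% per 0.1% below the SLA) could be computed as below. Two details are assumptions, since the contract wording is not given here: each started 0.1 percentage point counts as a full step, and the credit is capped at 100%. `Decimal` is used so that a shortfall of exactly 0.1 is not rounded up by floating-point error.

```python
from decimal import Decimal, ROUND_CEILING

def availability_credit_pct(measured: str, sla: str = "99.95",
                            step: str = "0.1", credit_per_step: int = 10,
                            cap: int = 100) -> int:
    """Service credit (percent of fees) for a measured monthly availability."""
    shortfall = Decimal(sla) - Decimal(measured)
    if shortfall <= 0:
        return 0
    # Assumption: a partially used step is billed as a whole step.
    steps = (shortfall / Decimal(step)).to_integral_value(rounding=ROUND_CEILING)
    return min(int(steps) * credit_per_step, cap)

print(availability_credit_pct("99.95"))  # 0
print(availability_credit_pct("99.85"))  # 10
print(availability_credit_pct("99.80"))  # 20
```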

CAPACITY PLANNING:
  Infrastructure utilization (January 2025):
    ┌──────────────────────────┬──────────┬──────────┬──────────┐
    │ Resource                 │ Current  │ Target   │ Headroom │
    ├──────────────────────────┼──────────┼──────────┼──────────┤
    │ CPU (servers)            │ 45%      │ <70%     │ 55%      │
    │ Memory (servers)         │ 58%      │ <75%     │ 42%      │
    │ Disk (block storage)     │ 52%      │ <80%     │ 48%      │
    │ Network bandwidth        │ 35%      │ <70%     │ 65%      │
    │ Database connections     │ 48%      │ <75%     │ 52%      │
    │ K8s cluster nodes        │ 55%      │ <70%     │ 45%      │
    │ Cloud budget             │ 62%      │ <80%     │ 38%      │
    │ ──────────────────────── │ ──────── │ ──────── │ ──────── │
    │ Status                   │ Healthy  │ Healthy  │ Healthy  │
    └──────────────────────────┴──────────┴──────────┴──────────┘

  Capacity forecasting:
    Current growth rate: 15%/quarter (users + data)
    12-month projection:
      CPU: 45% → 82% (scale needed in ~9 months)
      Memory: 58% → 105% (scale needed in ~7 months) ⚠️
      Disk: 52% → 95% (scale needed in ~10 months)
      Network: 35% → 65% (adequate for 12 months)
      Database: 48% → 88% (scale needed in ~8 months)
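
A simple compound-growth model gives a rough cross-check on such projections. This sketch assumes utilization grows at the same 15% per quarter as users and data, and treats the target utilization from the table as the point where scaling is needed; it is a simplification and will not reproduce the listed figures exactly, which presumably reflect per-resource trends.

```python
import math

def project(current_pct: float, quarterly_growth: float = 0.15,
            months: float = 12) -> float:
    """Projected utilization after `months` of compound quarterly growth."""
    return current_pct * (1 + quarterly_growth) ** (months / 3)

def months_until(current_pct: float, threshold_pct: float,
                 quarterly_growth: float = 0.15) -> float:
    """Months until utilization crosses the threshold under the same model."""
    if current_pct >= threshold_pct:
        return 0.0
    quarters = math.log(threshold_pct / current_pct) / math.log(1 + quarterly_growth)
    return quarters * 3

print(round(project(45), 1))  # 78.7 (CPU after 12 months)
print(months_until(45, 70))   # CPU: months until the 70% target is crossed
print(months_until(58, 75))   # memory: months until the 75% target is crossed
```

Memory crosses its target first under this model as well, which is consistent with it being flagged as the most urgent resource.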
  
  Scaling strategy:
    Vertical scaling (larger instances): Memory, CPU (immediate)
    Horizontal scaling (more instances): App tier, K8s nodes
    Storage expansion: Auto-scale EBS/Blob (AWS/Azure)
    Database: Read replicas + partitioning (planned)
    Budget: $50K reserved for Q2 capacity expansion
  
  Capacity review cadence: Quarterly (next: Q1 — March 2025)

Output

Infrastructure Health Dashboard

INFRASTRUCTURE HEALTH DASHBOARD — Jan 2025
═══════════════════════════════════════════

Availability:
  Overall uptime: 99.97% (15.8 min downtime total)
  Critical services: 100% (all SLOs met)
  P1 incidents: 0
  P2 incidents: 2 (resolved within SLA)
  Mean time to detect: 4.2 minutes
  Mean time to resolve: 24 minutes (avg.)

Monitoring:
  Assets monitored: 1,291
  Agent coverage: 99.1%
  Metrics (series): 85K
  Logs/minute: 180K
  Traces/day: 2.5M
  Data gaps: 0.2/day (auto-recovered)

Alerting:
  Active monitors: 160
  Alerts (January): 1,240
  False positive rate: 20.6% (target: <15%)
  Auto-remediation: 35% of alerts (92% success)
  On-call satisfaction: 4.1/5.0

Reliability:
  SLOs defined: 9 (critical services)
  SLO compliance: 100% (all met)
  Error budget: Healthy (<1x burn rate all)
  SLA compliance: 100% (all met)
  MTBF: 380 hours (avg.)
  MTTR: 24 minutes (avg.)

Capacity:
  CPU utilization: 45% (target: <70%) ✓
  Memory utilization: 58% (target: <75%) ✓
  Disk utilization: 52% (target: <80%) ✓
  Network bandwidth: 35% (target: <70%) ✓
  12-month forecast: Memory scaling needed (~7 months)
  Budget: 62% of annual (on track)

Actions:
  1. Alert tuning (reduce FPR from 20.6% to <15%)
  2. Capacity planning review (Q1 — March)
  3. Memory scaling plan (design — February)
  4. SLO review (quarterly — April)
  5. Monitoring tool review (annual — Q2)

Integration Points

Edge Cases