---
name: infrastructure-monitoring
description: Manage infrastructure monitoring and observability including metrics collection, log aggregation, distributed tracing, alerting and escalation, SLA/SLO tracking, capacity planning, and incident correlation. Use when monitoring system health, managing alerts, investigating performance issues, tracking SLIs/SLOs, or planning infrastructure capacity. Triggers on phrases like "monitoring", "observability", "metrics", "alerts", "SLO", "SLA", "log aggregation", "distributed tracing", "dashboard", "capacity planning", "system health", "uptime", "performance monitoring", "APM", "log analysis", "incident correlation", "Prometheus", "Grafana", "Datadog", "PagerDuty".
---

# Infrastructure Monitoring & Observability

Ensure system reliability through comprehensive monitoring, proactive alerting, and data-driven capacity planning.

## Monitoring Stack & Architecture

### Observability Framework

```
OBSERVABILITY FRAMEWORK:
════════════════════════

MONITORING STACK:
  Metrics: Datadog (primary) + Prometheus (K8s native)
  Logs: Datadog Log Management + ELK Stack (backup)
  Tracing: Datadog APM + OpenTelemetry (standard)
  Alerting: Datadog Monitors + PagerDuty (escalation)
  Dashboards: Datadog Dashboards + Grafana (custom)
  Uptime: Datadog Synthetic Monitoring + UptimeRobot (backup)

MONITORING COVERAGE:
  ┌──────────────────────────┬──────────┬─────────────────────┐
  │ Layer                    │ Count    │ Tool                │
  ├──────────────────────────┼──────────┼─────────────────────┤
  │ Cloud infrastructure     │ 115      │ Datadog AWS/Azure   │
  │   (AWS + Azure)          │          │ integration         │
  │ Kubernetes clusters      │ 8        │ Prometheus + DD     │
  │ Kubernetes pods          │ 380      │ Prometheus + DD     │
  │ VMs (Linux + Windows)    │ 165      │ Datadog Agent       │
  │ Containers (non-K8s)     │ 45       │ Datadog Agent       │
  │ Databases                │ 28       │ Datadog DB monitor  │
  │   (PostgreSQL, MySQL,    │          │                     │
  │   MongoDB, Redis)        │          │                     │
  │ Networks (VPC, load      │ 42       │ Datadog Network     │
  │   balancers, DNS)        │          │                     │
  │ Applications             │ 35       │ Datadog APM         │
  │   (microservices, APIs)  │          │                     │
  │ CDNs/WAF                 │ 5        │ Datadog Synthetics  │
  │ SaaS applications        │ 18       │ Datadog SaaS        │
  │   (Salesforce, M365,     │          │ integration         │
  │   Slack, etc.)           │          │                     │
  │ Endpoints (workstations) │ 450      │ CrowdStrike (basic) │
  │ ──────────────────────── │ ──────── │ ─────────────────── │
  │ TOTAL                    │ 1,293    │ Multiple            │
  └──────────────────────────┴──────────┴─────────────────────┘

  Monitoring depth:
    Infrastructure: CPU, memory, disk, network (30-second interval)
    Application: Response time, error rate, throughput (10-second interval)
    Database: Queries/second, connections, latency (15-second interval)
    Network: Bandwidth, packet loss, latency (30-second interval)
    Business: API calls, transactions, user activity (1-minute interval)

METRICS COLLECTION:
  Total metrics (series): 85K (unique)
  Data points/minute: 2.5M
  Data points/day: 3.6B
  Storage: 90 days (hot) + 1 year (cold)
  Cost optimization:
    - Cardinality management (tag limiting)
    - Metric tiering (critical: 10s, standard: 30s, background: 5min)
    - Auto-retention (low-value metrics → cold after 30 days)
    - Monthly cost: ~$8K (Datadog) + $3K (infrastructure)

LOG MANAGEMENT:
  Total logs/minute: 180K
  Total logs/day: 260M
  Log sources: 95 (servers, containers, applications, databases, networks)
  Retention: 30 days (standard) + 1 year (security logs)
  Parsing: Structured (JSON preferred) + regex parsing (legacy)
  Indexing: Full text + structured fields
  Cost optimization:
    - Log sampling (info/debug: 10%)
    - Log filtering (exclude noisy patterns)
    - Auto-archive (logs >30 days → cold storage)

DISTRIBUTED TRACING:
  Trace volume: 2.5M traces/day
  Sample rate: 10% (standard) + 100% (error traces)
  Propagation: W3C Trace Context (standard)
  Coverage: 85% of microservices (instrumented)
  Instrumentation: Datadog APM tracing libraries (Java, Python, Node.js, Go)
  End-to-end visibility: API → Service → Database → Cache

MONITORING HEALTH:
  Agent coverage: 99.1% (12 assets without agent — decommissioning)
  Agent uptime: 99.8% (avg. per agent)
  Data gap incidents: 0.2/day (avg. — auto-recovered)
  Alert noise: Managed (see alerting section)
  Dashboard freshness: Real-time (<30 second latency)
```
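
The tracing policy above (10% of standard traces, 100% of error traces) can be illustrated with a deterministic, hash-based sampling decision. This is a minimal sketch, not how the Datadog agent or an OpenTelemetry SDK implements sampling; `keep_trace` and the constants are hypothetical names introduced for illustration.

```python
import hashlib

# Rates taken from the DISTRIBUTED TRACING section above.
STANDARD_SAMPLE_RATE = 0.10  # keep 10% of ordinary traces
HASH_SPACE = 2 ** 64         # size of the 8-byte hash bucket space


def keep_trace(trace_id: str, is_error: bool,
               rate: float = STANDARD_SAMPLE_RATE) -> bool:
    """Return True if the trace should be retained.

    Error traces are always kept. Other traces are kept when a hash of
    the trace ID falls below the sampling rate, so every service that
    sees the same trace ID reaches the same decision.
    """
    if is_error:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * HASH_SPACE


if __name__ == "__main__":
    sample = [f"trace-{i}" for i in range(10_000)]
    kept = sum(keep_trace(t, is_error=False) for t in sample)
    print(f"kept {kept} of {len(sample)} non-error traces")
```

Hashing the trace ID rather than drawing a random number keeps the decision consistent across services, which matters when context is propagated with W3C Trace Context.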

## Alerting & Escalation

### Intelligent Alerting Framework

```
ALERTING FRAMEWORK:
════════════════════

ALERT CLASSIFICATION:
  ┌───────────────────────┬──────────┬───────────┬───────────────────┐
  │ Severity              │ Count    │ Response  │ Escalation        │
  │                       │ (active) │ Time      │ Path              │
  ├───────────────────────┼──────────┼───────────┼───────────────────┤
  │ P1 — Critical         │ 18       │ Immediate │ L1 → L2 → L3      │
  │   (system down, data  │          │ (<5 min)  │ → CTO (30 min)    │
  │   loss)               │          │           │                   │
  │ P2 — High             │ 35       │ <15 min   │ L1 → L2 → L3      │
  │   (degraded perf,     │          │           │ (2 hours)         │
  │   service impact)     │          │           │                   │
  │ P3 — Medium           │ 62       │ <1 hour   │ L1 → L2           │
  │   (warning,           │          │           │ (4 hours)         │
  │   capacity threshold) │          │           │                   │
  │ P4 — Low              │ 45       │ <4 hours  │ L1                │
  │   (informational,     │          │           │ (next business    │
  │   advisory)           │          │           │ day)              │
  │ ───────────────────── │ ──────── │ ───────── │ ───────────────── │
  │ TOTAL                 │ 160      │           │                   │
  └───────────────────────┴──────────┴───────────┴───────────────────┘

ALERT TOPOLOGY:
  Datadog Monitors (source) → PagerDuty (routing) → On-call (response)
  
  Routing rules:
    Infrastructure alerts → Infrastructure team (rotating on-call)
    Application alerts → Application team (service owner)
    Database alerts → Database team (DBA on-call)
    Security alerts → Security team (SOC on-call)
    Multi-service alerts → SRE team (cross-functional)
  
  Escalation policy:
    Level 1: Primary on-call (15 min acknowledgment)
    Level 2: Backup on-call (if L1 unacknowledged after 15 min)
    Level 3: Team lead (if L2 unacknowledged after 15 min)
    Level 4: Manager/CTO (if L3 unacknowledged after 15 min)
  
  On-call rotation:
    Frequency: Weekly rotation
    Coverage: 24 hours/day, 7 days/week
    Max consecutive: 2 weeks (with 4-week break)
    Handoff: 30-minute overlap (call + documentation)
    Compensation: On-call stipend ($50/week) + PTO backup

ALERT QUALITY METRICS (January 2025):
  Total alerts: 1,240 (40.0/day avg. over 31 days)
  Actionable alerts: 985 (79.4% — good)
  False positives: 255 (20.6% — target: <15%)
  
  Alert reduction (ongoing):
    Alert tuning: 12 monitors adjusted (threshold optimization)
    Alert grouping: Correlation rules (reduce noise)
    Dead band: 30 minutes (prevent flapping)
    Maintenance windows: Suppress during known maintenance
  
  Alert fatigue prevention:
    - No more than 3 concurrent P1/P2 alerts (team capacity)
    - Alert consolidation (related alerts → single incident)
    - Quiet hours: Non-critical alerts suppressed (12 AM - 6 AM)
    - Runbook required: Every alert → documented response

ALERT STATISTICS (by team):
  ┌──────────────────────┬──────────┬──────────┬──────────┐
  │ Team                 │ Alerts   │ FPR      │ Avg Res  │
  ├──────────────────────┼──────────┼──────────┼──────────┤
  │ Infrastructure       │ 320      │ 18%      │ 25 min   │
  │ Applications         │ 285      │ 15%      │ 35 min   │
  │ Database             │ 95       │ 8%       │ 20 min   │
  │ Security             │ 180      │ 22%      │ 15 min   │
  │ SRE                  │ 145      │ 12%      │ 30 min   │
  │ Network              │ 110      │ 10%      │ 20 min   │
  │ DevOps/Platform      │ 105      │ 14%      │ 25 min   │
  │ ──────────────────── │ ──────── │ ──────── │ ──────── │
  │ TOTAL                │ 1,240    │ 20.6%    │ 24 min   │
  └──────────────────────┴──────────┴──────────┴──────────┘

  Notes:
    - FPR = False Positive Rate (lower is better)
    - Target FPR: <15% (tuning in progress)
    - Security alerts have a higher FPR (many require investigation rather than action)
    - Avg. resolution time: 24 minutes (all teams)

AUTOMATED RESPONSE (alert → action):
  Auto-remediation playbooks:
    1. High CPU → Scale out (auto-approved, K8s HPA)
    2. High memory → Restart pod (safe, K8s)
    3. Disk >90% → Clean temp files (auto, Linux)
    4. SSL cert expiring (30 days) → Renew (cert-manager)
    5. Health check fail → Restart service (auto, systemd)
    6. Database connections >80% → Scale read replicas (auto, RDS)
    7. CDN error rate >5% → Failover (auto, Route53)
  
  Auto-remediation rate: 35% (of alerts)
  Success rate: 92% (auto-remediation resolves issue)
  Failed auto-remediation: Escalated to on-call (immediate)
```
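
The escalation policy and the alert quality figures above reduce to simple arithmetic. The sketch below assumes each level gets the same 15-minute acknowledgment window and that the page moves up exactly when a window expires; `escalation_level` and `alert_quality` are hypothetical helpers, not PagerDuty or Datadog APIs.

```python
# Escalation chain from the policy above, in paging order.
ESCALATION_LEVELS = ["Primary on-call", "Backup on-call",
                     "Team lead", "Manager/CTO"]
ACK_WINDOW_MIN = 15  # minutes each level has to acknowledge


def escalation_level(minutes_unacknowledged: float) -> int:
    """Return the 1-based level that is being paged right now."""
    if minutes_unacknowledged < 0:
        raise ValueError("minutes_unacknowledged must be non-negative")
    expired_windows = int(minutes_unacknowledged // ACK_WINDOW_MIN)
    # The chain stops at the last level; it does not wrap around.
    return min(expired_windows + 1, len(ESCALATION_LEVELS))


def alert_quality(total: int, false_positives: int) -> dict:
    """Split an alert count into actionable and false-positive shares."""
    if total <= 0 or not 0 <= false_positives <= total:
        raise ValueError("need total > 0 and 0 <= false_positives <= total")
    actionable = total - false_positives
    return {
        "actionable": actionable,
        "actionable_pct": round(100 * actionable / total, 1),
        "false_positive_pct": round(100 * false_positives / total, 1),
    }


if __name__ == "__main__":
    for minutes in (0, 15, 30, 45):
        level = escalation_level(minutes)
        print(minutes, "min ->", ESCALATION_LEVELS[level - 1])
    print(alert_quality(total=1240, false_positives=255))
```

With the January counts (1,240 alerts, 255 false positives) the helper reproduces the 79.4% actionable and 20.6% false-positive shares reported above.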

## SLI/SLO/SLA Management

### Reliability Engineering

```
SERVICE LEVEL MANAGEMENT:
════════════════════════

SLO FRAMEWORK (Google SRE methodology):
  Service Level Indicators (SLIs): What we measure
  Service Level Objectives (SLOs): Target reliability
  Service Level Agreements (SLAs): Customer commitments
  Error Budget: Allowed failures (1 - SLO)

CRITICAL SERVICES SLOs:
  ┌──────────────────────────┬──────────┬──────────┬──────────┬──────────┐
  │ Service                  │ SLO      │ Budget   │ Actual   │ Status   │
  │                          │          │ (monthly)│ (Jan)    │          │
  ├──────────────────────────┼──────────┼──────────┼──────────┼──────────┤
  │ Customer API             │ 99.95%   │ 21.6 min │ 8.2 min  │ ✓ Green  │
  │   (availability)         │          │ downtime │ downtime │          │
  │ Customer API             │ <200ms   │ 0.05%    │ 0.03%    │ ✓ Green  │
  │   (latency p99)          │ p99      │ requests │ requests │          │
  │ Internal web app         │ 99.9%    │ 43.2 min │ 12.5 min │ ✓ Green  │
  │   (availability)         │          │ downtime │ downtime │          │
  │ CI/CD pipeline           │ 99.5%    │ 216 min  │ 45 min   │ ✓ Green  │
  │   (availability)         │          │ downtime │ downtime │          │
  │ Database                 │ 99.99%   │ 4.3 min  │ 1.2 min  │ ✓ Green  │
  │   (availability)         │          │ downtime │ downtime │          │
  │ Email service            │ 99.9%    │ 43.2 min │ 0 min    │ ✓ Green  │
  │   (delivery)             │          │ failures │ failures │          │
  │ Backup system            │ 99.9%    │ 43.2 min │ 5.8 min  │ ✓ Green  │
  │   (completion)           │          │ failures │ failures │          │
  │ Authentication (Okta)    │ 99.95%   │ 21.6 min │ 3.1 min  │ ✓ Green  │
  │   (availability)         │          │ downtime │ downtime │          │
  │ File storage (S3/Blob)   │ 99.99%   │ 4.3 min  │ 0 min    │ ✓ Green  │
  │   (availability)         │          │ downtime │ downtime │          │
  └──────────────────────────┴──────────┴──────────┴──────────┴──────────┘

  Notes:
    - All SLOs met (January 2025)
    - Error budget burn rate: Healthy (all services)
    - No SLO breaches in past 6 months
    - Quarterly SLO review (adjust as needed)

SLO BURN RATE MONITORING:
  1x burn rate: 1 month budget consumed in 1 month (normal)
  10x burn rate: 1 month budget consumed in 3 days (warning)
  14x burn rate: 1 month budget consumed in ~2.1 days (alert)
  52x burn rate: 1 month budget consumed in ~14 hours (emergency)
  
  Current burn rates (all services): <1x (healthy)
  Alert threshold: 10x burn rate (triggers investigation)
  Emergency threshold: 52x burn rate (triggers incident)

SLA MANAGEMENT (Customer-facing):
  Enterprise SLAs (contractual):
    API availability: 99.95% (credit: 10% per 0.1% below)
    API latency (p99): <200ms (credit: 5% per 50ms above)
    Support response: <4 hours (credit: 2% per hour above)
    Data durability: 99.999999999% (11 nines — S3 native)
  
  SLA compliance (January 2025):
    API availability: 99.992% (above 99.95% SLA) ✓
    API latency (p99): 145ms (below 200ms SLA) ✓
    Support response: 2.8 hours avg. (below 4h SLA) ✓
    Data durability: 100% (no data loss) ✓
    SLA credits issued: $0 (all SLAs met)

CAPACITY PLANNING:
  Infrastructure utilization (January 2025):
    ┌──────────────────────────┬──────────┬──────────┬──────────┐
    │ Resource                 │ Current  │ Target   │ Headroom │
    ├──────────────────────────┼──────────┼──────────┼──────────┤
    │ CPU (servers)            │ 45%      │ <70%     │ 55%      │
    │ Memory (servers)         │ 58%      │ <75%     │ 42%      │
    │ Disk (block storage)     │ 52%      │ <80%     │ 48%      │
    │ Network bandwidth        │ 35%      │ <70%     │ 65%      │
    │ Database connections     │ 48%      │ <75%     │ 52%      │
    │ K8s cluster nodes        │ 55%      │ <70%     │ 45%      │
    │ Cloud budget             │ 62%      │ <80%     │ 38%      │
    │ ──────────────────────── │ ──────── │ ──────── │ ──────── │
    │ Status                   │ Healthy  │ Healthy  │ Healthy  │
    └──────────────────────────┴──────────┴──────────┴──────────┘

  Capacity forecasting:
    Current growth rate: 15%/quarter (users + data)
    12-month projection:
      CPU: 45% → 82% (scale needed in ~9 months)
      Memory: 58% → 105% (scale needed in ~7 months) ⚠️
      Disk: 52% → 95% (scale needed in ~10 months)
      Network: 35% → 65% (adequate for 12 months)
      Database: 48% → 88% (scale needed in ~8 months)
  
  Scaling strategy:
    Vertical scaling (larger instances): Memory, CPU (immediate)
    Horizontal scaling (more instances): App tier, K8s nodes
    Storage expansion: Auto-scale EBS/Blob (AWS/Azure)
    Database: Read replicas + partitioning (planned)
    Budget: $50K reserved for Q2 capacity expansion
  
  Capacity review cadence: Quarterly (next: Q1 — March 2025)
```
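
The monthly error budgets and burn rates above follow from two relations: budget = (1 - SLO) × window, and a constant burn rate B exhausts the budget in window / B. The sketch below assumes a 30-day window, which matches the budgets in the SLO table; the function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo_pct / 100) * window_days * MINUTES_PER_DAY


def days_to_exhaust(burn_rate: float, window_days: int = 30) -> float:
    """Days until the whole budget is gone at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return window_days / burn_rate


def burn_rate(consumed_min: float, elapsed_days: float,
              slo_pct: float, window_days: int = 30) -> float:
    """Observed burn rate: budget used relative to the even-spend pace."""
    budget = error_budget_minutes(slo_pct, window_days)
    expected_so_far = budget * elapsed_days / window_days
    return consumed_min / expected_so_far


if __name__ == "__main__":
    for slo in (99.5, 99.9, 99.95, 99.99):
        print(f"SLO {slo}% -> {error_budget_minutes(slo):.1f} min/month")
    for rate in (1, 10, 14, 52):
        print(f"{rate}x -> budget gone in {days_to_exhaust(rate):.2f} days")
```

A burn rate below 1x means the service is spending its budget more slowly than the even-spend pace, which is the "healthy" state reported for all services.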

## Output

### Infrastructure Health Dashboard

```
INFRASTRUCTURE HEALTH DASHBOARD — Jan 2025
═══════════════════════════════════════════

Availability:
  Overall uptime: 99.97% (15.8 min downtime total)
  Critical services: 100% (all SLOs met)
  P1 incidents: 0
  P2 incidents: 2 (resolved within SLA)
  Mean time to detect: 4.2 minutes
  Mean time to resolve: 24 minutes (avg.)

Monitoring:
  Assets monitored: 1,293
  Agent coverage: 99.1%
  Metrics (series): 85K
  Logs/minute: 180K
  Traces/day: 2.5M
  Data gaps: 0.2/day (auto-recovered)

Alerting:
  Active monitors: 160
  Alerts (January): 1,240
  False positive rate: 20.6% (target: <15%)
  Auto-remediation: 35% of alerts (92% success)
  On-call satisfaction: 4.1/5.0

Reliability:
  SLOs defined: 9 (critical services)
  SLO compliance: 100% (all met)
  Error budget: Healthy (<1x burn rate all)
  SLA compliance: 100% (all met)
  MTBF: 380 hours (avg.)
  MTTR: 24 minutes (avg.)

Capacity:
  CPU utilization: 45% (target: <70%) ✓
  Memory utilization: 58% (target: <75%) ✓
  Disk utilization: 52% (target: <80%) ✓
  Network bandwidth: 35% (target: <70%) ✓
  12-month forecast: Memory scaling needed (~7 months)
  Budget: 62% of annual (on track)

Actions:
  1. Alert tuning (reduce FPR from 20.6% to <15%)
  2. Capacity planning review (Q1 — March)
  3. Memory scaling plan (design — February)
  4. SLO review (quarterly — April)
  5. Monitoring tool review (annual — Q2)
```
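
A forecast line such as "Memory scaling needed" comes from projecting utilization forward until it crosses its target. The sketch below is one way to do that, assuming utilization compounds at the quarterly growth rate; that model is an assumption, so its output need not match the projections reported in this document. `project_utilization` and `months_until` are hypothetical names.

```python
import math


def project_utilization(current_pct: float, quarterly_growth: float,
                        months: float) -> float:
    """Utilization after `months`, compounding once per quarter."""
    return current_pct * (1 + quarterly_growth) ** (months / 3)


def months_until(current_pct: float, threshold_pct: float,
                 quarterly_growth: float) -> float:
    """Months until utilization reaches the threshold.

    Returns 0.0 if the threshold is already reached, and infinity if
    there is no growth to carry utilization up to it.
    """
    if current_pct >= threshold_pct:
        return 0.0
    if quarterly_growth <= 0:
        return math.inf
    quarters = (math.log(threshold_pct / current_pct)
                / math.log(1 + quarterly_growth))
    return 3 * quarters


if __name__ == "__main__":
    # (current %, target %) pairs from the capacity section.
    resources = {"CPU": (45, 70), "Memory": (58, 75), "Disk": (52, 80)}
    for name, (current, target) in resources.items():
        eta = months_until(current, target, quarterly_growth=0.15)
        print(f"{name}: reaches {target}% in about {eta:.1f} months")
```

Swapping in a linear model or per-resource growth rates changes the dates noticeably, so the growth assumption is worth stating next to any forecast.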

## Integration Points

- Cloud providers (AWS CloudWatch, Azure Monitor, GCP Monitoring): Native metrics
- Container platforms (Kubernetes, Docker, ECS): Resource metrics, events
- Databases (PostgreSQL, MySQL, MongoDB, Redis): Performance, connections
- CI/CD platforms (GitHub Actions, Jenkins, GitLab CI): Pipeline metrics
- Logging platforms (ELK, Datadog Logs, Splunk): Log aggregation, search
- APM tools (Datadog APM, New Relic, AppDynamics): Application performance
- Network monitoring (Datadog Network, PRTG): Network metrics
- Configuration management (Ansible, Terraform): Infrastructure state
- ITSM platforms (ServiceNow, Jira): Incident, change management
- Communication (PagerDuty, Slack, Teams): Alerting, escalation
- CMDB: Asset inventory, relationship mapping
- Cost management (CloudHealth, AWS Cost Explorer): Budget tracking

## Edge Cases

- **Cascading failure**: Cross-service correlation; circuit breaker activation; blast radius containment; post-mortem
- **Monitoring blackout (agent failure)**: Fallback monitoring; synthetic checks; manual verification; agent reinstall
- **Alert storm**: Dead band enforcement; alert grouping; rate limiting; on-call support; root cause focus
- **SLO breach imminent**: Error budget alert; performance optimization; capacity scaling; communication
- **Data center outage**: Multi-AZ/region failover; DNS failover; backup activation; customer communication
- **Capacity spike (unexpected)**: Auto-scaling activation; burst capacity; cost monitoring; load shedding
- **Monitoring tool outage**: Backup tool activation; manual health checks; vendor support; SLA credit tracking
- **Noisy neighbor (shared infra)**: Resource isolation; limit enforcement; tenant migration; performance analysis
- **Log volume spike**: Log sampling; cost alert; retention adjustment; root cause; capacity
- **Metric cardinality explosion**: Tag limiting; metric tiering; cost optimization; cleanup
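
For the alert storm case, dead band enforcement can be sketched as a per-alert suppression window. This reading (repeat notifications for the same alert key are dropped for 30 minutes) is an assumption about what the 30-minute dead band means here; `DeadBandFilter` and the alert keys are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict


class DeadBandFilter:
    """Suppress repeat notifications for the same alert within a window."""

    def __init__(self, window: timedelta = timedelta(minutes=30)) -> None:
        self.window = window
        self._last_sent: Dict[str, datetime] = {}

    def should_notify(self, alert_key: str, now: datetime) -> bool:
        """Return True if this alert should page; record it when it does."""
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.window:
            return False  # still inside the dead band: drop the repeat
        self._last_sent[alert_key] = now
        return True


if __name__ == "__main__":
    dead_band = DeadBandFilter()
    start = datetime(2025, 1, 15, 9, 0)
    for offset in (0, 5, 10, 30, 35):
        fired_at = start + timedelta(minutes=offset)
        decision = dead_band.should_notify("cpu-high:web-1", fired_at)
        print(f"+{offset:>2} min notify={decision}")
```

The window restarts only when a notification is actually sent, so a flapping monitor pages at most once per 30 minutes per key while other alert keys are unaffected.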
