Infrastructure Monitoring
Manage infrastructure monitoring and observability including metrics collection, log aggregation, distributed tracing, alerting and escalation, SLA/SLO tracking, capacity planning, and incident correlation. Use when monitoring system health, managing alerts...
Infrastructure Monitoring & Observability
Ensure system reliability through comprehensive monitoring, proactive alerting, and data-driven capacity planning.
Monitoring Stack & Architecture
Observability Framework
OBSERVABILITY FRAMEWORK:
════════════════════════
MONITORING STACK:
Metrics: Datadog (primary) + Prometheus (K8s native)
Logs: Datadog Log Management + ELK Stack (backup)
Tracing: Datadog APM + OpenTelemetry (standard)
Alerting: Datadog Monitors + PagerDuty (escalation)
Dashboards: Datadog Dashboards + Grafana (custom)
Uptime: Datadog Synthetic Monitoring + UptimeRobot (backup)
MONITORING COVERAGE:
┌──────────────────────────┬──────────┬─────────────────────┐
│ Layer │ Count │ Tool │
├──────────────────────────┼──────────┼─────────────────────┤
│ Cloud infrastructure │ 115 │ Datadog AWS/Azure │
│ (AWS + Azure) │ │ integration │
│ Kubernetes clusters │ 8 │ Prometheus + DD │
│ Kubernetes pods │ 380 │ Prometheus + DD │
│ VMs (Linux + Windows) │ 165 │ Datadog Agent │
│ Containers (non-K8s) │ 45 │ Datadog Agent │
│ Databases │ 28 │ Datadog DB monitor │
│ (PostgreSQL, MySQL, │ │ │
│ MongoDB, Redis) │ │ │
│ Networks (VPC, load │ 42 │ Datadog Network │
│ balancers, DNS) │ │ │
│ Applications │ 35 │ Datadog APM │
│ (microservices, APIs) │ │ │
│ CDNs/WAF │ 5 │ Datadog Synthetics │
│ SaaS applications │ 18 │ Datadog SaaS │
│ (Salesforce, M365, │ │ integration │
│ Slack, etc.) │ │ │
│ Endpoints (workstations) │ 450 │ CrowdStrike (basic) │
│ ────────────────────── │ ────── │ ────────────────── │
│ TOTAL │ 1,293 │ Multiple │
└──────────────────────────┴──────────┴─────────────────────┘
Monitoring depth:
Infrastructure: CPU, memory, disk, network (30-second interval)
Application: Response time, error rate, throughput (10-second interval)
Database: Queries/second, connections, latency (15-second interval)
Network: Bandwidth, packet loss, latency (30-second interval)
Business: API calls, transactions, user activity (1-minute interval)
METRICS COLLECTION:
Total metrics (series): 85K (unique)
Data points/minute: 2.5M
Data points/day: 3.6B
Retention: 90 days (hot) + 1 year (cold)
Cost optimization:
- Cardinality management (tag limiting)
- Metric tiering (critical: 10s, standard: 30s, background: 5min)
- Auto-retention (low-value metrics → cold after 30 days)
- Monthly cost: ~$8K (Datadog) + $3K (infrastructure)
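Cardinality management is easier to reason about with a rough estimate of how many series a single metric can produce. The sketch below assumes the worst case, where every combination of tag values occurs, so the series count is the product of distinct values per tag. The tag names, counts, and the `max_values_per_tag` limit are hypothetical and not taken from the stack described above.

```python
from math import prod


def estimated_series(tag_cardinalities: dict) -> int:
    """Worst-case series count for one metric: product of distinct values per tag."""
    return prod(tag_cardinalities.values())


def limit_tags(tag_cardinalities: dict, max_values_per_tag: int) -> dict:
    """Tag limiting: drop any tag whose distinct-value count exceeds the limit."""
    return {tag: n for tag, n in tag_cardinalities.items() if n <= max_values_per_tag}


# Hypothetical tag set for a single request-latency metric.
tags = {"service": 35, "env": 3, "status_code": 8, "user_id": 5000}

before = estimated_series(tags)
after = estimated_series(limit_tags(tags, max_values_per_tag=100))

print(f"series before limiting: {before:,}")  # 4,200,000
print(f"series after limiting:  {after:,}")   # 840
```

Dropping one unbounded tag such as a user identifier removes most of the series in this example, which is why tag limiting is usually the first lever to pull.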
LOG MANAGEMENT:
Total logs/minute: 180K
Total logs/day: 260M
Log sources: 95 (servers, containers, applications, databases, networks)
Retention: 30 days (standard) + 1 year (security logs)
Parsing: Structured (JSON preferred) + regex parsing (legacy)
Indexing: Full text + structured fields
Cost optimization:
- Log sampling (info/debug: 10%)
- Log filtering (exclude noisy patterns)
- Auto-archive (logs >30 days → cold storage)
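The 10% sampling of info/debug logs can be sketched as a deterministic, hash-based decision, so the same log ID always gets the same outcome regardless of which collector handles it. This is a minimal illustration and not how Datadog applies its own sampling rules; the function name `keep_log` and the use of a message ID as the sampling key are assumptions.

```python
import hashlib

SAMPLED_LEVELS = {"info", "debug"}  # levels subject to sampling
SAMPLE_PERCENT = 10                 # keep roughly 10% of those levels


def keep_log(level: str, message_id: str) -> bool:
    """Return True if the log line should be kept for indexing."""
    if level.lower() not in SAMPLED_LEVELS:
        return True  # warnings, errors, and so on are always kept
    # Hash the ID into a stable bucket in [0, 100).
    digest = hashlib.sha256(message_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < SAMPLE_PERCENT


kept = sum(keep_log("info", f"msg-{i}") for i in range(10_000))
print(f"kept {kept} of 10000 info logs")
```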
DISTRIBUTED TRACING:
Trace volume: 2.5M traces/day
Sample rate: 10% (standard) + 100% (error traces)
Propagation: W3C Trace Context (standard)
Coverage: 85% of microservices (instrumented)
Instrumentation: Datadog APM agents (Java, Python, Node.js, Go)
End-to-end visibility: API → Service → Database → Cache
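W3C Trace Context propagation relies on a `traceparent` header of the form `version-traceid-parentid-flags`. The sketch below builds and parses that header for version `00` only; the helper names are hypothetical, and a production service would normally let its tracing library handle propagation.

```python
import re
import secrets
from typing import Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool) -> str:
    """Create a version-00 traceparent header with random IDs."""
    trace_id = secrets.token_hex(16)   # 16 bytes -> 32 hex characters
    parent_id = secrets.token_hex(8)   # 8 bytes -> 16 hex characters
    flags = "01" if sampled else "00"  # bit 0 is the sampled flag
    return f"00-{trace_id}-{parent_id}-{flags}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the header fields, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


example = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(example))
```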
MONITORING HEALTH:
Agent coverage: 99.1% (12 assets without agent — decommissioning)
Agent uptime: 99.8% (avg. per agent)
Data gap incidents: 0.2/day (avg. — auto-recovered)
Alert noise: Managed (see alerting section)
Dashboard freshness: Real-time (<30 second latency)
Alerting & Escalation
Intelligent Alerting Framework
ALERTING FRAMEWORK:
════════════════════
ALERT CLASSIFICATION:
┌────────────────────────┬──────────┬───────────┬───────────────────┐
│ Severity │ Count │ Response │ Escalation │
│ │ (active) │ Time │ Path │
├────────────────────────┼──────────┼───────────┼───────────────────┤
│ P1 — Critical │ 18 │ Immediate │ L1 → L2 → L3 │
│ (system down, data │ │ (<5 min) │ → CTO (30 min) │
│ loss) │ │ │ │
│ P2 — High │ 35 │ <15 min │ L1 → L2 → L3 │
│ (degraded perf, │ │ │ (2 hours) │
│ service impact) │ │ │ │
│ P3 — Medium │ 62 │ <1 hour │ L1 → L2 │
│ (warning, │ │ │ (4 hours) │
│ capacity threshold) │ │ │ │
│ P4 — Low │ 45 │ <4 hours │ L1 │
│ (informational, │ │ │ (next business │
│ advisory) │ │ │ day) │
│ ─────────────────── │ ────── │ ───────── │ ───────────────── │
│ TOTAL │ 160 │ │ │
└────────────────────────┴──────────┴───────────┴───────────────────┘
ALERT TOPOLOGY:
Datadog Monitors (source) → PagerDuty (routing) → On-call (response)
Routing rules:
Infrastructure alerts → Infrastructure team (rotating on-call)
Application alerts → Application team (service owner)
Database alerts → Database team (DBA on-call)
Security alerts → Security team (SOC on-call)
Multi-service alerts → SRE team (cross-functional)
Escalation policy:
Level 1: Primary on-call (15 min acknowledgment)
Level 2: Backup on-call (if L1 unacknowledged after 15 min)
Level 3: Team lead (if L2 unacknowledged after 15 min)
Level 4: Manager/CTO (if L3 unacknowledged after 15 min)
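The escalation policy above amounts to a 15-minute acknowledgment timer per level. A minimal sketch, assuming each level gets exactly 15 minutes before the page moves on and that the last level holds the page indefinitely; in practice PagerDuty enforces this through its escalation policy configuration, and the function name here is hypothetical.

```python
ESCALATION_LEVELS = ["Primary on-call", "Backup on-call", "Team lead", "Manager/CTO"]
ACK_TIMEOUT_MIN = 15  # minutes each level has to acknowledge


def current_level(minutes_unacknowledged: float) -> int:
    """Return the 1-based escalation level that should hold the page right now."""
    if minutes_unacknowledged < 0:
        raise ValueError("minutes_unacknowledged must be non-negative")
    steps = int(minutes_unacknowledged // ACK_TIMEOUT_MIN)
    # The page stays with the last level once every timeout has expired.
    return min(steps, len(ESCALATION_LEVELS) - 1) + 1


for minutes in (0, 14, 15, 30, 45, 120):
    level = current_level(minutes)
    print(f"{minutes:>3} min unacknowledged -> Level {level}: {ESCALATION_LEVELS[level - 1]}")
```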
On-call rotation:
Frequency: Weekly rotation
Duration: 24 hours/day (7 days)
Max consecutive: 2 weeks (with 4-week break)
Handoff: 30-minute overlap (call + documentation)
Compensation: On-call stipend ($50/week) + PTO backup
ALERT QUALITY METRICS (January 2025):
Total alerts: 1,240 (40.0/day avg. over 31 days)
Actionable alerts: 985 (79.4% — good)
False positives: 255 (20.6% — target: <15%)
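The two quality percentages follow directly from the alert counts. A small sketch of the arithmetic, with a hypothetical function name and the 15% target passed as a parameter:

```python
def alert_quality(total: int, actionable: int, fpr_target_pct: float = 15.0) -> dict:
    """Compute actionable and false-positive percentages from alert counts."""
    if total <= 0 or not 0 <= actionable <= total:
        raise ValueError("expected 0 <= actionable <= total and total > 0")
    false_positives = total - actionable
    fpr = 100 * false_positives / total
    return {
        "false_positives": false_positives,
        "actionable_pct": round(100 * actionable / total, 1),
        "false_positive_pct": round(fpr, 1),
        "meets_target": fpr < fpr_target_pct,
    }


print(alert_quality(total=1240, actionable=985))
# {'false_positives': 255, 'actionable_pct': 79.4, 'false_positive_pct': 20.6, 'meets_target': False}
```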
Alert reduction (ongoing):
Alert tuning: 12 monitors adjusted (threshold optimization)
Alert grouping: Correlation rules (reduce noise)
Dead band: 30 minutes (prevent flapping)
Maintenance windows: Suppress during known maintenance
Alert fatigue prevention:
- No more than 3 concurrent P1/P2 alerts (team capacity)
- Alert consolidation (related alerts → single incident)
- Quiet hours: Non-critical alerts suppressed (12 AM - 6 AM)
- Runbook required: Every alert → documented response
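The quiet-hours rule can be expressed as a small notification filter. This sketch assumes that "non-critical" means P3 and P4, and that the 12 AM to 6 AM window is evaluated in the on-call engineer's local time; neither assumption is stated in the policy above, and the function name is hypothetical.

```python
from datetime import time

QUIET_START = time(0, 0)  # 12 AM
QUIET_END = time(6, 0)    # 6 AM (exclusive)
CRITICAL_SEVERITIES = {"P1", "P2"}  # assumption: P3/P4 count as non-critical


def should_notify(severity: str, local_time: time) -> bool:
    """Return False when a non-critical alert falls inside quiet hours."""
    if severity in CRITICAL_SEVERITIES:
        return True  # critical alerts always page
    in_quiet_hours = QUIET_START <= local_time < QUIET_END
    return not in_quiet_hours


print(should_notify("P3", time(2, 30)))  # False
print(should_notify("P1", time(2, 30)))  # True
print(should_notify("P3", time(9, 0)))   # True
```

Suppressed alerts would still need to be queued and delivered after the window ends; that part is left out of the sketch.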
ALERT STATISTICS (by team):
┌──────────────────────┬──────────┬──────────┬──────────┐
│ Team │ Alerts │ FPR │ Avg Res │
├──────────────────────┼──────────┼──────────┼──────────┤
│ Infrastructure │ 320 │ 18% │ 25 min │
│ Applications │ 285 │ 15% │ 35 min │
│ Database │ 95 │ 8% │ 20 min │
│ Security │ 180 │ 22% │ 15 min │
│ SRE │ 145 │ 12% │ 30 min │
│ Network │ 110 │ 10% │ 20 min │
│ DevOps/Platform │ 105 │ 14% │ 25 min │
│ ────────────────── │ ────── │ ────── │ ────── │
│ TOTAL │ 1,240 │ 20.6% │ 24 min │
└──────────────────────┴──────────┴──────────┴──────────┘
Notes:
- FPR = False Positive Rate (lower is better)
- Target FPR: <15% (tuning in progress)
- Security alerts higher FPR (investigation vs. action)
- Avg. resolution time: 24 minutes (all teams)
AUTOMATED RESPONSE (alert → action):
Auto-remediation playbooks:
1. High CPU → Scale out (auto-approved, K8s HPA)
2. High memory → Restart pod (safe, K8s)
3. Disk >90% → Clean temp files (auto, Linux)
4. TLS certificate expiring within 30 days → Renew (cert-manager)
5. Health check fail → Restart service (auto, systemd)
6. Database connections >80% → Scale read replicas (auto, RDS)
7. CDN error rate >5% → Failover (auto, Route53)
Auto-remediation rate: 35% (of alerts)
Success rate: 92% (auto-remediation resolves issue)
Failed auto-remediation: Escalated to on-call (immediate)
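The alert-to-action flow above can be sketched as a dispatch table with an escalation fallback. The playbook functions below are placeholders that only print what they would do; the alert type keys, the alert dictionary shape, and all function names are hypothetical.

```python
from typing import Callable, Dict

Playbook = Callable[[dict], bool]  # returns True if remediation succeeded


def restart_pod(alert: dict) -> bool:
    """Placeholder: a real playbook would call the Kubernetes API here."""
    print(f"restarting pod {alert.get('target')}")
    return True


def clean_temp_files(alert: dict) -> bool:
    """Placeholder: a real playbook would run a cleanup job on the host."""
    print(f"cleaning temp files on {alert.get('target')}")
    return True


PLAYBOOKS: Dict[str, Playbook] = {
    "high_memory": restart_pod,
    "disk_over_90": clean_temp_files,
}


def handle_alert(alert: dict) -> str:
    """Run the matching playbook; escalate when none exists or it fails."""
    playbook = PLAYBOOKS.get(alert["type"])
    if playbook is None:
        return "escalated"
    try:
        succeeded = playbook(alert)
    except Exception:
        succeeded = False  # a crashing playbook counts as a failed remediation
    return "remediated" if succeeded else "escalated"


print(handle_alert({"type": "high_memory", "target": "checkout-7f9c"}))
print(handle_alert({"type": "unknown_condition", "target": "db-01"}))
```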
SLI/SLO/SLA Management
Reliability Engineering
SERVICE LEVEL MANAGEMENT:
════════════════════════
SLO FRAMEWORK (Google SRE methodology):
Service Level Indicators (SLIs): What we measure
Service Level Objectives (SLOs): Target reliability
Service Level Agreements (SLAs): Customer commitments
Error Budget: Allowed failures (1 - SLO)
CRITICAL SERVICES SLOs:
┌──────────────────────────┬──────────┬──────────┬──────────┬──────────┐
│ Service │ SLO │ Budget │ Actual │ Status │
│ │ │ (monthly)│ (Jan) │ │
├──────────────────────────┼──────────┼──────────┼──────────┼──────────┤
│ Customer API │ 99.95% │ 21.6 min │ 8.2 min │ ✓ Green │
│ (availability) │ │ downtime │ downtime │ │
│ Customer API │ <200ms │ 0.05% │ 0.03% │ ✓ Green │
│ (latency p99) │ p99 │ requests │ requests │ │
│ Internal web app │ 99.9% │ 43.2 min │ 12.5 min │ ✓ Green │
│ (availability) │ │ downtime │ downtime │ │
│ CI/CD pipeline │ 99.5% │ 216 min │ 45 min │ ✓ Green │
│ (availability) │ │ downtime │ downtime │ │
│ Database │ 99.99% │ 4.3 min │ 1.2 min │ ✓ Green │
│ (availability) │ │ downtime │ downtime │ │
│ Email service │ 99.9% │ 43.2 min │ 0 min │ ✓ Green │
│ (delivery) │ │ failures │ failures │ │
│ Backup system │ 99.9% │ 43.2 min │ 5.8 min │ ✓ Green │
│ (completion) │ │ failures │ failures │ │
│ Authentication (Okta) │ 99.95% │ 21.6 min │ 3.1 min │ ✓ Green │
│ (availability) │ │ downtime │ downtime │ │
│ File storage (S3/Blob) │ 99.99% │ 4.3 min │ 0 min │ ✓ Green │
│ (availability) │ │ downtime │ downtime │ │
└──────────────────────────┴──────────┴──────────┴──────────┴──────────┘
Notes:
- All SLOs met (January 2025)
- Error budget burn rate: Healthy (all services)
- No SLO breaches in past 6 months
- Quarterly SLO review (adjust as needed)
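The monthly budgets in the table follow from the error budget definition, (1 - SLO) multiplied by the length of the window. A minimal sketch, assuming a 30-day window, which is what the table's figures correspond to; the function names are hypothetical.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_used_fraction(downtime_minutes: float, slo_percent: float) -> float:
    """Share of the error budget consumed by the observed downtime."""
    return downtime_minutes / error_budget_minutes(slo_percent)


for slo in (99.95, 99.9, 99.5, 99.99):
    print(f"{slo}% -> {error_budget_minutes(slo):.1f} min")

# Customer API in January: 8.2 minutes of downtime against a 99.95% SLO.
print(f"Customer API budget used: {budget_used_fraction(8.2, 99.95):.0%}")
```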
SLO BURN RATE MONITORING:
1x burn rate: 1 month budget consumed in 1 month (normal)
10x burn rate: 1 month budget consumed in ~3 days (warning)
14x burn rate: 1 month budget consumed in ~2.1 days (alert)
52x burn rate: 1 month budget consumed in ~14 hours (emergency)
Current burn rates (all services): <1x (healthy)
Alert threshold: 10x burn rate (triggers investigation)
Emergency threshold: 52x burn rate (triggers incident)
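Burn rate is the share of error budget consumed divided by the share of the SLO window that has elapsed, so a value below 1x means the budget will outlast the window. The sketch below computes it and applies the 10x and 52x thresholds; the function names and the returned labels are hypothetical.

```python
def burn_rate(budget_used_fraction: float, window_elapsed_fraction: float) -> float:
    """Burn rate = share of error budget used / share of the SLO window elapsed."""
    if window_elapsed_fraction <= 0:
        raise ValueError("window_elapsed_fraction must be positive")
    return budget_used_fraction / window_elapsed_fraction


def classify_burn_rate(rate: float, investigate_at: float = 10.0,
                       incident_at: float = 52.0) -> str:
    """Map a burn rate onto the investigation and incident thresholds."""
    if rate >= incident_at:
        return "incident"
    if rate >= investigate_at:
        return "investigate"
    return "normal"


# Customer API in January: 8.2 of 21.6 budget minutes used over the full window.
rate = burn_rate(8.2 / 21.6, 1.0)
print(f"{rate:.2f}x -> {classify_burn_rate(rate)}")  # 0.38x -> normal
```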
SLA MANAGEMENT (Customer-facing):
Enterprise SLAs (contractual):
API availability: 99.95% (credit: 10% per 0.1% below)
API latency (p99): <200ms (credit: 5% per 50ms above)
Support response: <4 hours (credit: 2% per hour above)
Data durability: 99.999999999% (11 nines — S3 native)
SLA compliance (January 2025):
API availability: 99.98% (above 99.95% SLA) ✓
API latency (p99): 145ms (below 200ms SLA) ✓
Support response: 2.8 hours avg. (below 4h SLA) ✓
Data durability: 100% (no data loss) ✓
SLA credits issued: $0 (all SLAs met)
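The availability credit rule (10% per 0.1% below the SLA) can be sketched as follows. Two details are assumptions, since the contract wording is not given here: each started 0.1 percentage point counts as a full step, and the credit is capped at 100%. Values are passed as strings and handled with `Decimal` to avoid floating-point rounding at the step boundaries; the function name and the sample availability figures are hypothetical.

```python
from decimal import Decimal, ROUND_CEILING

STEP = Decimal("0.1")            # shortfall step, in percentage points
CREDIT_PER_STEP = Decimal("10")  # credit percent per step
MAX_CREDIT = Decimal("100")      # assumption: credit cannot exceed 100%


def availability_credit_percent(measured: str, sla: str = "99.95") -> Decimal:
    """Return the service credit percentage for a measured availability."""
    shortfall = Decimal(sla) - Decimal(measured)
    if shortfall <= 0:
        return Decimal("0")  # SLA met, no credit owed
    # Assumption: every started 0.1-point step counts in full.
    steps = (shortfall / STEP).to_integral_value(rounding=ROUND_CEILING)
    return min(steps * CREDIT_PER_STEP, MAX_CREDIT)


for measured in ("99.96", "99.90", "99.70", "98.00"):
    print(f"{measured}% -> {availability_credit_percent(measured)}% credit")
```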
CAPACITY PLANNING:
Infrastructure utilization (January 2025):
┌──────────────────────────┬──────────┬──────────┬──────────┐
│ Resource │ Current │ Target │ Headroom │
├──────────────────────────┼──────────┼──────────┼──────────┤
│ CPU (servers) │ 45% │ <70% │ 55% │
│ Memory (servers) │ 58% │ <75% │ 42% │
│ Disk (block storage) │ 52% │ <80% │ 48% │
│ Network bandwidth │ 35% │ <70% │ 65% │
│ Database connections │ 48% │ <75% │ 52% │
│ K8s cluster nodes │ 55% │ <70% │ 45% │
│ Cloud budget │ 62% │ <80% │ 38% │
│ ────────────────────── │ ────── │ ────── │ ────── │
│ Status │ Healthy │ Healthy │ Healthy │
└──────────────────────────┴──────────┴──────────┴──────────┘
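The Headroom column in the table is 100% minus current utilization, which is different from the remaining margin before the target ceiling is reached. The sketch below computes both from the table's values; the function name and the `margin_to_target` field are hypothetical additions for illustration.

```python
# (current %, target ceiling %) copied from the utilization table above.
UTILIZATION = {
    "CPU (servers)": (45, 70),
    "Memory (servers)": (58, 75),
    "Disk (block storage)": (52, 80),
    "Network bandwidth": (35, 70),
    "Database connections": (48, 75),
    "K8s cluster nodes": (55, 70),
    "Cloud budget": (62, 80),
}


def assess(current: int, target: int) -> dict:
    """Headroom to full capacity, margin to the target ceiling, and health flag."""
    return {
        "headroom": 100 - current,
        "margin_to_target": target - current,
        "healthy": current < target,
    }


report = {name: assess(current, target) for name, (current, target) in UTILIZATION.items()}

for name, row in report.items():
    status = "Healthy" if row["healthy"] else "Over target"
    print(f"{name:<22} headroom {row['headroom']:>2}%  "
          f"margin {row['margin_to_target']:>2}%  {status}")
```

Memory and K8s cluster nodes have the smallest margin to target in this data (17 and 15 points), even though their headroom to full capacity still looks comfortable.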
Capacity forecasting:
Current growth rate: 15%/quarter (users + data)
12-month projection:
CPU: 45% → 82% (scale needed in ~9 months)
Memory: 58% → 105% (scale needed in ~7 months) ⚠️
Disk: 52% → 95% (scale needed in ~10 months)
Network: 35% → 65% (adequate for 12 months)
Database: 48% → 88% (scale needed in ~8 months)
Scaling strategy:
Vertical scaling (larger instances): Memory, CPU (immediate)
Horizontal scaling (more instances): App tier, K8s nodes
Storage expansion: Auto-scale EBS/Blob (AWS/Azure)
Database: Read replicas + partitioning (planned)
Budget: $50K reserved for Q2 capacity expansion
Capacity review cadence: Quarterly (next: Q1 — March 2025)
Output
Infrastructure Health Dashboard
INFRASTRUCTURE HEALTH DASHBOARD — Jan 2025
═══════════════════════════════════════════
Availability:
Overall uptime: 99.97% (15.8 min downtime total)
Critical services: 100% (all SLOs met)
P1 incidents: 0
P2 incidents: 2 (resolved within SLA)
Mean time to detect: 4.2 minutes
Mean time to resolve: 24 minutes (avg.)
Monitoring:
Assets monitored: 1,293
Agent coverage: 99.1%
Metrics (series): 85K
Logs/minute: 180K
Traces/day: 2.5M
Data gaps: 0.2/day (auto-recovered)
Alerting:
Active monitors: 160
Alerts (January): 1,240
False positive rate: 20.6% (target: <15%)
Auto-remediation: 35% of alerts (92% success)
On-call satisfaction: 4.1/5.0
Reliability:
SLOs defined: 9 (critical services)
SLO compliance: 100% (all met)
Error budget: Healthy (<1x burn rate all)
SLA compliance: 100% (all met)
MTBF: 380 hours (avg.)
MTTR: 24 minutes (avg.)
Capacity:
CPU utilization: 45% (target: <70%) ✓
Memory utilization: 58% (target: <75%) ✓
Disk utilization: 52% (target: <80%) ✓
Network bandwidth: 35% (target: <70%) ✓
12-month forecast: Memory scaling needed (~7 months)
Budget: 62% of annual (on track)
Actions:
1. Alert tuning (reduce FPR from 20.6% to <15%)
2. Capacity planning review (Q1 — March)
3. Memory scaling plan (design — February)
4. SLO review (quarterly — April)
5. Monitoring tool review (annual — Q2)
Integration Points
- Cloud providers (AWS CloudWatch, Azure Monitor, GCP Monitoring): Native metrics
- Container platforms (Kubernetes, Docker, ECS): Resource metrics, events
- Databases (PostgreSQL, MySQL, MongoDB, Redis): Performance, connections
- CI/CD platforms (GitHub Actions, Jenkins, GitLab CI): Pipeline metrics
- Logging platforms (ELK, Datadog Logs, Splunk): Log aggregation, search
- APM tools (Datadog APM, New Relic, AppDynamics): Application performance
- Network monitoring (Datadog Network, PRTG): Network metrics
- Configuration management (Ansible, Terraform): Infrastructure state
- ITSM platforms (ServiceNow, Jira): Incident, change management
- Communication (PagerDuty, Slack, Teams): Alerting, escalation
- CMDB: Asset inventory, relationship mapping
- Cost management (CloudHealth, AWS Cost Explorer): Budget tracking
Edge Cases
- Cascading failure: Cross-service correlation; circuit breaker activation; blast radius containment; post-mortem
- Monitoring blackout (agent failure): Fallback monitoring; synthetic checks; manual verification; agent reinstall
- Alert storm: Dead band enforcement; alert grouping; rate limiting; on-call support; root cause focus
- SLO breach imminent: Error budget alert; performance optimization; capacity scaling; communication
- Data center outage: Multi-AZ/region failover; DNS failover; backup activation; customer communication
- Capacity spike (unexpected): Auto-scaling activation; burst capacity; cost monitoring; load shedding
- Monitoring tool outage: Backup tool activation; manual health checks; vendor support; SLA credit tracking
- Noisy neighbor (shared infra): Resource isolation; limit enforcement; tenant migration; performance analysis
- Log volume spike: Log sampling; cost alert; retention adjustment; root cause analysis; capacity review
- Metric cardinality explosion: Tag limiting; metric tiering; cost optimization; cleanup