Observability Monitoring
Design and implement comprehensive observability including metrics, logs, traces, and alerting. Set up monitoring dashboards, APM, distributed tracing, log aggregation, and intelligent alerting. Use when building monitoring systems, implementing observabili...
Observability & Monitoring
Design and implement comprehensive observability systems for infrastructure, applications, and business metrics.
Workflow
1. Observability Architecture (Three Pillars)
OBSERVABILITY STACK
═══════════════════════════════════════
PILLAR 1: METRICS (What happened?)
═══════════════════════════════════════
→ Collection: Prometheus, Node Exporter, cAdvisor
→ Application metrics: OpenTelemetry SDK, custom exporters
→ Storage: Prometheus TSDB, Thanos (long-term), Cortex
→ Dashboards: Grafana
→ Alerting: Alertmanager, PagerDuty integration
PILLAR 2: LOGS (Why did it happen?)
═══════════════════════════════════════
→ Collection: Fluent Bit (sidecar), Filebeat, Vector
→ Processing: Log parsing, enrichment, correlation
→ Storage: Elasticsearch, Loki, CloudWatch
→ Search/Analysis: Kibana, Grafana Loki Explore
PILLAR 3: TRACES (Where did it happen?)
═══════════════════════════════════════
→ Instrumentation: OpenTelemetry Auto-Instrumentation
→ Export: OTLP → Jaeger, Tempo, X-Ray
→ Sampling: Head-based (10%) + Tail-based (hot paths)
→ Analysis: Distributed trace waterfall, service dependency map
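The combined sampling policy can be sketched in a few lines. This is a minimal illustration, not the OpenTelemetry sampler API: the function names are hypothetical, and the tail-based criteria (keep errors and slow traces, with a 200 ms cutoff) are assumptions standing in for "hot paths".

```python
import hashlib

HEAD_SAMPLE_RATE = 0.10  # 10% head-based rate from the stack above


def head_sample(trace_id: str, rate: float = HEAD_SAMPLE_RATE) -> bool:
    """Decide at trace start, deterministically from the trace ID.

    Every service hashing the same trace ID reaches the same decision,
    so a trace is either kept everywhere or dropped everywhere.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def tail_sample(has_error: bool, duration_ms: float, slow_ms: float = 200.0) -> bool:
    """Decide after the trace completes (assumed criteria: errors or slow)."""
    return has_error or duration_ms >= slow_ms


def keep_trace(trace_id: str, has_error: bool, duration_ms: float) -> bool:
    """Keep a trace if either policy selects it."""
    return head_sample(trace_id) or tail_sample(has_error, duration_ms)
```

In practice the tail-based decision needs every span of a trace buffered in one place (typically a collector), so it costs more memory than the head-based check.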
CORRELATION:
═══════════════════════════════════════
→ Trace ID propagated across logs and traces; metrics link to traces through exemplars
→ Grafana unified view: Click metric → See related logs → Follow trace
→ Correlation attributes: request_id, user_id, session_id
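One way to get the trace ID onto every log line is to hold it in a context variable and attach it with a logging filter. The sketch below uses only the Python standard library; `trace_id_var`, `TraceContextFilter`, and `build_logger` are hypothetical names, and a real service would read the ID from its tracing SDK instead of setting it by hand.

```python
import contextvars
import logging

# Holds the trace ID for the current request; "-" means no active trace.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceContextFilter(logging.Filter):
    """Copy the current trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True  # never drop the record


def build_logger(stream) -> logging.Logger:
    """Create a logger whose output includes trace_id=<id>."""
    logger = logging.getLogger("correlation-demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceContextFilter())
    logger.addHandler(handler)
    return logger
```

Setting `trace_id_var` at the start of request handling (and resetting it at the end) is enough for every log call in that request to carry the same ID.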
2. SLO/SLI Framework
SERVICE LEVEL OBJECTIVES (SLOs)
═══════════════════════════════════════
API Gateway SLOs:
═══════════════════════════════════════
SLO                 Target    Measurement                   Error Budget (monthly)
─────────────────────────────────────────────────────────────────────────────
Availability        99.95%    Uptime / Total time           21.6 minutes
Latency (P95)       < 200ms   Requests under 200ms / Total  5% of requests
Error Rate          < 0.1%    5xx / Total requests          0.1% of requests
Auth Latency (P99)  < 50ms    Requests under 50ms / Total   1% of requests
Error Budget Calculation:
═══════════════════════════════════════
Monthly error budget (Availability):
→ Target: 99.95%
→ Allowed downtime: (1 - 0.9995) × 30 × 24 × 60 = 21.6 minutes
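The same arithmetic as a small helper, which makes it easy to check other targets or window lengths. The function names are illustrative only.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime budget in minutes for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_consumed_fraction(
    downtime_minutes: float, slo_target: float, window_days: int = 30
) -> float:
    """Share of the error budget used by the observed downtime (0.0 to 1.0+)."""
    return downtime_minutes / allowed_downtime_minutes(slo_target, window_days)


# A 99.95% target over 30 days allows 0.0005 * 43,200 min, about 21.6 minutes.
monthly_budget = allowed_downtime_minutes(0.9995)
```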
Error budget burn rate:
═══════════════════════════════════════
→ Current month: 8.0 minutes consumed (37% of the 21.6-minute budget)
→ Burn rate: Low (green)
→ Alert thresholds:
· FAST burn (14.4x): 2% of budget in 1 hour → Page
· HIGH burn (6x): 5% of budget in 6 hours → Page
· MODERATE burn (1x): 10% of budget in 3 days → Warning
· SLOW burn (0.5x): more than 80% of budget consumed → Warning
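Burn rate and budget consumption are tied together by one formula: the fraction of budget consumed equals burn rate × alert window ÷ SLO period. The sketch below assumes a 30-day SLO period; the function names are hypothetical.

```python
def budget_fraction_consumed(
    burn_rate: float, window_hours: float, period_days: int = 30
) -> float:
    """Fraction of the period's error budget used at this burn rate."""
    return burn_rate * window_hours / (period_days * 24)


def observed_burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(burn_rate: float, period_days: int = 30) -> float:
    """Hours until a full budget is gone at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return period_days * 24 / burn_rate
```

For example, a 14.4x burn sustained for 1 hour uses 14.4 / 720 of a 30-day budget, and at that rate the whole budget lasts 50 hours.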
SLO DASHBOARD:
═══════════════════════════════════════
→ 28-day rolling window for all SLOs
→ Error budget remaining (visual gauge)
→ Burn rate (current vs threshold)
→ Trend: Improving, stable, or degrading
→ Post-incident: Impact on budget
3. Alerting Strategy
ALERTING FRAMEWORK
═══════════════════════════════════════
ALERT CLASSIFICATION:
═══════════════════════════════════════
Class  Response Time      Escalation       Examples
─────────────────────────────────────────────────────────────────────
P1     15 minutes         Page + Escalate  Service down, data loss
P2     30 minutes         Page             Major degradation, error spike
P3     4 hours            Notify           Warning trend, capacity warning
P4     Next business day  Log only         Informational, trend monitoring
ALERT ROUTING:
═══════════════════════════════════════
→ Alertmanager routes by labels:
team: { infrastructure, application, database, network, security }
severity: { critical, warning, info }
environment: { production, staging, development }
→ Routing rules:
severity=critical + environment=production → PagerDuty (on-call)
severity=warning + environment=production → Slack (#alerts)
severity=info → Email digest (daily)
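The routing rules above amount to a first-match lookup on alert labels. The following sketch models that logic in plain Python; it is not Alertmanager configuration, and the receiver names and the `default` fallback are assumptions.

```python
from typing import Mapping

# Ordered (matchers, receiver) pairs; the first route whose matchers all
# equal the alert's labels wins. Receiver names are hypothetical.
ROUTES = [
    ({"severity": "critical", "environment": "production"}, "pagerduty-oncall"),
    ({"severity": "warning", "environment": "production"}, "slack-alerts"),
    ({"severity": "info"}, "email-digest"),
]
DEFAULT_RECEIVER = "default"  # assumed fallback for unmatched alerts


def route_alert(labels: Mapping[str, str]) -> str:
    """Return the receiver for an alert based on its labels."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER
```

Note that a critical alert from staging matches none of the three rules, which is why an explicit fallback receiver is worth defining.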
ALERT BEST PRACTICES:
═══════════════════════════════════════
→ Every alert must be actionable
→ No alerts without runbook
→ Alert fatigue prevention:
· Deduplication (group similar alerts)
· Inhibition (suppress dependent alerts)
· Silence windows (maintenance, known issues)
· Minimum repeat interval: 5 minutes between re-notifications for the same alert
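Deduplication and the 5-minute re-alert limit can be illustrated with a small throttle keyed on a group of labels. This is a simplified model under assumed grouping labels (`alertname`, `team`, `environment`); `AlertThrottle` is a hypothetical class, not part of any alerting tool.

```python
from dataclasses import dataclass, field
from typing import Dict, Mapping, Tuple

REPEAT_INTERVAL_S = 5 * 60  # 5 minutes between notifications per group


@dataclass
class AlertThrottle:
    repeat_interval_s: float = REPEAT_INTERVAL_S
    _last_sent: Dict[Tuple, float] = field(default_factory=dict, repr=False)

    def should_notify(self, labels: Mapping[str, str], now: float) -> bool:
        """Return True if this alert group may notify at time `now` (seconds)."""
        # Alerts sharing these labels are treated as one group (assumption).
        key = (labels.get("alertname"), labels.get("team"), labels.get("environment"))
        last = self._last_sent.get(key)
        if last is not None and now - last < self.repeat_interval_s:
            return False  # duplicate within the repeat interval: suppress
        self._last_sent[key] = now
        return True
```

Passing the timestamp in explicitly keeps the logic easy to test; a caller would normally supply `time.time()`.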
RUNBOOK TEMPLATE:
═══════════════════════════════════════
Alert: High Error Rate (>0.5% 5xx)
═══════════════════════════════════════
Symptom: 5xx error rate exceeds threshold
Impact: User-facing errors, potential revenue loss
Diagnostic steps:
1. Check Grafana dashboard: "API Error Rate"
2. Check recent rollouts (last 2 hours): `kubectl get replicasets -n production --sort-by=.metadata.creationTimestamp`
3. Check pod status: `kubectl get pods -n production`
4. Check logs: `kubectl logs <pod> -n production --tail=100 | grep ERROR`
5. Check downstream dependencies: Database, Redis, external APIs
Resolution:
→ If deployment-related: Rollback (`kubectl rollout undo deployment/<name> -n production`)
→ If pod crash: Scale up (`kubectl scale deployment/<name> -n production --replicas=5`)
→ If database: Check connection pool, restart database proxy
→ If external: Enable circuit breaker, fall back to cached response
Escalation: If unresolved after 15 minutes → Page senior engineer
4. APM Configuration
APPLICATION PERFORMANCE MONITORING
═══════════════════════════════════════
INSTRUMENTATION (OpenTelemetry):
═══════════════════════════════════════
Auto-instrumentation (zero-code):
→ Java: Java Agent (opentelemetry-javaagent)
→ Python: opentelemetry-instrument
→ Node.js: @opentelemetry/auto-instrumentations-node
→ .NET: OpenTelemetry .NET Auto-Instrumentation
→ Go: Manual (wrap handlers, middleware)
Custom instrumentation:
→ Business transactions (checkout, signup, payment)
→ Database queries (slow query detection)
→ External API calls (latency, error rate)
→ Cache hits/misses (Redis, Memcached)
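The shape of custom instrumentation is the same whatever the backend: wrap the operation, record its duration, and count failures. The sketch below uses in-memory dictionaries as a stand-in for an OpenTelemetry meter or tracer, so it runs with the standard library alone; `measured`, `durations_ms`, and `error_counts` are hypothetical names.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# In-memory stand-ins for a metrics backend (assumption for illustration).
durations_ms = defaultdict(list)  # operation name -> recorded durations
error_counts = defaultdict(int)   # operation name -> number of failures


@contextmanager
def measured(operation: str):
    """Time a block of code and count any exception it raises."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        error_counts[operation] += 1
        raise  # instrumentation must not swallow the error
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        durations_ms[operation].append(elapsed_ms)
```

A business transaction would then be wrapped as `with measured("checkout"): ...`, and the same wrapper works for database queries or external API calls.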
SERVICE MAP:
═══════════════════════════════════════
web-frontend → api-gateway → auth-service → user-db
                           → order-service → payment-db
                                           → payment-gateway (external)
                           → inventory-service → redis-cache
                           → notification-service → email-svc (external)
Dependencies auto-discovered from traces
→ Call volume, latency, error rate per edge
→ Bottleneck identification
→ Architecture validation
PERFORMANCE BASELINES:
═══════════════════════════════════════
Service            Avg Latency  P95    P99    Error Rate  Throughput
───────────────────────────────────────────────────────────────────────────────
API Gateway        25ms         85ms   150ms  0.02%       500 rps
Auth Service       15ms         45ms   80ms   0.01%       300 rps
Order Service      35ms         120ms  200ms  0.05%       150 rps
Inventory Service  10ms         30ms   55ms   0.01%       400 rps
Database Query     5ms          15ms   25ms   0.01%       1200 qps
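Baseline columns like these are derived from raw latency samples. As a rough sketch, the helper below computes the average, P95, and P99 with the nearest-rank method; production systems usually estimate percentiles from histograms instead, and the function names here are illustrative.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Average, P95, and P99 for one service's latency samples."""
    return {
        "avg": sum(latencies_ms) / len(latencies_ms),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }
```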
5. Log Management
LOG MANAGEMENT STRATEGY
═══════════════════════════════════════
LOG STRUCTURE (JSON):
═══════════════════════════════════════
{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "ERROR",
  "service": "api-gateway",
  "version": "2.1.0",
  "trace_id": "abc123def456",
  "span_id": "789xyz",
  "message": "Payment processing failed",
  "request_id": "req_12345",
  "user_id": "user_67890",
  "error": {
    "type": "TimeoutException",
    "message": "Payment gateway timeout after 5000ms",
    "stack_trace": "..."
  },
  "context": {
    "endpoint": "/api/v2/payments",
    "method": "POST",
    "ip": "192.168.1.100",
    "duration_ms": 5023
  }
}
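A log structure like this can be produced with a custom formatter on top of the standard `logging` module. The sketch below is one possible approach, not a prescribed implementation; `JsonFormatter` is a hypothetical class, and it expects callers to attach `trace_id`, `span_id`, and `context` to the record (for example through the `extra` argument).

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str, version: str):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "version": self.version,
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["error"] = {
                "type": record.exc_info[0].__name__,
                "message": str(record.exc_info[1]),
                "stack_trace": self.formatException(record.exc_info),
            }
        context = getattr(record, "context", None)
        if context:
            payload["context"] = context
        return json.dumps(payload)
```

Attaching it is a matter of `handler.setFormatter(JsonFormatter("api-gateway", "2.1.0"))` on whichever handler writes to stdout.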
LOG AGGREGATION PIPELINE:
═══════════════════════════════════════
1. Collection: Fluent Bit (sidecar or daemonset)
2. Parsing: JSON parse, field extraction
3. Enrichment: Add cluster, namespace, pod labels
4. Filtering: Drop debug logs in production, redact PII
5. Routing:
→ Error logs → Elasticsearch (indexed for fast search)
→ Info logs → Loki (cost-efficient)
→ Audit logs → S3 (compliance, 7-year retention)
6. Indexing: Time-based indices (daily)
7. Retention:
→ Hot: 7 days (Elasticsearch)
→ Warm: 30 days (Loki)
→ Cold: 1 year (S3)
→ Archive: 7 years (S3 Glacier, compliance)
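The filtering and routing steps of the pipeline can be modeled as a single function over a parsed log record. This is a toy sketch, not Fluent Bit configuration: the email-only redaction pattern, the `audit` flag, and sending WARNING logs to Loki are all assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

# Very rough PII pattern covering email addresses only (assumption).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def process(record: dict, environment: str = "production") -> Optional[Tuple[str, dict]]:
    """Filter, redact, and route one record.

    Returns (destination, cleaned_record), or None if the record is dropped.
    """
    level = str(record.get("level", "INFO")).upper()

    # Filtering: drop debug logs in production.
    if environment == "production" and level == "DEBUG":
        return None

    # Redaction: mask email addresses in the message.
    cleaned = dict(record)
    cleaned["message"] = EMAIL_RE.sub("[REDACTED]", str(record.get("message", "")))

    # Routing: audit logs to S3, errors to Elasticsearch, the rest to Loki.
    if cleaned.get("audit"):
        destination = "s3"
    elif level in ("ERROR", "CRITICAL"):
        destination = "elasticsearch"
    else:
        destination = "loki"
    return destination, cleaned
```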
LOG ANALYSIS:
═══════════════════════════════════════
→ Error patterns: Group by error type (identify common issues)
→ Anomaly detection: Spike in error rate
→ Correlation: Link logs to traces (trace_id)
→ Alerting: Error rate threshold, pattern match
→ Audit: Access logs, admin actions, data changes
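Two of these analyses are simple enough to sketch directly: grouping errors by type and flagging a spike against a baseline. The record shape follows the JSON log structure above; the 3x spike factor and the function names are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def top_error_types(records: Iterable[dict], n: int = 3) -> List[Tuple[str, int]]:
    """Most common error types among ERROR-level records."""
    counts = Counter(
        r["error"]["type"]
        for r in records
        if r.get("level") == "ERROR" and "error" in r
    )
    return counts.most_common(n)


def is_error_spike(
    current_rate: float, baseline_rates: Sequence[float], factor: float = 3.0
) -> bool:
    """True if the current error rate exceeds `factor` times the baseline mean."""
    if not baseline_rates:
        return False  # no history yet, so nothing to compare against
    baseline = sum(baseline_rates) / len(baseline_rates)
    return current_rate > factor * baseline
```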
Edge Cases
- High volume: Multi-tenancy, sampling, log rotation, cost optimization
- Regulated industries: Log retention requirements, access controls, audit trails
- Multi-region: Centralized vs distributed logging
- Cost management: Log volume control, tiered storage
- Real-time: Streaming logs for incident response
Integration Points
- Metrics: Prometheus, Datadog, New Relic, CloudWatch
- Logs: Elasticsearch, Loki, Splunk, CloudWatch Logs
- Traces: Jaeger, Tempo, X-Ray, Zipkin
- APM: Datadog APM, New Relic, Dynatrace, AppDynamics
- Alerting: Alertmanager, PagerDuty, Opsgenie
- Dashboards: Grafana, Kibana, Datadog Dashboards
Output
Observability Status
OBSERVABILITY STATUS — Production
═══════════════════════════════════════
Services monitored: 28 (100% instrumented)
SLO compliance:
Availability: 99.98% (target: 99.95%) ✓
P95 Latency: 82ms (target: <200ms) ✓
Error Rate: 0.03% (target: <0.1%) ✓
Error budget: 63% remaining (on track)
Active alerts: 0
Pending investigations: 1 (minor latency trend)
Logs: 2.4TB/day (within budget)
Traces: 10M spans/day (10% sampling)