---
name: observability-monitoring
description: Design and implement comprehensive observability including metrics, logs, traces, and alerting. Set up monitoring dashboards, APM, distributed tracing, log aggregation, and intelligent alerting. Use when building monitoring systems, implementing observability, configuring APM, setting up alerting, or troubleshooting with distributed traces. Triggers on phrases like "observability", "monitoring", "APM", "distributed tracing", "log aggregation", "metrics", "alerting", "OpenTelemetry", "Prometheus", "Grafana", "ELK", "trace", "span", "SLO", "error budget", "runbook", "on-call", "pager duty".
---

# Observability & Monitoring

Design and implement comprehensive observability systems for infrastructure, applications, and business metrics.

## Workflow

### 1. Observability Architecture (Three Pillars)

```
OBSERVABILITY STACK
═══════════════════════════════════════

PILLAR 1: METRICS (What happened?)
═══════════════════════════════════════

  → Collection: Prometheus, Node Exporter, cAdvisor
  → Application metrics: OpenTelemetry SDK, custom exporters
  → Storage: Prometheus TSDB, Thanos (long-term), Cortex
  → Dashboards: Grafana
  → Alerting: Alertmanager, PagerDuty integration

PILLAR 2: LOGS (Why did it happen?)
═══════════════════════════════════════

  → Collection: Fluent Bit (sidecar), Filebeat, Vector
  → Processing: Log parsing, enrichment, correlation
  → Storage: Elasticsearch, Loki, CloudWatch
  → Search/Analysis: Kibana, Grafana Loki Explore

PILLAR 3: TRACES (Where did it happen?)
═══════════════════════════════════════

  → Instrumentation: OpenTelemetry Auto-Instrumentation
  → Export: OTLP → Jaeger, Tempo, X-Ray
  → Sampling: Head-based (10% baseline) + Tail-based (errors, slow requests, hot paths)
  → Analysis: Distributed trace waterfall, service dependency map

CORRELATION:
═══════════════════════════════════════

  → Trace ID propagated across logs and traces; linked from metrics via exemplars
  → Grafana unified view: Click metric → See related logs → Follow trace
  → Correlation attributes: request_id, user_id, session_id
```
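The correlation step depends on every service reusing the same trace ID. A minimal sketch of that propagation follows, assuming the W3C `traceparent` header format (`version-trace_id-span_id-flags`). The helper names (`new_traceparent`, `child_traceparent`, `correlated_log`) are hypothetical and only illustrate the idea; in practice an OpenTelemetry SDK handles this.

```python
import json
import secrets


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: 32-hex trace ID, 16-hex span ID, 2-hex flags."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its fields, validating lengths."""
    parts = header.split("-")
    if len(parts) != 4:
        raise ValueError("traceparent must have four dash-separated fields")
    _version, trace_id, span_id, flags = parts
    if len(trace_id) != 32 or len(span_id) != 16 or len(flags) != 2:
        raise ValueError("malformed traceparent field length")
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 1),  # lowest flag bit = sampled
    }


def child_traceparent(parent: str) -> str:
    """Header for a downstream call: same trace ID, fresh span ID."""
    ctx = parse_traceparent(parent)
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"


def correlated_log(message: str, header: str, **fields) -> str:
    """Emit a JSON log line carrying the trace and span IDs."""
    ctx = parse_traceparent(header)
    entry = {"message": message, "trace_id": ctx["trace_id"],
             "span_id": ctx["span_id"], **fields}
    return json.dumps(entry)


if __name__ == "__main__":
    incoming = new_traceparent()
    outgoing = child_traceparent(incoming)
    print(correlated_log("calling order-service", outgoing, request_id="req_12345"))
```

Because the trace ID survives every hop, a log search on `trace_id` returns the lines from all services that handled the request.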

### 2. SLO/SLI Framework

```
SERVICE LEVEL OBJECTIVES (SLOs)
═══════════════════════════════════════

API Gateway SLOs:
═══════════════════════════════════════

SLO                    Target    Measurement                Error Budget (monthly)
──────────────────────────────────────────────────────────────────────────────────
Availability           99.95%    Uptime / Total time        21.6 minutes
Latency (P95)          < 200ms   Requests < 200ms / Total   5% of requests
Error Rate             < 0.1%    5xx / Total requests       0.1% of requests
Auth Latency (P99)     < 50ms    Requests < 50ms / Total    1% of requests

Error Budget Calculation:
═══════════════════════════════════════

Monthly error budget (Availability):
  → Target: 99.95%
  → Allowed downtime: (1 - 0.9995) × 30 × 24 × 60 = 21.6 minutes

Error budget burn rate:
═══════════════════════════════════════

  → Current month: 8.0 minutes consumed (37% of budget)
  → Burn rate: Low (green)
  → Alert thresholds (30-day budget):
     · FAST burn (14.4x): 2% budget in 1 hour → Page
     · HIGH burn (6x): 5% budget in 6 hours → Page
     · MODERATE burn (1x): 10% budget in 3 days → Warning
     · SLOW burn (below 1x): more than 80% of budget consumed → Warning

SLO DASHBOARD:
═══════════════════════════════════════

  → 30-day rolling window for all SLOs (same period as the error budget)
  → Error budget remaining (visual gauge)
  → Burn rate (current vs threshold)
  → Trend: Improving, stable, or degrading
  → Post-incident: Impact on budget
```
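The error budget and burn-rate arithmetic can be checked with a few lines of code. The sketch below assumes a 30-day window and an availability-style SLO; the function names are illustrative, not part of any monitoring product. For a 99.95% target over 30 days the formula gives 21.6 minutes, and a 14.4x burn sustained for one hour consumes 2% of that budget.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability target over the window."""
    return (1 - slo_target) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return observed_error_ratio / (1 - slo_target)


def budget_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the window's budget used by burning at `rate` for `hours`."""
    return rate * hours / (window_days * 24)


def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float) -> bool:
    """Multi-window check: page only if both windows exceed the threshold,
    so an alert stops firing soon after the burn stops."""
    return long_window_rate >= threshold and short_window_rate >= threshold


if __name__ == "__main__":
    print(f"budget: {error_budget_minutes(0.9995):.1f} min")
    for rate, hours in [(14.4, 1), (6, 6), (1, 72)]:
        print(f"{rate}x for {hours}h -> {budget_consumed(rate, hours):.0%} of budget")
```

Comparing a long and a short window is one common way to keep fast-burn pages from lingering after recovery; the exact windows are a tuning choice.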

### 3. Alerting Strategy

```
ALERTING FRAMEWORK
═══════════════════════════════════════

ALERT CLASSIFICATION:
═══════════════════════════════════════

Class    Response Time       Escalation         Examples
─────────────────────────────────────────────────────────────────────────
P1       15 minutes          Page + Escalate    Service down, data loss
P2       30 minutes          Page               Major degradation, error spike
P3       4 hours             Notify             Warning trend, capacity warning
P4       Next business day   Log only           Informational, trend monitoring

ALERT ROUTING:
═══════════════════════════════════════

  → Alertmanager routes by labels:
     team: { infrastructure, application, database, network, security }
     severity: { critical, warning, info }
     environment: { production, staging, development }

  → Routing rules:
     severity=critical + environment=production → PagerDuty (on-call)
     severity=warning + environment=production → Slack (#alerts)
     severity=info → Email digest (daily)

ALERT BEST PRACTICES:
═══════════════════════════════════════

  → Every alert must be actionable
  → No alert without a linked runbook
  → Alert fatigue prevention:
     · Deduplication (group similar alerts)
     · Inhibition (suppress dependent alerts)
     · Silence windows (maintenance, known issues)
     · Minimum repeat interval: 5 minutes between re-notifications

RUNBOOK TEMPLATE:
═══════════════════════════════════════

Alert: High Error Rate (>0.5% 5xx)
═══════════════════════════════════════

  Symptom: 5xx error rate exceeds threshold
  Impact: User-facing errors, potential revenue loss

  Diagnostic steps:
  1. Check Grafana dashboard: "API Error Rate"
  2. Check recent rollouts (last 2 hours): `kubectl get replicasets -n production --sort-by=.metadata.creationTimestamp`
  3. Check pod status: `kubectl get pods -n production`
  4. Check logs: `kubectl logs <pod> -n production --tail=100 | grep ERROR`
  5. Check downstream dependencies: Database, Redis, external APIs

  Resolution:
  → If deployment-related: Rollback (`kubectl rollout undo deployment/<name> -n production`)
  → If pods are crashing under load: Scale up (`kubectl scale deployment/<name> --replicas=5 -n production`)
  → If database: Check connection pool, restart database proxy
  → If external: Enable circuit breaker, fall back to cached response

  Escalation: If unresolved after 15 minutes → Page senior engineer
```
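The label-based routing rules and the re-notification interval can be expressed as a small decision function. This is a sketch of the logic only, not Alertmanager's configuration format; the receiver names (`pagerduty-oncall`, `slack-alerts`, `email-digest`) and the default receiver are assumptions made for illustration.

```python
# Ordered routing table: first rule whose matchers all match wins.
ROUTES = [
    ({"severity": "critical", "environment": "production"}, "pagerduty-oncall"),
    ({"severity": "warning", "environment": "production"}, "slack-alerts"),
    ({"severity": "info"}, "email-digest"),
]
DEFAULT_RECEIVER = "email-digest"  # assumption: unmatched alerts go to the digest
REPEAT_INTERVAL_SECONDS = 5 * 60   # minimum gap between re-notifications


def route(labels: dict) -> str:
    """Return the receiver for an alert based on its labels."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER


def group_key(labels: dict) -> tuple:
    """Deduplication key: alerts sharing name, team, and environment
    are grouped into one notification."""
    return (labels.get("alertname"), labels.get("team"), labels.get("environment"))


def should_notify(last_sent, now: float) -> bool:
    """Suppress repeats that arrive inside the repeat interval."""
    return last_sent is None or now - last_sent >= REPEAT_INTERVAL_SECONDS


if __name__ == "__main__":
    alert = {"alertname": "HighErrorRate", "team": "application",
             "severity": "critical", "environment": "production"}
    print(route(alert), group_key(alert))
```

Keeping the rules in an ordered list mirrors the routing table above: the most specific production rules come first, and anything unmatched falls through to a low-noise default.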

### 4. APM Configuration

```
APPLICATION PERFORMANCE MONITORING
═══════════════════════════════════════

INSTRUMENTATION (OpenTelemetry):
═══════════════════════════════════════

  Auto-instrumentation (zero-code):
    → Java: Java Agent (opentelemetry-javaagent)
    → Python: opentelemetry-instrument
    → Node.js: @opentelemetry/auto-instrumentations-node
    → .NET: OpenTelemetry .NET Auto-Instrumentation
    → Go: Manual (wrap handlers, middleware)

  Custom instrumentation:
    → Business transactions (checkout, signup, payment)
    → Database queries (slow query detection)
    → External API calls (latency, error rate)
    → Cache hits/misses (Redis, Memcached)

SERVICE MAP:
═══════════════════════════════════════

  web-frontend → api-gateway → auth-service → user-db
                             → order-service → payment-db
                                             → payment-gateway (external)
                             → inventory-service → redis-cache
                             → notification-service → email-svc (external)

  Dependencies auto-discovered from traces
  → Call volume, latency, error rate per edge
  → Bottleneck identification
  → Architecture validation

PERFORMANCE BASELINES:
═══════════════════════════════════════

Service            Avg Latency   P95       P99       Error Rate   Throughput
───────────────────────────────────────────────────────────────────────────────
API Gateway        25ms          85ms      150ms     0.02%        500 rps
Auth Service       15ms          45ms      80ms      0.01%        300 rps
Order Service      35ms          120ms     200ms     0.05%        150 rps
Inventory Service  10ms          30ms      55ms      0.01%        400 rps
Database Query     5ms           15ms      25ms      0.01%        1200 qps
```
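Baselines like these are usually derived from raw latency samples and then used to flag regressions. The sketch below uses the nearest-rank percentile method and a 20% tolerance; both the method and the tolerance are assumptions chosen for illustration, and real APM backends typically compute percentiles from histograms instead.

```python
import math


def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms: list) -> dict:
    """Compute the baseline columns for one service."""
    return {
        "avg": sum(latencies_ms) / len(latencies_ms),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


def exceeds_baseline(observed_ms: float, baseline_ms: float,
                     tolerance: float = 0.20) -> bool:
    """True when the observed value is more than `tolerance` above the baseline."""
    return observed_ms > baseline_ms * (1 + tolerance)


if __name__ == "__main__":
    samples = [float(ms) for ms in range(1, 101)]  # synthetic data: 1..100 ms
    stats = summarize(samples)
    print(stats)
    # Compare against a P95 baseline of 85 ms.
    print("regression:", exceeds_baseline(stats["p95"], 85.0))
```

Comparing against a stored baseline rather than a fixed threshold lets each service keep its own definition of normal.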

### 5. Log Management

```
LOG MANAGEMENT STRATEGY
═══════════════════════════════════════

LOG STRUCTURE (JSON):
═══════════════════════════════════════

{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "ERROR",
  "service": "api-gateway",
  "version": "2.1.0",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "message": "Payment processing failed",
  "request_id": "req_12345",
  "user_id": "user_67890",
  "error": {
    "type": "TimeoutException",
    "message": "Payment gateway timeout after 5000ms",
    "stack_trace": "..."
  },
  "context": {
    "endpoint": "/api/v2/payments",
    "method": "POST",
    "ip": "192.168.1.100",
    "duration_ms": 5023
  }
}

LOG AGGREGATION PIPELINE:
═══════════════════════════════════════

  1. Collection: Fluent Bit (sidecar or daemonset)
  2. Parsing: JSON parse, field extraction
  3. Enrichment: Add cluster, namespace, pod labels
  4. Filtering: Drop debug logs in production, redact PII
  5. Routing:
     → Error logs → Elasticsearch (full-text search, hot tier)
     → Info logs → Loki (cost-efficient)
     → Audit logs → S3 (compliance, 7-year retention)
  6. Indexing: Time-based indices (daily)
  7. Retention:
     → Hot: 7 days (Elasticsearch)
     → Warm: 30 days (Loki)
     → Cold: 1 year (S3)
     → Archive: 7 years (S3 Glacier, compliance)

LOG ANALYSIS:
═══════════════════════════════════════

  → Error patterns: Group by error type (identify common issues)
  → Anomaly detection: Spike in error rate
  → Correlation: Link logs to traces (trace_id)
  → Alerting: Error rate threshold, pattern match
  → Audit: Access logs, admin actions, data changes
```
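On the application side, the JSON structure and the PII-redaction step can be produced with the standard `logging` module. This is a minimal sketch under stated assumptions: the `JsonFormatter` class, the `redact` helper, and the email-only regex are illustrative, and production redaction normally happens in the collector as well and covers more PII types.

```python
import json
import logging
import re
from datetime import datetime, timezone

# Assumption: email addresses stand in for "PII" in this sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Replace anything that looks like an email address."""
    return EMAIL_RE.sub("[REDACTED]", text)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service: str, version: str):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "version": self.version,
            # trace_id / span_id arrive via `extra=` on the logging call
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": redact(record.getMessage()),
        }
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter("api-gateway", "2.1.0"))
    logger = logging.getLogger("api-gateway")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)  # debug logs dropped, as in the filtering step
    logger.error("Payment failed for %s", "jane@example.com",
                 extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
                        "span_id": "00f067aa0ba902b7"})
```

Emitting JSON at the source means the collector's parsing step is a plain JSON decode, with no per-service regex needed.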

## Edge Cases

- **High volume**: Multi-tenancy, sampling, log rotation, cost optimization
- **Regulated industries**: Log retention requirements, access controls, audit trails
- **Multi-region**: Centralized vs distributed logging
- **Cost management**: Log volume control, tiered storage
- **Real-time**: Streaming logs for incident response

## Integration Points

- **Metrics**: Prometheus, Datadog, New Relic, CloudWatch
- **Logs**: Elasticsearch, Loki, Splunk, CloudWatch Logs
- **Traces**: Jaeger, Tempo, X-Ray, Zipkin
- **APM**: Datadog APM, New Relic, Dynatrace, AppDynamics
- **Alerting**: Alertmanager, PagerDuty, Opsgenie
- **Dashboards**: Grafana, Kibana, Datadog Dashboards

## Output

### Observability Status

```
OBSERVABILITY STATUS — Production
═══════════════════════════════════════

Services monitored: 28 (100% instrumented)
SLO compliance:
  Availability: 99.98% (target: 99.95%) ✓
  P95 Latency: 82ms (target: <200ms) ✓
  Error Rate: 0.03% (target: <0.1%) ✓

Error budget: 63% remaining (on track)
Active alerts: 0
Pending investigations: 1 (minor latency trend)

Logs: 2.4TB/day (within budget)
Traces: 10M spans/day (10% sampling)
```
