IT AI Skill

Observability Monitoring

Design and implement comprehensive observability including metrics, logs, traces, and alerting. Set up monitoring dashboards, APM, distributed tracing, log aggregation, and intelligent alerting. Use when building monitoring systems, implementing observability, configuring APM, setting up alerting, or troubleshooting with distributed traces. Triggers on phrases like "observability", "monitoring", "APM", "distributed tracing", "log aggregation", "metrics", "alerting", "OpenTelemetry", "Prometheus", "Grafana", "ELK", "trace", "span", "SLO", "error budget", "runbook", "on-call", "pager duty".

Observability & Monitoring

Design and implement comprehensive observability systems for infrastructure, applications, and business metrics.

Workflow

1. Observability Architecture (Three Pillars)

OBSERVABILITY STACK
═══════════════════════════════════════

PILLAR 1: METRICS (What happened?)
═══════════════════════════════════════

  → Collection: Prometheus, Node Exporter, cAdvisor
  → Application metrics: OpenTelemetry SDK, custom exporters
  → Storage: Prometheus TSDB, Thanos (long-term), Cortex
  → Dashboards: Grafana
  → Alerting: Alertmanager, PagerDuty integration

PILLAR 2: LOGS (Why did it happen?)
═══════════════════════════════════════

  → Collection: Fluent Bit (sidecar), Filebeat, Vector
  → Processing: Log parsing, enrichment, correlation
  → Storage: Elasticsearch, Loki, CloudWatch
  → Search/Analysis: Kibana, Grafana Loki Explore

PILLAR 3: TRACES (Where did it happen?)
═══════════════════════════════════════

  → Instrumentation: OpenTelemetry Auto-Instrumentation
  → Export: OTLP → Jaeger, Tempo, X-Ray
  → Sampling: Head-based (10%) + Tail-based (hot paths)
  → Analysis: Distributed trace waterfall, service dependency map
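
The sampling line above combines two decisions. A minimal sketch of how they might fit together, assuming that "hot paths" means traces containing an error or a slow span (that reading, the `slow_ms` threshold, and the span dictionary fields are illustrative assumptions, not part of any OpenTelemetry API):

```python
import hashlib

HEAD_SAMPLE_RATE = 0.10  # the 10% head-based rate listed above


def head_sampled(trace_id: str, rate: float = HEAD_SAMPLE_RATE) -> bool:
    """Head-based: decide up front by hashing the trace ID into [0, 1).

    Hashing keeps the decision deterministic, so every service that sees
    the same trace ID reaches the same keep/drop verdict.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def tail_sampled(spans: list[dict], slow_ms: float = 500.0) -> bool:
    """Tail-based: decide after the trace is complete.

    Assumption: a trace is "interesting" if any span errored or ran
    longer than slow_ms (hypothetical threshold).
    """
    return any(s.get("error") or s.get("duration_ms", 0) > slow_ms for s in spans)


def keep_trace(trace_id: str, spans: list[dict]) -> bool:
    """Keep the trace if either sampler selects it."""
    return head_sampled(trace_id) or tail_sampled(spans)
```

In practice the tail-based decision usually runs in a collector that buffers spans until the trace completes, which is why it costs more memory than the head-based check.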

CORRELATION:
═══════════════════════════════════════

  → Trace ID propagated across logs and traces, and attached to metrics as exemplars
  → Grafana unified view: Click metric → See related logs → Follow trace
  → Correlation attributes: request_id, user_id, session_id

2. SLO/SLI Framework

SERVICE LEVEL OBJECTIVES (SLOs)
═══════════════════════════════════════

API Gateway SLOs:
═══════════════════════════════════════

SLO                    Target    Measurement                    Error Budget (monthly)
──────────────────────────────────────────────────────────────────────────────────────
Availability           99.95%    Uptime / Total time            21.6 minutes
Latency (P95)          < 200ms   Requests < 200ms / Total       5% of requests
Error Rate             < 0.1%    5xx / Total requests           0.1% of requests
Auth Latency (P99)     < 50ms    Requests < 50ms / Total        1% of requests

Error Budget Calculation:
═══════════════════════════════════════

Monthly error budget (Availability):
  → Target: 99.95%
  → Allowed downtime: (1 - 0.9995) × 30 × 24 × 60 = 21.6 minutes
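
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function name and the 30-day default are illustrative choices:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


# 99.95% over 30 days: 0.0005 x 43,200 minutes
print(f"{error_budget_minutes(0.9995):.1f} minutes")  # prints "21.6 minutes"
```

Passing `window_days=28` gives the budget for a 28-day rolling window instead (about 20.2 minutes).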

Error budget burn rate:
═══════════════════════════════════════

  → Current month: 8.0 of 21.6 minutes consumed (37% of budget)
  → Burn rate: Low (green)
  → Alert thresholds (30-day window):
     · FAST burn (14.4x): 2% of budget in 1 hour → Page
     · HIGH burn (6x): 5% of budget in 6 hours → Page
     · MODERATE burn (1x): 10% of budget in 3 days → Warning
     · SLOW burn (0.5x): Budget consumed >80% → Warning
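
Burn rate is the observed error ratio divided by the ratio the SLO allows, so a rate of 1x spends exactly one full budget per window. The sketch below, with illustrative function names and an assumed 30-day window, shows how a burn rate and a duration translate into a share of the budget:

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the window's budget spent by burning at `rate` for `hours`."""
    return rate * hours / (window_days * 24)


# Under a 30-day window, 14.4x for 1 hour spends 2% of the budget,
# and 6x for 6 hours spends 5%.
for rate, hours in [(14.4, 1), (6.0, 6), (1.0, 72)]:
    print(f"{rate}x for {hours}h -> {budget_consumed(rate, hours):.0%} of budget")
```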

SLO DASHBOARD:
═══════════════════════════════════════

  → 28-day rolling window for all SLOs
  → Error budget remaining (visual gauge)
  → Burn rate (current vs threshold)
  → Trend: Improving, stable, or degrading
  → Post-incident: Impact on budget

3. Alerting Strategy

ALERTING FRAMEWORK
═══════════════════════════════════════

ALERT CLASSIFICATION:
═══════════════════════════════════════

Class    Response Time        Escalation         Examples
──────────────────────────────────────────────────────────────────────────────
P1       15 minutes           Page + Escalate    Service down, data loss
P2       30 minutes           Page               Major degradation, error spikes
P3       4 hours              Notify             Warning trend, capacity warning
P4       Next business day    Log only           Informational, trend monitoring

ALERT ROUTING:
═══════════════════════════════════════

  → Alertmanager routes by labels:
     team: { infrastructure, application, database, network, security }
     severity: { critical, warning, info }
     environment: { production, staging, development }

  → Routing rules:
     severity=critical + environment=production → PagerDuty (on-call)
     severity=warning + environment=production → Slack (#alerts)
     severity=info → Email digest (daily)
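
The routing rules above amount to a first-match lookup on alert labels. The Python sketch below mirrors that logic for illustration only. It is not Alertmanager configuration, and the receiver names and the `default` fallback are assumptions:

```python
def route(labels: dict[str, str]) -> str:
    """Return a receiver name for an alert, using first-match routing."""
    severity = labels.get("severity")
    environment = labels.get("environment")

    if severity == "critical" and environment == "production":
        return "pagerduty-oncall"   # hypothetical receiver name
    if severity == "warning" and environment == "production":
        return "slack-alerts"       # hypothetical receiver name
    if severity == "info":
        return "email-digest"       # hypothetical receiver name
    return "default"                # assumption: catch-all for unmatched alerts


print(route({"severity": "critical", "environment": "production", "team": "database"}))
```

The rules as listed do not cover, for example, a critical alert in staging. An explicit catch-all receiver makes that gap visible instead of silently dropping the alert.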

ALERT BEST PRACTICES:
═══════════════════════════════════════

  → Every alert must be actionable
  → No alert without a runbook
  → Alert fatigue prevention:
     · Deduplication (group similar alerts)
     · Inhibition (suppress dependent alerts)
     · Silence windows (maintenance, known issues)
     · Minimum repeat interval: 5 minutes between re-alerts for the same alert
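
Deduplication and the 5-minute re-alert floor can be pictured as a small throttle keyed on a grouping of labels. The sketch below is a simplified model, not Alertmanager's implementation. The class name and the choice of `alertname` and `service` as the grouping key are assumptions:

```python
class AlertThrottle:
    """Suppress repeat notifications for the same alert group."""

    def __init__(self, repeat_interval_s: float = 300.0):
        self.repeat_interval_s = repeat_interval_s  # 5 minutes
        self._last_sent: dict[tuple, float] = {}

    def should_notify(self, labels: dict[str, str], now: float) -> bool:
        # Alerts sharing these labels are treated as one group (deduplication).
        key = (labels.get("alertname"), labels.get("service"))
        last = self._last_sent.get(key)
        if last is not None and now - last < self.repeat_interval_s:
            return False  # still inside the repeat interval
        self._last_sent[key] = now
        return True
```

Passing `now` explicitly (for example from `time.monotonic()`) keeps the class easy to test without sleeping.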

RUNBOOK TEMPLATE:
═══════════════════════════════════════

Alert: High Error Rate (>0.5% 5xx)
═══════════════════════════════════════

  Symptom: 5xx error rate exceeds threshold
  Impact: User-facing errors, potential revenue loss

  Diagnostic steps:
  1. Check Grafana dashboard: "API Error Rate"
  2. Check recent rollouts (last 2 hours): `kubectl get replicasets -n production --sort-by=.metadata.creationTimestamp` (each rollout creates a new ReplicaSet)
  3. Check pod status: `kubectl get pods -n production`
  4. Check logs: `kubectl logs <pod> -n production --tail=100 | grep ERROR`
  5. Check downstream dependencies: Database, Redis, external APIs

  Resolution:
  → If deployment-related: Rollback (`kubectl rollout undo deployment/<name> -n production`)
  → If pod crash: Scale up to restore capacity while investigating (`kubectl scale deployment/<name> -n production --replicas=5`)
  → If database: Check connection pool, restart database proxy
  → If external: Enable circuit breaker, fall back to cached response

  Escalation: If unresolved after 15 minutes → Page senior engineer

4. APM Configuration

APPLICATION PERFORMANCE MONITORING
═══════════════════════════════════════

INSTRUMENTATION (OpenTelemetry):
═══════════════════════════════════════

  Auto-instrumentation (zero-code):
    → Java: Java Agent (opentelemetry-javaagent)
    → Python: opentelemetry-instrument
    → Node.js: @opentelemetry/auto-instrumentations-node
    → .NET: OpenTelemetry .NET Auto-Instrumentation
    → Go: Manual (wrap handlers, middleware)

  Custom instrumentation:
    → Business transactions (checkout, signup, payment)
    → Database queries (slow query detection)
    → External API calls (latency, error rate)
    → Cache hits/misses (Redis, Memcached)

SERVICE MAP:
═══════════════════════════════════════

  web-frontend → api-gateway → auth-service → user-db
                              → order-service → payment-db
                                         → payment-gateway (external)
                              → inventory-service → redis-cache
                              → notification-service → email-svc (external)

  Dependencies auto-discovered from traces
  → Call volume, latency, error rate per edge
  → Bottleneck identification
  → Architecture validation
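
Auto-discovery of dependencies comes down to joining each span to its parent and counting the cross-service pairs. This is a minimal sketch, assuming spans arrive as dictionaries with `span_id`, `parent_id`, and `service` keys (a simplified, hypothetical shape rather than a real exporter format):

```python
from collections import Counter


def service_edges(spans: list[dict]) -> Counter:
    """Count caller -> callee edges between distinct services."""
    by_id = {s["span_id"]: s for s in spans}
    edges: Counter = Counter()
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        # Skip root spans and calls that stay inside one service.
        if parent is not None and parent["service"] != span["service"]:
            edges[(parent["service"], span["service"])] += 1
    return edges


spans = [
    {"span_id": "1", "parent_id": None, "service": "api-gateway"},
    {"span_id": "2", "parent_id": "1", "service": "auth-service"},
    {"span_id": "3", "parent_id": "1", "service": "order-service"},
    {"span_id": "4", "parent_id": "3", "service": "payment-gateway"},
]
for (caller, callee), calls in service_edges(spans).items():
    print(f"{caller} -> {callee}: {calls} call(s)")
```

Latency and error rate per edge could be accumulated in the same loop by summing span durations and error flags per key.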

PERFORMANCE BASELINES:
═══════════════════════════════════════

Service            Avg Latency    P95       P99       Error Rate   Throughput
───────────────────────────────────────────────────────────────────────────────
API Gateway        25ms          85ms      150ms     0.02%        500 rps
Auth Service       15ms          45ms      80ms      0.01%        300 rps
Order Service      35ms          120ms     200ms     0.05%        150 rps
Inventory Service  10ms          30ms      55ms      0.01%        400 rps
Database Query     5ms           15ms      25ms      0.01%        1200 qps
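
The P95 and P99 columns above are percentiles over raw latency samples. The sketch below uses the nearest-rank definition, which is one of several common conventions. Monitoring backends often estimate percentiles from histogram buckets instead, so their values may differ slightly:

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    the data at or below it."""
    if not samples:
        raise ValueError("percentile of empty sample set")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latency samples in milliseconds.
latencies_ms = [12, 18, 22, 25, 25, 31, 40, 85, 150, 420]
print("P95:", percentile(latencies_ms, 95), "ms")
print("avg:", sum(latencies_ms) / len(latencies_ms), "ms")
```

The example shows why baselines track percentiles alongside the average: a single slow request dominates the tail while moving the mean far less.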

5. Log Management

LOG MANAGEMENT STRATEGY
═══════════════════════════════════════

LOG STRUCTURE (JSON):
═══════════════════════════════════════

{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "ERROR",
  "service": "api-gateway",
  "version": "2.1.0",
  "trace_id": "abc123def456",
  "span_id": "789xyz",
  "message": "Payment processing failed",
  "request_id": "req_12345",
  "user_id": "user_67890",
  "error": {
    "type": "TimeoutException",
    "message": "Payment gateway timeout after 5000ms",
    "stack_trace": "..."
  },
  "context": {
    "endpoint": "/api/v2/payments",
    "method": "POST",
    "ip": "192.168.1.100",
    "duration_ms": 5023
  }
}
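
One way an application might emit records in roughly this shape is a custom formatter on Python's standard `logging` module. This is a sketch under assumptions: the `JsonFormatter` class is hypothetical, it covers only the top-level fields, and it expects correlation IDs to be passed through `extra=`:

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    CORRELATION_KEYS = ("trace_id", "span_id", "request_id", "user_id")

    def __init__(self, service: str, version: str):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "version": self.version,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in self.CORRELATION_KEYS:
            value = getattr(record, key, None)
            if value is not None:
                entry[key] = value
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="api-gateway", version="2.1.0"))
logger = logging.getLogger("api-gateway")
logger.addHandler(handler)
logger.error("Payment processing failed",
             extra={"trace_id": "abc123def456", "request_id": "req_12345"})
```

Emitting one JSON object per line lets the collector parse records without multi-line handling.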

LOG AGGREGATION PIPELINE:
═══════════════════════════════════════

  1. Collection: Fluent Bit (sidecar or daemonset)
  2. Parsing: JSON parse, field extraction
  3. Enrichment: Add cluster, namespace, pod labels
  4. Filtering: Drop debug logs in production, redact PII
  5. Routing:
     → Error logs → Elasticsearch (indexed for full-text search)
     → Info logs → Loki (cost-efficient)
     → Audit logs → S3 (compliance, 7-year retention)
  6. Indexing: Time-based indices (daily)
  7. Retention:
     → Hot: 7 days (Elasticsearch)
     → Warm: 30 days (Loki)
     → Cold: 1 year (S3)
     → Archive: 7 years (S3 Glacier, compliance)
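
Steps 4 and 5 of the pipeline (filtering and routing) can be sketched as one function over a parsed record. In a real deployment this logic would live in the collector's configuration rather than in Python. The `audit` flag, the destination names, and the use of email addresses as the only PII pattern are assumptions made for illustration:

```python
import re

# Assumption: email addresses stand in for PII; real rules need to be broader.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def process(record: dict, environment: str = "production"):
    """Return (destination, cleaned_record), or None if the record is dropped."""
    level = str(record.get("level", "INFO")).upper()

    # Filtering: drop debug logs in production.
    if environment == "production" and level == "DEBUG":
        return None

    # Filtering: redact PII before the record leaves the pipeline.
    cleaned = dict(record)
    cleaned["message"] = EMAIL_RE.sub("[REDACTED]", str(record.get("message", "")))

    # Routing: audit first, then by level.
    if cleaned.get("audit"):          # hypothetical marker for audit events
        return "s3", cleaned
    if level in ("ERROR", "CRITICAL"):
        return "elasticsearch", cleaned
    return "loki", cleaned


print(process({"level": "ERROR", "message": "Payment failed for jane@example.com"}))
```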

LOG ANALYSIS:
═══════════════════════════════════════

  → Error patterns: Group by error type (identify common issues)
  → Anomaly detection: Spike in error rate
  → Correlation: Link logs to traces (trace_id)
  → Alerting: Error rate threshold, pattern match
  → Audit: Access logs, admin actions, data changes
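
Grouping by error type, the first item above, is a counting exercise once logs are structured. This is a minimal sketch, assuming records follow the JSON shape shown earlier with a nested `error.type` field. The function name is illustrative:

```python
from collections import Counter


def top_error_types(records: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Return the n most frequent error types among ERROR-level records."""
    counts = Counter(
        r["error"]["type"]
        for r in records
        if r.get("level") == "ERROR" and "type" in r.get("error", {})
    )
    return counts.most_common(n)


# Hypothetical sample records.
records = [
    {"level": "ERROR", "error": {"type": "TimeoutException"}},
    {"level": "ERROR", "error": {"type": "TimeoutException"}},
    {"level": "ERROR", "error": {"type": "ValidationError"}},
    {"level": "INFO"},
]
print(top_error_types(records))
```

Counting the same way per time bucket and comparing against a trailing average is one simple route to the spike detection mentioned above.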

Edge Cases

High volume: Multi-tenancy, sampling, log rotation, cost optimization
Regulated industries: Log retention requirements, access controls, audit trails
Multi-region: Centralized vs distributed logging
Cost management: Log volume control, tiered storage
Real-time: Streaming logs for incident response

Integration Points

Metrics: Prometheus, Datadog, New Relic, CloudWatch
Logs: Elasticsearch, Loki, Splunk, CloudWatch Logs
Traces: Jaeger, Tempo, X-Ray, Zipkin
APM: Datadog APM, New Relic, Dynatrace, AppDynamics
Alerting: Alertmanager, PagerDuty, Opsgenie
Dashboards: Grafana, Kibana, Datadog Dashboards

Output

Observability Status

OBSERVABILITY STATUS — Production
═══════════════════════════════════════

Services monitored: 28 (100% instrumented)
SLO compliance:
  Availability: 99.97% (target: 99.95%) ✓
  P95 Latency: 82ms (target: <200ms) ✓
  Error Rate: 0.03% (target: <0.1%) ✓

Error budget: 63% remaining (on track)
Active alerts: 0
Pending investigations: 1 (minor latency trend)

Logs: 2.4TB/day (within budget)
Traces: 10M spans/day (10% sampling)

Disclaimer: All rights reserved by Circulos AI. These skills are specifically designed for Claude Code, Claude Cowork, Codex, and OpenClaw. When using or referencing any skill, please provide proper attribution to Circulos AI.