---
name: site-reliability-engineering
description: Implement SRE practices with SLIs, SLOs, SLAs, error budgets, service level indicators, reliability engineering, incident management, postmortems, toil reduction, capacity planning, and change management. Use when defining SLOs, managing error budgets, reducing toil, conducting blameless postmortems, planning capacity, implementing runbooks, or improving service reliability. Triggers on phrases like "SRE", "site reliability engineering", "SLO", "SLI", "SLA", "error budget", "toil reduction", "blameless postmortem", "reliability engineering", "capacity planning", "runbook", "incident management SRE", "service reliability", "availability target", "reliability metrics", "SRE practices", "site reliability".
---

# Site Reliability Engineering

Bridge the gap between development and operations with data-driven reliability practices.

## Workflow

1. Define service reliability requirements: SLIs, SLOs, SLAs aligned to business needs and user experience.
2. Implement error budget management: calculate remaining budget, set consumption policies, define burn rate alerts.
3. Automate operational toil: identify repetitive tasks, build automation, measure toil reduction, eliminate manual work.
4. Establish incident management: on-call rotation, incident response playbook, severity classification, communication plan.
5. Conduct blameless postmortems: document what happened, why, and how to prevent recurrence without assigning blame.
6. Perform capacity planning: forecast resource needs, right-size infrastructure, optimize costs while maintaining reliability.
7. Manage changes safely: canary deployments, feature flags, progressive rollouts, automated rollback, change approval.
8. Monitor and report: reliability dashboards, SLO compliance reports, reliability trend analysis, executive summaries.

## SRE Framework

### SLI/SLO/SLA Architecture

```
SRE FRAMEWORK — SLI/SLO/SLA ARCHITECTURE
===========================================

Organization: 47 services managed under SRE model (38 microservices, 8 legacy, 1 monolith)
SRE Team: 12 engineers (about 1 SRE per 4 services on average)
SRE Maturity: Level 3 of 4 (Advanced — automated remediation, predictive analytics, self-service reliability)

SERVICE LEVEL INDICATORS (SLIs):
  SLIs are quantitative measures of service quality, usually expressed as the percentage of good events among valid events (0-100%)

  SLI Categories:
    Availability: Is the service responding to requests? (HTTP 2xx/3xx vs 5xx)
    Latency: How fast does the service respond? (request duration percentiles)
    Throughput: How much work can the service handle? (requests per second)
    Correctness: Are responses correct? (business logic validation, data integrity)
    Freshness: How current is the data? (time since last successful data sync)

  SLI Definition Framework:
    ┌──────────────────────────────────────────────────────────────────────────────────┐
    │ SLI: Availability (API Response Success Rate)                                    │
    │  Formula: (HTTP 2xx + HTTP 3xx) / Total HTTP requests × 100                     │
    │  Exclusions: HTTP 4xx (client errors — not service fault), health checks, synth. │
    │  Window: Rolling 30 days (continuous evaluation)                                │
    │  Granularity: 5-minute windows (for burn rate calculation)                       │
    │  Source: API Gateway access logs + APM traces                                   │
    │  Owner: API Platform Team                                                        │
    │  Reviewed: Quarterly (last: Q4 2024)                                            │
    └──────────────────────────────────────────────────────────────────────────────────┘
    
    ┌──────────────────────────────────────────────────────────────────────────────────┐
    │ SLI: Latency (p95 Request Duration)                                              │
    │  Formula: % of requests completing within target latency (e.g., < 300ms)         │
    │  Target: 95% of requests complete within service-specific latency target         │
    │  Window: Rolling 30 days                                                         │
    │  Granularity: 1-minute windows                                                   │
    │  Source: APM distributed traces + API Gateway logs                               │
    │  Exclusions: Long-running operations (file upload, report generation, export)    │
    │  Owner: Service owner team                                                       │
    └──────────────────────────────────────────────────────────────────────────────────┘

    ┌──────────────────────────────────────────────────────────────────────────────────┐
    │ SLI: Correctness (Payment Processing Accuracy)                                   │
    │  Formula: Successful payments / Total payment attempts × 100                     │
    │  Target: 99.99% (4 nines — payment-critical)                                    │
    │  Window: Rolling 30 days                                                         │
    │  Source: Payment service transaction logs + reconciliation system                │
    │  Exclusions: Insufficient funds, card declined by bank (not service errors)      │
    │  Owner: Payments Team                                                            │
    └──────────────────────────────────────────────────────────────────────────────────┘

SERVICE LEVEL OBJECTIVES (SLOs):
  SLOs are targets for SLIs, defining acceptable reliability levels

  ┌────────────────────────┬──────────────────┬──────────────────┬──────────────────┬──────────────────────┐
  │ Service                │ Availability SLO │ Latency SLO      │ Error Rate SLO   │ SLO Type             │
  ├────────────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────────┤
  │ API Gateway            │ 99.95%           │ p95 < 50ms       │ < 0.05%          │ 3.5 nines (critical) │
  │ Auth Service           │ 99.99%           │ p95 < 100ms      │ < 0.01%          │ 4 nines (critical)   │
  │ User Service           │ 99.9%            │ p95 < 150ms      │ < 0.1%           │ 3 nines (standard)   │
  │ Order Service          │ 99.95%           │ p95 < 300ms      │ < 0.05%          │ 3.5 nines (critical) │
  │ Payment Service        │ 99.99%           │ p95 < 500ms      │ < 0.001%         │ 4 nines (critical)   │
  │ Search Service         │ 99.9%            │ p95 < 200ms      │ < 0.1%           │ 3 nines (standard)   │
  │ Notification Service   │ 99.5%            │ p95 < 100ms      │ < 0.5%           │ 2.5 nines (lower)    │
  │ Inventory Service      │ 99.9%            │ p95 < 150ms      │ < 0.1%           │ 3 nines (standard)   │
  │ Recommendation Engine  │ 99.5%            │ p95 < 400ms      │ < 0.5%           │ 2.5 nines (lower)    │
  │ Analytics Aggregator   │ 99.0%            │ N/A (batch)      │ < 1.0%           │ 2 nines (batch)      │
  │ File Processing        │ 99.5%            │ N/A (async)      │ < 0.5%           │ 2.5 nines (async)    │
  │ Webhook Dispatcher     │ 99.9%            │ p95 < 100ms      │ < 0.1%           │ 3 nines (standard)   │
  └────────────────────────┴──────────────────┴──────────────────┴──────────────────┴──────────────────────┘

SERVICE LEVEL AGREEMENTS (SLAs):
  SLAs are external commitments to customers (usually looser than the internal SLOs, leaving a safety margin)

  Customer SLA Commitments:
    Enterprise customers: 99.9% availability (43.2 minutes allowed downtime per 30-day month)
    Standard customers: 99.5% availability (3.6 hours allowed downtime per 30-day month)
    Free tier: Best effort (no SLA commitment)
  
  SLA Penalties (Enterprise contracts):
    ≥ 99.9%: No penalty (within SLA)
    99.0% to < 99.9%: 10% service credit
    95.0% to < 99.0%: 25% service credit
    90.0% to < 95.0%: 50% service credit
    < 90.0%: 100% service credit + contract review
  
  SLA compliance (last 12 months):
    Months within SLA: 12/12 (100%), no SLA penalties paid
    Closest to breach: Month 4 (99.92%, about 34.6 of the 43.2 allowed minutes consumed by an incident)
    Average monthly compliance: 99.97%

ERROR BUDGET MANAGEMENT:
  Error budget = 100% - SLO target (the allowable error rate before SLO is breached)

  ┌────────────────────────┬──────────────────┬──────────────────┬──────────────────┬────────────────────┐
  │ Service                │ Error Budget     │ Budget Consumed  │ Budget Remaining │ Budget Status      │
  │                        │ (30-day month)   │ (this month)     │ (this month)     │ (% consumed)       │
  ├────────────────────────┼──────────────────┼──────────────────┼──────────────────┼────────────────────┤
  │ API Gateway            │ 21.6 min         │ 2.7 min          │ 18.9 min         │ GREEN (12.5%)      │
  │ Auth Service           │ 259.2 sec        │ 16.8 sec         │ 242.4 sec        │ GREEN (6.5%)       │
  │ User Service           │ 43.2 min         │ 6.5 min          │ 36.7 min         │ GREEN (15.0%)      │
  │ Order Service          │ 21.6 min         │ 3.8 min          │ 17.8 min         │ GREEN (17.4%)      │
  │ Payment Service        │ 259.2 sec        │ 27.0 sec         │ 232.2 sec        │ GREEN (10.4%)      │
  │ Search Service         │ 43.2 min         │ 9.6 min          │ 33.6 min         │ GREEN (22.2%)      │
  │ Notification Service   │ 3.6 hours        │ 1.77 hours       │ 1.83 hours       │ YELLOW (49.3%)     │
  │ Inventory Service      │ 43.2 min         │ 7.5 min          │ 35.7 min         │ GREEN (17.4%)      │
  │ Recommendation Engine  │ 3.6 hours        │ 2.07 hours       │ 1.53 hours       │ YELLOW (57.5%)     │
  │ Webhook Dispatcher     │ 43.2 min         │ 3.6 min          │ 39.6 min         │ GREEN (8.3%)       │
  └────────────────────────┴──────────────────┴──────────────────┴──────────────────┴────────────────────┘

ERROR BUDGET POLICY:
  Green zone (> 70% budget remaining): Normal operations, all changes allowed
  Yellow zone (30-70% budget remaining): Caution, require SRE approval for risky changes
  Red zone (< 30% budget remaining): Freeze non-critical changes, reliability focus mode
  Exhausted (0% budget remaining): ALL changes frozen (except bug fixes), reliability war room
  
  Current policy enforcement:
    Changes blocked by error budget: 3 (last 30 days)
      1. Feature release delayed (Notification Service: 49% of budget consumed, risky deployment)
      2. Infrastructure change postponed (Recommendation Engine: 57% of budget consumed, VM migration)
      3. Database migration rescheduled (Inventory Service: entered yellow zone during migration)
    Reliability focus periods: 2 (last 30 days, average duration: 4 days each)
```
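
The error budget arithmetic and the policy zones above can be sketched in a few lines of Python. This is a minimal illustration, not part of any particular tool: the function names are hypothetical, and it assumes a fixed 30-day window (43,200 minutes) to match the rolling 30-day SLI window.

```python
MINUTES_PER_WINDOW = 30 * 24 * 60  # assumed rolling 30-day window = 43,200 minutes


def error_budget_minutes(slo_percent: float, window_minutes: float = MINUTES_PER_WINDOW) -> float:
    """Allowed unavailability in the window: (100% - SLO) of its length."""
    if not 0.0 < slo_percent < 100.0:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    return (100.0 - slo_percent) / 100.0 * window_minutes


def percent_remaining(consumed_minutes: float, budget_minutes: float) -> float:
    """Share of the budget still unspent, clamped to the range 0-100."""
    remaining = 100.0 * (1.0 - consumed_minutes / budget_minutes)
    return max(0.0, min(100.0, remaining))


def policy_zone(remaining_percent: float) -> str:
    """Map remaining budget to the zones of the error budget policy."""
    if remaining_percent <= 0.0:
        return "EXHAUSTED"
    if remaining_percent < 30.0:
        return "RED"
    if remaining_percent <= 70.0:
        return "YELLOW"
    return "GREEN"


if __name__ == "__main__":
    for slo in (99.0, 99.5, 99.9, 99.95, 99.99):
        print(f"SLO {slo}%: {error_budget_minutes(slo):.2f} min per 30 days")
    budget = error_budget_minutes(99.9)
    left = percent_remaining(consumed_minutes=10.0, budget_minutes=budget)
    print(f"{left:.1f}% remaining -> {policy_zone(left)}")
```

Keeping the zone lookup separate from the arithmetic makes it easy to change the thresholds (70% and 30% here) without touching the budget calculation.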

### Burn Rate Alerting

```
BURN RATE ALERTING — MULTI-WINDOW APPROACH
============================================

Alerting Strategy: Multi-window burn rate alerts (Google SRE best practice)
Purpose: Detect error budget consumption too fast without alert fatigue from transient spikes

ALERT WINDOWS & THRESHOLDS:
  Time to exhaust a 30-day budget at a sustained burn rate = 30 days / burn rate

  Short window (detect rapid burns, i.e. incidents):
    1-minute window: Alert if error rate > 14.4x budget rate (consumes full budget in about 2 days)
    5-minute window: Alert if error rate > 6x budget rate (consumes full budget in 5 days)
    30-minute window: Alert if error rate > 3x budget rate (consumes full budget in 10 days)
  
  Long window (detect slow burns, i.e. degradation):
    1-hour window: Alert if error rate > 1.7x budget rate (consumes full budget in about 18 days)
    6-hour window: Alert if error rate > 1.4x budget rate (consumes full budget in about 21 days)
    1-day window: Alert if error rate > 1.0x budget rate (on track to exhaust budget within the 30-day window)

ALERT FIRING LOGIC:
  Page (immediate notification — PagerDuty):
    Condition: 1-minute OR 5-minute burn rate exceeded
    Severity: P1 (critical) — likely active incident
    Response: On-call engineer investigates immediately
  
  Warning (Slack notification — team alert):
    Condition: 30-minute OR 1-hour burn rate exceeded
    Severity: P2 (high) — potential degradation
    Response: Team investigates within 1 hour
  
  Info (dashboard update — no notification):
    Condition: 6-hour OR 1-day burn rate exceeded
    Severity: P3 (medium) — slow burn, plan remediation
    Response: Track in sprint, plan fix within 1 week

BURN RATE ALERT EFFECTIVENESS (Last 30 Days):
  Total alerts fired: 28
    Page-level (P1): 4 (14.3%) — all valid incidents, rapid error spikes
    Warning-level (P2): 12 (42.9%) — 8 valid degradation, 4 false positives (transient)
    Info-level (P3): 12 (42.9%) — 10 valid slow burns, 2 false positives (planned maintenance)
  
  Alert accuracy: 78.6% (22 of 28 alerts were valid signals)
  Mean time to alert: 2 minutes (from error spike to notification)
  Mean time to respond: 8 minutes (from alert to engineer acknowledgment)
  Mean time to resolve: 22 minutes (from acknowledgment to error rate normalized)

  False positive reduction actions:
    1. Added 30-second minimum duration (prevent 1-minute transient spikes from firing)
    2. Exclude planned maintenance windows (tag in change management system)
    3. Exclude known noisy services during deployment (auto-pause alerts during rolling deploy)
    4. Require 2 of 3 short-window alerts to fire (not just 1 — reduce noise)
```
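
As a rough sketch of the burn rate logic above, the Python below computes a burn rate from an observed error ratio, estimates how long a sustained burn takes to exhaust a 30-day budget, and maps per-window burn rates to the page, warning, and info levels. The window keys (`"1m"`, `"5m"`, and so on) and function names are illustrative assumptions. How the per-window error ratios are queried from a metrics backend is out of scope here.

```python
from typing import Mapping, Optional

# Thresholds from the alert windows above (multiples of the budget rate).
THRESHOLDS = {"1m": 14.4, "5m": 6.0, "30m": 3.0, "1h": 1.7, "6h": 1.4, "1d": 1.0}
# Alert level -> windows that trigger it, checked from most to least urgent.
LEVELS = (
    ("page", ("1m", "5m")),
    ("warning", ("30m", "1h")),
    ("info", ("6h", "1d")),
)


def burn_rate(error_ratio: float, slo_percent: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    allowed_ratio = (100.0 - slo_percent) / 100.0
    return error_ratio / allowed_ratio


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until the full budget is spent if this burn rate is sustained."""
    return float("inf") if rate <= 0 else window_days / rate


def alert_level(rates: Mapping[str, float]) -> Optional[str]:
    """Return the most urgent level whose windows exceed their thresholds."""
    for level, windows in LEVELS:
        if any(rates.get(w, 0.0) > THRESHOLDS[w] for w in windows):
            return level
    return None


if __name__ == "__main__":
    rate = burn_rate(error_ratio=0.006, slo_percent=99.9)
    print(f"burn rate {rate:.1f}x, budget gone in {days_to_exhaustion(rate):.1f} days")
    print(alert_level({"5m": rate, "1h": 2.0}))
```

A burn rate of 1.0x spends the budget exactly over the 30-day window, so any sustained value above 1.0x ends in a breach before the window closes.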

## Incident Management

### Incident Response Framework

```
INCIDENT MANAGEMENT — SRE FRAMEWORK
=====================================

Incident Management Platform: PagerDuty (alerting) + Jira Service Management (tracking) + Slack (communication)
On-Call Rotation: 12 SRE engineers, 1-week rotations, 4 on-call tiers (L1 → L4)
Escalation Policy: 15 minutes L1 → L2, 30 minutes L2 → L3, 60 minutes L3 → L4 (manager)

INCIDENT SEVERITY CLASSIFICATION:
  ┌─────────────┬─────────────────────────────────────────────┬──────────────────┬──────────────────┐
  │ Severity    │ Criteria                                    │ Response Time    │ Escalation Time  │
  ├─────────────┼─────────────────────────────────────────────┼──────────────────┼──────────────────┤
  │ SEV1        │ Service down, data loss, security breach,   │ 5 minutes        │ 15 minutes       │
  │ (Critical)  │ revenue impact > $10K/hour, regulatory risk │                  │ (L1 → L2)        │
  ├─────────────┼─────────────────────────────────────────────┼──────────────────┼──────────────────┤
  │ SEV2        │ Service degraded, significant user impact,  │ 15 minutes       │ 30 minutes       │
  │ (High)      │ revenue impact $1K-$10K/hour                │                  │ (L2 → L3)        │
  ├─────────────┼─────────────────────────────────────────────┼──────────────────┼──────────────────┤
  │ SEV3        │ Minor degradation, workaround available,    │ 1 hour           │ 2 hours          │
  │ (Medium)    │ revenue impact < $1K/hour                   │                  │ (L3 → L4)        │
  ├─────────────┼─────────────────────────────────────────────┼──────────────────┼──────────────────┤
  │ SEV4        │ Cosmetic issue, no user impact,             │ Next business    │ N/A              │
  │ (Low)       │ internal tooling issue                      │ day              │                  │
  └─────────────┴─────────────────────────────────────────────┴──────────────────┴──────────────────┘

INCIDENT STATISTICS (Last 90 Days):
  Total incidents: 47
    SEV1: 3 (6.4%) — 3 critical incidents
    SEV2: 12 (25.5%) — 12 high-priority incidents
    SEV3: 22 (46.8%) — 22 medium-priority incidents
    SEV4: 10 (21.3%) — 10 low-priority incidents
  
  Mean Time to Detect (MTTD): 7.2 minutes (SEV1: 4.3 min, SEV2: 5.3 min, SEV3: 8.7 min)
  Mean Time to Acknowledge (MTTA): 10.4 minutes (SEV1: 3.2 min, SEV2: 8.5 min, SEV3: 12.4 min)
  Mean Time to Resolve (MTTR): 32.2 minutes (SEV1: 46.3 min, SEV2: 42 min, SEV3: 25 min)
  Overall figures are incident-count-weighted means across SEV1-SEV3
  
  Incident trends:
    September 2024: 18 incidents (1 SEV1, 5 SEV2, 8 SEV3, 4 SEV4)
    October 2024: 14 incidents (0 SEV1, 4 SEV2, 6 SEV3, 4 SEV4) — improvement
    November 2024: 15 incidents (2 SEV1, 3 SEV2, 8 SEV3, 2 SEV4) — SEV1 regression

SEV1 INCIDENT LOG (Last 90 Days):
  Incident 1: Payment Service Outage (September 12, 2024)
    Duration: 47 minutes (detect: 2 min, resolve: 45 min)
    Impact: 12,000 failed payment attempts, estimated revenue loss: $48,000
    Root cause: Database connection pool exhaustion (deploy increased pool size, reduced timeout)
    Resolution: Rolled back deployment + increased pool limit + added circuit breaker
    Postmortem: Blameless — deployment process improved (added connection pool health check)
    
  Incident 2: Auth Service Certificate Expiry (October 28, 2024)
    Duration: 23 minutes (detect: 8 min, resolve: 15 min)
    Impact: 45,000 failed authentication attempts, all services dependent on auth affected
    Root cause: TLS certificate expired (auto-renewal failed due to DNS propagation delay)
    Resolution: Emergency certificate renewal + manual DNS update
    Postmortem: Blameless — cert-manager configuration fixed + earlier expiry alerting (14 days)
    
  Incident 3: DDoS Attack on API Gateway (November 15, 2024)
    Duration: 82 minutes (detect: 3 min, resolve: 79 min)
    Impact: Service degraded to 40% capacity, 23% of requests dropped
    Root cause: Volumetric DDoS attack (2.4 Tbps from 12,000+ IPs)
    Resolution: Activated DDoS mitigation (AWS Shield Advanced) + rate limiting + IP blocking
    Postmortem: Blameless — DDoS response playbook created + automatic Shield activation

BLAMELESS POSTMORTEM TEMPLATE:
  1. SUMMARY (executive overview):
     What happened, when, impact, resolution time
  
  2. TIMELINE (minute-by-minute reconstruction):
     - 14:23: Incident detected (alert fired)
     - 14:25: On-call acknowledged
     - 14:30: War room established
     - 14:35: Root cause identified
     - 14:45: Fix deployed
     - 14:50: Service recovered
     - 15:00: Monitoring confirmed stable
  
  3. ROOT CAUSE (technical explanation):
     What failed, why it failed, contributing factors
  
  4. IMPACT (quantified):
     Users affected, revenue impact, SLI/SLO impact, error budget consumed
  
  5. RESOLUTION (what fixed it):
     Immediate fix, permanent fix, verification
  
  6. PREVENTION (action items):
     - [ ] Add monitoring for X (owner, due date)
     - [ ] Implement circuit breaker for Y (owner, due date)
     - [ ] Update runbook for Z (owner, due date)
     - [ ] Add automated test for W (owner, due date)
  
  7. LESSONS LEARNED:
     What went well, what could be improved, systemic changes needed
  
  8. ACKNOWLEDGMENTS:
     Thank people who responded, no blame assigned
```
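
The severity criteria in the classification table can be expressed as a small decision function, and an overall mean such as MTTR can be computed as a count-weighted mean of per-severity values. The sketch below is illustrative only: `IncidentSignal` and its fields are hypothetical names, and real triage involves judgment that a revenue threshold cannot capture.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class IncidentSignal:
    """Hypothetical summary of what is known when an incident is declared."""
    revenue_loss_per_hour: float  # USD per hour
    service_down: bool = False
    data_loss_or_breach: bool = False
    user_visible: bool = True


def classify_severity(sig: IncidentSignal) -> str:
    """Apply the SEV1-SEV4 criteria from the classification table."""
    if sig.service_down or sig.data_loss_or_breach or sig.revenue_loss_per_hour > 10_000:
        return "SEV1"
    if sig.revenue_loss_per_hour >= 1_000:
        return "SEV2"
    if sig.user_visible:
        return "SEV3"
    return "SEV4"


def weighted_mean(values: Mapping[str, float], counts: Mapping[str, int]) -> float:
    """Mean of per-severity values, weighted by incident count per severity."""
    total = sum(counts[sev] for sev in values)
    if total == 0:
        raise ValueError("no incidents to average over")
    return sum(values[sev] * counts[sev] for sev in values) / total


if __name__ == "__main__":
    print(classify_severity(IncidentSignal(revenue_loss_per_hour=5_000)))
    print(weighted_mean({"SEV1": 60.0, "SEV2": 40.0}, {"SEV1": 1, "SEV2": 3}))
```

Weighting by incident count matters because a plain average of the per-severity means over-represents rare severities.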

## Toil Reduction

### Toil Management Framework

```
TOIL REDUCTION — SRE FRAMEWORK
================================

Toil Definition: Manual, repetitive, automatable work that scales linearly with service scale
Toil Target: < 50% of SRE team capacity (Google SRE standard)
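Current Toil: about 4.4% of capacity (85 inventoried hours/month across 12 engineers, assuming 160 working hours per engineer per month; target met)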

TOIL INVENTORY (Current Monthly Hours):
  ┌───────────────────────────────────────┬──────────┬──────────┬────────────────────┐
  │ Task                                  │ Hours/mo │ Engineers│ Automatable?       │
  ├───────────────────────────────────────┼──────────┼──────────┼────────────────────┤
  │ SSL certificate renewal               │ 4        │ 1        │ YES: cert-manager  │
  │ Database backup verification          │ 8        │ 1        │ YES: automated     │
  │                                       │          │          │   script           │
  │ Log rotation management               │ 3        │ 1        │ YES: automated     │
  │                                       │          │          │   via config       │
  │ On-call incident triage               │ 16       │ 2        │ PARTIAL: auto-     │
  │                                       │          │          │   triage for known │
  │                                       │          │          │   patterns         │
  │ Capacity planning reports             │ 6        │ 1        │ YES: automated     │
  │                                       │          │          │   dashboards       │
  │ Access request fulfillment            │ 10       │ 1        │ YES: self-service  │
  │                                       │          │          │   portal           │
  │ Environment provisioning              │ 8        │ 1        │ YES: Terraform     │
  │                                       │          │          │   + GitOps         │
  │ Compliance evidence collection        │ 12       │ 1        │ PARTIAL: CSPM      │
  │                                       │          │          │   auto-collect     │
  │ Performance baseline reports          │ 4        │ 1        │ YES: automated     │
  │                                       │          │          │   reports          │
  │ Deployment rollback (manual)          │ 2        │ 1        │ YES: auto-rollback │
  │ Health check response                 │ 6        │ 1        │ PARTIAL: auto-     │
  │                                       │          │          │   respond          │
  │ Other toil (miscellaneous)            │ 6        │ 1        │ Varies             │
  └───────────────────────────────────────┴──────────┴──────────┴────────────────────┘
  Total toil hours: 85 hours/month (across 12 SRE engineers)
  Per engineer: 7.1 hours/month (about 1.6 hours/week)
  Toil percentage: about 4.4% of SRE capacity, assuming 160 working hours per engineer per month (target: < 50%) ✓

TOIL REDUCTION PROJECTS (Last 6 Months):
  Completed (first-year ROI = hours saved over 12 months / hours invested, assuming 8-hour days):
    1. Automated SSL certificate renewal (cert-manager)
       Before: 4 hours/month manual renewal
       After: 0.5 hours/month monitoring
       Savings: 3.5 hours/month | Effort: 2 days | First-year ROI: 2.6x
    
    2. Automated database backup verification
       Before: 8 hours/month manual verification (8 databases × 1 hour each)
       After: 0.25 hours/month (script runs, alerts on failure)
       Savings: 7.75 hours/month | Effort: 3 days | First-year ROI: 3.9x
    
    3. Self-service access request portal
       Before: 10 hours/month manual access provisioning
       After: 1 hour/month (exception handling only)
       Savings: 9 hours/month | Effort: 1 week | First-year ROI: 2.7x
    
    4. Automated environment provisioning (Terraform + GitOps)
       Before: 8 hours/month manual environment setup
       After: 0.5 hours/month (merge PR, ArgoCD syncs)
       Savings: 7.5 hours/month | Effort: 2 weeks | First-year ROI: 1.1x
    
    5. Automated performance baseline reports
       Before: 4 hours/month manual report generation
       After: 0 hours (automated Grafana reports, emailed weekly)
       Savings: 4 hours/month | Effort: 2 days | First-year ROI: 3.0x

  In Progress:
    6. Auto-triage for known incident patterns
       Current: 16 hours/month manual triage
       Target: 4 hours/month (75% automation for known patterns)
       ETA: 2 months
  
    7. Automated compliance evidence collection
       Current: 12 hours/month manual collection
       Target: 2 hours/month (CSPM auto-collection + exception handling)
       ETA: 3 months

RUNBOOK LIBRARY:
  Total runbooks: 48 (covering all services + infrastructure components)
  Runbook quality:
    Excellent (step-by-step, automated commands, decision trees): 28 (58.3%)
    Good (step-by-step, manual commands): 14 (29.2%)
    Fair (high-level guidance): 6 (12.5%) — improvement needed
  
  Runbook usage (last 30 days):
    Consulted during incidents: 22 runbooks (45.8% of total)
    Most used runbooks:
      1. Database connection pool exhaustion (8 uses)
      2. API Gateway high latency (6 uses)
      3. Kubernetes pod crash loop (5 uses)
      4. Certificate expiry (4 uses)
      5. Redis memory pressure (3 uses)
  
  Runbook automation rate:
    Fully automated (runbook executes fix): 12 (25.0%)
    Semi-automated (runbook provides commands): 24 (50.0%)
    Manual (runbook provides guidance only): 12 (25.0%)
    Target: 50% fully automated by end of 2025
```
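
One way to compare toil reduction projects is by first-year return and payback time. The sketch below assumes ROI means hours saved over 12 months divided by hours invested, and assumes 160 working hours per engineer per month when turning toil hours into a share of capacity. Both are illustrative assumptions, and the function names are hypothetical.

```python
def first_year_roi(hours_saved_per_month: float, effort_hours: float) -> float:
    """Hours saved in the first 12 months per hour invested in automation."""
    if effort_hours <= 0:
        raise ValueError("effort_hours must be positive")
    return 12.0 * hours_saved_per_month / effort_hours


def payback_months(hours_saved_per_month: float, effort_hours: float) -> float:
    """Months until the cumulative savings equal the automation effort."""
    if hours_saved_per_month <= 0:
        return float("inf")
    return effort_hours / hours_saved_per_month


def toil_percentage(toil_hours_per_month: float, engineers: int,
                    capacity_hours_per_engineer: float = 160.0) -> float:
    """Toil as a percentage of total team capacity (160 h/month is an assumption)."""
    return 100.0 * toil_hours_per_month / (engineers * capacity_hours_per_engineer)


if __name__ == "__main__":
    # Example: a project saving 3.5 hours/month that took 2 days (16 hours).
    print(f"ROI: {first_year_roi(3.5, 16):.1f}x")
    print(f"Payback: {payback_months(3.5, 16):.1f} months")
    print(f"Toil share: {toil_percentage(85, 12):.1f}%")
```

Payback time is often easier to reason about than ROI when effort is large, since it shows how long the team carries the investment before it breaks even.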

## Capacity Planning

### Capacity Management Framework

```
CAPACITY PLANNING — SRE FRAMEWORK
===================================

Planning Horizon: 12 months (monthly review, quarterly deep dive, annual forecast)
Forecasting Method: Time-series analysis (prophet) + business input + seasonality adjustment
Planning Frequency: Monthly review + quarterly capacity planning meeting

CAPACITY METRICS:
  ┌────────────────────────┬──────────────────┬──────────────────┬──────────────────┬────────────────────┐
  │ Resource               │ Current Usage    │ Peak Usage       │ Capacity Limit   │ Utilization Target │
  ├────────────────────────┼──────────────────┼──────────────────┼──────────────────┼────────────────────┤
  │ Compute (CPU)          │ 62%              │ 78%              │ 100%             │ 60-70%             │
  │ Memory                 │ 71%              │ 85%              │ 100%             │ 65-75%             │
  │ Storage                │ 48%              │ 52%              │ 100%             │ 50-65%             │
  │ Network bandwidth      │ 35%              │ 58%              │ 100%             │ 40-60%             │
  │ Database connections   │ 55%              │ 72%              │ 100%             │ 50-65%             │
  │ Kubernetes nodes       │ 68%              │ 82%              │ 100%             │ 60-75%             │
  │ Redis memory           │ 42%              │ 55%              │ 100%             │ 40-60%             │
  │ Message queue depth    │ 28%              │ 65%              │ 100%             │ 30-50%             │
  └────────────────────────┴──────────────────┴──────────────────┴──────────────────┴────────────────────┘

  Utilization target rationale:
    Keep 30-40% headroom for traffic spikes, burst handling, and graceful degradation
    Above 75%: Risk of cascading failures during peak
    Below 40%: Over-provisioned (cost waste)

CAPACITY FORECAST (Next 12 Months):
  Growth assumptions:
    Traffic growth: 15% year-over-year (based on business plan)
    User growth: 20% year-over-year (based on marketing projections)
    Data growth: 25% year-over-year (based on current trends)
    Seasonal peaks: Black Friday (+300%), holiday season (+200%), product launches (+150%)
  
  ┌────────────────────────┬──────────────────┬──────────────────┬──────────────────────┐
  │ Month                  │ Projected Traffic│ Capacity Needed  │ Action Required      │
  ├────────────────────────┼──────────────────┼──────────────────┼──────────────────────┤
  │ Jan 2025               │ 100% (baseline)  │ 100% (current)   │ None                 │
  │ Feb 2025               │ 102%             │ 100%             │ None (within buffer) │
  │ Mar 2025               │ 105%             │ 100%             │ None                 │
  │ Apr 2025               │ 108%             │ 100%             │ None                 │
  │ May 2025               │ 110%             │ 105%             │ Plan scaling         │
  │ Jun 2025               │ 112%             │ 110%             │ Add 2 K8s nodes      │
  │ Jul 2025               │ 115%             │ 115%             │ Scale database       │
  │ Aug 2025               │ 118%             │ 115%             │ Add cache layer      │
  │ Sep 2025               │ 120%             │ 120%             │ Scale compute        │
  │ Oct 2025               │ 122%             │ 125%             │ Add CDN edge nodes   │
  │ Nov 2025               │ 160% (Black Fri) │ 160%             │ Pre-scale (Oct 15)   │
  │ Dec 2025               │ 145% (holiday)   │ 140%             │ Scale down (Dec 26)  │
  └────────────────────────┴──────────────────┴──────────────────┴──────────────────────┘

CAPACITY COST PROJECTION:
  Current monthly cloud spend: $184,000
  Projected 12-month spend: $218,000/month (+18.5%, aligned with traffic growth)
  Optimization opportunities:
    Reserved instances: $28,000/month savings (commit to 1-year RI for baseline capacity)
    Spot instances: $12,000/month savings (non-critical workloads on spot)
    Storage tiering: $8,000/month savings (move cold data to cheaper storage)
    Right-sizing: $15,000/month savings (downsize over-provisioned instances)
    Total potential savings: $63,000/month (28.9% reduction from projected spend)
```
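
The growth assumptions above can be turned into a simple headroom check. The sketch below compounds an annual growth rate monthly and estimates when a resource crosses a utilization threshold. It assumes capacity stays fixed and utilization scales in proportion to traffic, which is a simplification; the function names are hypothetical, and it ignores the seasonal peaks that the forecast table accounts for.

```python
import math


def monthly_growth_factor(annual_growth: float) -> float:
    """Monthly multiplier that compounds to the given annual growth rate."""
    return (1.0 + annual_growth) ** (1.0 / 12.0)


def projected_utilization(current: float, annual_growth: float, months: int) -> float:
    """Utilization after `months`, assuming fixed capacity and steady growth."""
    return current * monthly_growth_factor(annual_growth) ** months


def months_until_threshold(current: float, annual_growth: float, threshold: float) -> float:
    """Months until utilization reaches `threshold` (0.0 if already there)."""
    if current >= threshold:
        return 0.0
    if annual_growth <= 0:
        return math.inf
    return 12.0 * math.log(threshold / current) / math.log(1.0 + annual_growth)


if __name__ == "__main__":
    # Example: CPU at 62% today, 15% annual traffic growth, 75% risk threshold.
    print(f"In 12 months: {projected_utilization(0.62, 0.15, 12):.1%}")
    print(f"Months to 75%: {months_until_threshold(0.62, 0.15, 0.75):.1f}")
```

Running the same check against peak rather than average utilization gives an earlier, more conservative date for adding capacity.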

## Integration Points

- SRE tooling: Google SRE Books (framework), PagerDuty (incident management), Opsgenie (alerting)
- Monitoring: Datadog, Prometheus/Grafana, New Relic, Dynatrace, VictoriaMetrics
- Incident management: PagerDuty, Opsgenie, Jira Service Management, ServiceNow, FireHydrant, Incident.io
- Communication: Slack (incident channels), Microsoft Teams, Discord (internal)
- Runbooks: Runbooks.io, Notion, Confluence, Google Docs, internal wiki
- Capacity planning: CloudHealth, Spot.io, cast.ai, Kubecost, Prometheus + custom forecasting
- Error tracking: Sentry, Rollbar, Bugsnag, Honeycomb
- CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD, Spinnaker (for safe deployments)
- Chaos engineering: Chaos Monkey, Gremlin, Litmus, Pumba (for resilience testing)
- Configuration management: Terraform, Ansible, Pulumi (for infrastructure automation)
- Secret management: HashiCorp Vault, AWS Secrets Manager, Azure Key Vault
- Observability: OpenTelemetry, Jaeger, Zipkin, Loki, Tempo

## Edge Cases

- **Error budget exhausted mid-month**: Service consumed its full error budget by week 3 of the month. Response: (1) activate error budget policy (freeze non-critical changes), (2) war room if SEV1 incidents ongoing (focus on stability, not features), (3) root cause analysis (why was the budget consumed: incident pattern, slow degradation, or one large event?), (4) temporary SLO adjustment (if business-approved, for planned risk), (5) SLO review (was the budget too tight for the service's complexity?).

- **Competing SRE priorities**: Development team wants new feature, SRE team wants reliability work, business wants cost reduction. Resolution: (1) error budget as decision framework (if budget healthy → features OK, if budget low → reliability first), (2) SRE capacity allocation (for example 50% reliability, 30% projects, 20% incidents; Google's guidance caps operational work at 50% of SRE time), (3) quarterly planning (align SRE roadmap with business priorities), (4) reliability as enabler (frame reliability work as "enabling faster feature delivery").

- **On-call burnout**: SRE engineer on-call too frequently, experiencing alert fatigue and stress. Resolution: (1) fair rotation (equal distribution, account for PTO, holidays), (2) alert quality (reduce noise — only actionable alerts page on-call), (3) follow-the-sun model (global team, share on-call by timezone), (4) compensation (on-call pay differential, compensatory time off), (5) mandatory cooldown (minimum 2 weeks between on-call shifts).

- **Blameless culture vs accountability**: Postmortem is blameless, but same person keeps causing incidents through risky changes. Resolution: (1) blameless ≠ consequence-free (focus on systemic fixes, not individual punishment), (2) patterns matter (one incident = learning opportunity, repeated pattern = process issue), (3) change management (require peer review for risky changes, not blame but prevention), (4) training and mentoring (support engineers who struggle with reliability practices), (5) psychological safety (engineers must feel safe reporting mistakes).

- **SLO too strict for immature service**: New service has same SLO as mature service (99.9%), but immature service can't achieve it yet. Resolution: (1) progressive SLO targets (start at 99.0%, increase by 0.1% quarterly until 99.9%), (2) SLO based on user impact (new feature with 5% of traffic gets lower SLO than core feature), (3) SLO review process (quarterly review, adjust based on service maturity), (4) error budget pooling (group related services, share budget — reduces per-service pressure), (5) communication (explain SLO progression to stakeholders, manage expectations).

- **Capacity planning accuracy**: Traffic grows 50% faster than forecasted (viral event, marketing campaign). Resolution: (1) auto-scaling as safety net (scale automatically before capacity reached), (2) real-time capacity monitoring (alert at 75% utilization, not wait for forecast), (3) emergency provisioning (pre-approved budget for emergency scaling), (4) traffic shedding (graceful degradation — disable non-critical features under load), (5) post-event review (update forecasting model, incorporate viral event patterns).

- **Multi-service cascade failure**: Single service failure cascades to 8 dependent services. Resolution: (1) circuit breakers (each service has circuit breaker for dependencies), (2) bulkhead pattern (isolate failures — one service failure doesn't exhaust all resources), (3) timeout configuration (short timeouts prevent cascade — fail fast), (4) dependency map (know which services depend on which — plan for failure), (5) chaos engineering (test cascade scenarios regularly, validate resilience).

- **Reliability vs speed tension**: Engineering wants to deploy fast (daily releases), SRE wants reliability (slow, careful deployments). Resolution: (1) automated reliability checks in CI/CD (fast AND reliable — automation resolves tension), (2) feature flags (deploy code without enabling — separate deployment from release), (3) canary deployments (gradual rollout — fast deployment, slow exposure), (4) error budget as shared metric (both teams optimize for same metric), (5) trunk-based development (small, frequent changes — less risk per change).

- **Legacy system with no SLOs**: Legacy monolith has never had SLOs defined, no monitoring, unknown reliability. Resolution: (1) baseline measurement (measure current reliability before setting targets), (2) start with availability SLO only (simplest metric, build from there), (3) add monitoring incrementally (APM agent → error tracking → distributed tracing), (4) progressive SLO definition (start generous, tighten quarterly), (5) modernization roadmap (SLOs drive migration decisions — if reliability cost > migration cost, migrate).

- **Executive pressure to ignore error budget**: Leadership wants feature release, but error budget policy says freeze. Resolution: (1) data-driven conversation (show error budget consumption, SLO breach risk, customer impact), (2) risk acceptance process (executive can formally accept risk, documented), (3) alternative paths (release to subset of users, behind feature flag, with enhanced monitoring), (4) SLO alignment (ensure SLOs reflect business priorities — if execs disagree with SLO, adjust SLO), (5) escalation framework (SRE has authority to block, but clear path for executive override with accountability).
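
The circuit breaker named in the cascade-failure case above can be sketched as a small wrapper around calls to a dependency. This is a minimal, single-threaded illustration rather than a production implementation: the class name, the default threshold of 5 consecutive failures, and the 30-second reset timeout are all assumptions, and a real service would more likely use an established resilience library or a service mesh feature.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"  # timeout elapsed: allow one trial call
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            raise CircuitOpenError("dependency unavailable, failing fast")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call or too many consecutive failures opens the circuit.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(lambda: fetch_inventory(item_id))` where `fetch_inventory` is a hypothetical client function, and treat `CircuitOpenError` as a signal to serve a fallback instead of waiting on a dependency that is known to be failing.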
