---
name: performance-tuning
description: Optimize system performance including application performance monitoring, bottleneck identification, capacity planning, resource optimization, auto-scaling configuration, performance benchmarking, and SLA management. Use when tuning application performance, identifying bottlenecks, planning capacity, optimizing resource utilization, or managing performance SLAs. Triggers on phrases like "performance tuning", "APM", "application performance", "bottleneck", "capacity planning", "auto-scaling", "resource optimization", "benchmark", "SLA performance", "latency", "throughput", "response time", "P95", "P99", "performance degradation", "load testing", "stress testing", "profiling", "memory leak", "CPU spike", "connection pool".
---

# Performance Tuning

Proactively monitor, analyze, and optimize application and infrastructure performance to meet SLAs and user expectations.

## Application Performance Management

### APM & Monitoring

```
APPLICATION PERFORMANCE MANAGEMENT:
══════════════════════════════════

APM PLATFORM: Datadog APM + New Relic (legacy, migrating) + custom dashboards
  Coverage: 100% of production applications (28 apps)
  Instrumentation: OpenTelemetry (standard) + vendor agent (Datadog)
  Trace sampling: 10% (standard), 100% (error, P99 slow)
  Retention: 15 days (hot), 90 days (cold), 1 year (summary)

APPLICATION INVENTORY:
  ┌──────────────────────────┬──────────┬──────────┬──────────┬──────────┐
  │ Application              │ Tier     │ SLA      │ P95      │ P99      │
  │                          │          │ Target   │ Actual   │ Actual   │
  ├──────────────────────────┼──────────┼──────────┼──────────┼──────────┤
  │ Customer API             │ 1        │ <200ms   │ 120ms    │ 180ms    │
  │ Web portal               │ 1        │ <500ms   │ 310ms    │ 450ms    │
  │ Payment service          │ 1        │ <150ms   │ 95ms     │ 140ms    │
  │ Auth service (SSO)       │ 1        │ <100ms   │ 65ms     │ 90ms     │
  │ Internal web app         │ 2        │ <1s      │ 420ms    │ 850ms    │
  │ CRM integration          │ 2        │ <1s      │ 680ms    │ 920ms    │
  │ HR system                │ 2        │ <2s      │ 890ms    │ 1.7s     │
  │ Reporting/BI             │ 3        │ <5s      │ 2.1s     │ 4.2s     │
  │ Background jobs          │ 3        │ N/A      │ N/A      │ N/A      │
  ├──────────────────────────┼──────────┼──────────┼──────────┼──────────┤
  │ Total                    │ 28 apps  │ All met  │ All met  │ All met  │
  └──────────────────────────┴──────────┴──────────┴──────────┴──────────┘

  SLA compliance (January 2025):
    Tier 1 (critical): 99.95% uptime (target: 99.9%) ✓
    Tier 2 (important): 99.9% uptime (target: 99.5%) ✓
    Tier 3 (standard): 99.5% uptime (target: 99%) ✓
    Overall: 99.92% (target: 99.9%) ✓
  
  Performance budget (per application):
    Error rate: <0.1% (tier 1), <0.5% (tier 2), <1% (tier 3)
    P95 latency: Per SLA (above table)
    P99 latency: 2× P95 (max.)
    Apdex score: >0.9 ((satisfied + tolerating/2) / total; tier 1)
  
  January 2025 performance:
    Error rate: 0.03% (tier 1), 0.15% (tier 2), 0.4% (tier 3) — all within budget ✓
    Apdex score: 0.94 (tier 1) — within budget ✓
    Slow requests (>SLA): 0.02% — within budget ✓

PERFORMANCE BENCHMARKING:
  Baseline (established Q3 2024):
    ┌──────────────────────────┬──────────┬──────────┬──────────┐
    │ Metric                   │ Q3 2024  │ Q4 2024  │ Q1 2025  │
    ├──────────────────────────┼──────────┼──────────┼──────────┤
    │ Avg. response time       │ 145ms    │ 125ms    │ 118ms    │
    │ P95 response time        │ 280ms    │ 245ms    │ 220ms    │
    │ P99 response time        │ 420ms    │ 365ms    │ 310ms    │
    │ Error rate               │ 0.12%    │ 0.05%    │ 0.03%    │
    │ Throughput (req/sec)     │ 1,200    │ 1,450    │ 1,680    │
    │ Apdex                    │ 0.91     │ 0.93     │ 0.94     │
    ├──────────────────────────┼──────────┼──────────┼──────────┤
    │ Trend                    │ Baseline │ Improved │ Improving│
    └──────────────────────────┴──────────┴──────────┴──────────┘
  
  Load testing:
    Tool: k6 (load test) + Locust (stress test)
    Frequency: Pre-release (every major release) + quarterly (baseline)
    Last test: January 2025 (pre-release v2.8)
    Result: All SLAs met (up to 2× expected traffic)
    Next test: March 2025 (quarterly baseline)
  
  Stress testing:
    Frequency: Annual (full system — identify breaking point)
    Last test: November 2024
    Breaking point: 4× expected traffic (graceful degradation — not crash)
    Recovery: Auto-scale + circuit breaker (self-healing — 2 min)

PERFORMANCE BUDGET TRACKING:
  Budget burn:
    Error budget: 0.1% (tier 1) → consumed: 0.03% (30% of budget used) ✓
    Latency budget: <200ms P95 → actual: 120ms P95 (60% of budget used) ✓
    Availability budget: 99.9% → actual: 99.95% (exceeded — budget remaining) ✓
  
  Alert thresholds (tier 1 error budget; P95 values apply to the Customer API, SLA <200ms):
    Error rate >0.05%: Warning (investigate)
    Error rate >0.08%: Critical (action required — budget burn rate high)
    Error rate >0.1%: Emergency (budget exhausted — rollback if release)
    P95 >150ms: Warning (approaching SLA)
    P95 >180ms: Critical (near SLA breach)
    P95 >200ms: Emergency (SLA breach — immediate action)
```
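
The budget arithmetic above can be sketched in a few lines of Python. This is a minimal illustration rather than the team's actual tooling: the function names (`apdex`, `budget_used`, `error_alert_level`) and the sample latencies are hypothetical, and the Apdex definition assumed here is the standard one (satisfied plus half of tolerating, divided by total, with the tolerating band running up to 4× the threshold).

```python
def apdex(latencies_ms, threshold_ms):
    """Apdex = (satisfied + tolerating / 2) / total.

    satisfied:  latency <= T
    tolerating: T < latency <= 4T
    everything slower counts as frustrated.
    """
    if not latencies_ms:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for x in latencies_ms if x <= threshold_ms)
    tolerating = sum(1 for x in latencies_ms if threshold_ms < x <= 4 * threshold_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms)


def budget_used(observed_error_rate, budget_error_rate):
    """Fraction of the error budget consumed (1.0 means exhausted)."""
    return observed_error_rate / budget_error_rate


def error_alert_level(error_rate_pct):
    """Map a tier 1 error rate (in percent) to the alert levels listed above."""
    if error_rate_pct > 0.1:
        return "emergency"
    if error_rate_pct > 0.08:
        return "critical"
    if error_rate_pct > 0.05:
        return "warning"
    return "ok"


# Hypothetical samples: two satisfied, one tolerating, one frustrated.
print(apdex([100, 150, 300, 900], threshold_ms=200))  # 0.625
# January tier 1 figures from the block above: 0.03% observed vs. 0.1% budget.
print(f"{budget_used(0.03, 0.1):.0%}")  # 30%
print(error_alert_level(0.03))  # ok
```

The same ratio gives the latency figure quoted above: 120ms against a 200ms P95 target is 60% of the budget.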

## Bottleneck Analysis & Resolution

### Root Cause & Optimization

```
BOTTLENECK ANALYSIS:
════════════════════

MONITORING STACK:
  APM: Datadog APM (distributed tracing, flame graphs, code-level profiling)
  Infrastructure: Datadog (CPU, memory, disk, network — host + container)
  Logs: Datadog Logs + ELK (error log, access log, application log)
  Synthetics: Datadog Synthetics (24/7 uptime check + API monitoring)
  Real User Monitoring: Datadog RUM (frontend performance, user experience)
  Database: pg_stat_statements (PostgreSQL), Performance Schema (MySQL)
  Network: Datadog Network Monitor + NetFlow + DNS monitor
  
  Alerting:
    PagerDuty: Critical (P1, P2 — immediate page)
    Datadog: Warning (P3 — Slack notification)
    Escalation: 15 min (no acknowledgment → escalate to team lead → manager)
    On-call: 24/7 (rotating — tier 1 apps), business hours (tier 2-3)

RECENT BOTTLENECK RESOLUTIONS (January 2025):
  ┌────────────────────────────┬──────────────────┬────────────────────┬──────────────────┐
  │ Issue                      │ Root cause       │ Fix                │ Result           │
  ├────────────────────────────┼──────────────────┼────────────────────┼──────────────────┤
  │ API latency spike (Jan 5)  │ N+1 query        │ Eager load         │ P95: 280→120ms   │
  │                            │                  │ + index            │ (57%↓)           │
  ├────────────────────────────┼──────────────────┼────────────────────┼──────────────────┤
  │ Memory leak (Jan 12)       │ Unbounded cache  │ TTL + max size     │ Memory: 8GB→4GB  │
  │ (auth service)             │                  │                    │ (50%↓)           │
  ├────────────────────────────┼──────────────────┼────────────────────┼──────────────────┤
  │ Connection pool            │ Pool too small   │ Resize 20→50       │ Conn wait:       │
  │ exhaustion (Jan 18)        │                  │                    │ 45ms→2ms (96%↓)  │
  ├────────────────────────────┼──────────────────┼────────────────────┼──────────────────┤
  │ CPU spike (Jan 22)         │ Regex compiled   │ Precompile regex   │ CPU: 85→42%      │
  │ (reporting service)        │ in loop          │                    │ (51%↓)           │
  ├────────────────────────────┼──────────────────┼────────────────────┼──────────────────┤
  │ Disk I/O saturation        │ Log file         │ Rotate + compress  │ I/O: 95→35%      │
  │ (Jan 28)                   │ too large        │ + offload          │ (63%↓)           │
  └────────────────────────────┴──────────────────┴────────────────────┴──────────────────┘

  Total incidents: 5 (January)
  Mean Time to Detect (MTTD): 8 minutes (avg.)
  Mean Time to Resolve (MTTR): 45 minutes (avg.)
  Repeat incidents: 0 (root cause fully resolved)

COMMON BOTTLENECK PATTERNS:
  1. Database (40% of incidents):
     - Slow queries (missing index, suboptimal plan)
     - Lock contention (deadlock, long transaction)
     - Connection pool exhaustion (too small, leak)
     - Table bloat (no vacuum, no partition)
     - Replication lag (read replica behind primary)
  
  2. Application (30% of incidents):
     - N+1 queries (eager load missing)
     - Memory leak (cache unbounded, object not released)
     - CPU-intensive operation (regex, JSON parse, encryption)
     - Thread pool exhaustion (async not used, blocking I/O)
     - Serialization bottleneck (single thread, no parallelism)
  
  3. Infrastructure (20% of incidents):
     - Auto-scale slow (scale-up threshold too high)
     - Network latency (cross-region, cross-AZ)
     - Disk I/O saturation (log, temp file, swap)
     - Container OOM (memory limit too low)
     - DNS resolution slow (cache miss, DNS provider issue)
  
  4. External (10% of incidents):
     - 3rd party API slow (rate limit, downtime)
     - CDN cache miss (TTL too short, invalidation)
     - Payment gateway timeout (network, provider issue)
     - DNS provider outage (failover)
     - Certificate expiry (auto-renewal failed)

PERFORMANCE OPTIMIZATION TECHNIQUES:
  Code-level:
    - Algorithm optimization (O(n²) → O(n log n))
    - Caching (Redis — session, data, API response)
    - Lazy loading (defer non-critical computation)
    - Async processing (queue — background job)
    - Connection pooling (database, HTTP client)
    - Batch operations (bulk insert, bulk update)
  
  Infrastructure-level:
    - Auto-scaling (CPU, memory, custom metric)
    - CDN (static content, API cache, image optimization)
    - Load balancing (round-robin, least connection)
    - Read replicas (offload read queries)
    - Database optimization (index, partition, materialized view)
    - Container optimization (resource limit, HPA)
  
  Architecture-level:
    - Microservice decomposition (monolith → service)
    - CQRS (separate read/write — scale independently)
    - Event-driven (async — event bus, message queue)
    - Edge computing (CDN, edge function — reduce latency)
    - Data locality (same AZ/region — reduce network)
```
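
The "TTL + max size" fix for the unbounded cache can be illustrated with a small bounded cache. This is a sketch under stated assumptions, not the auth service's actual code: the class name `BoundedTTLCache`, its parameters, and the least-recently-used eviction policy are all hypothetical choices made for the example.

```python
import time
from collections import OrderedDict


class BoundedTTLCache:
    """LRU cache with a maximum size and a per-entry time to live."""

    def __init__(self, max_size, ttl_seconds, clock=time.monotonic):
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: drop it so memory is reclaimed
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max_size:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __len__(self):
        return len(self._data)


# Hypothetical usage: at most 2 entries, each valid for 60 seconds.
cache = BoundedTTLCache(max_size=2, ttl_seconds=60)
cache.put("session:a", {"user": 1})
cache.put("session:b", {"user": 2})
cache.put("session:c", {"user": 3})  # evicts "session:a"
print(len(cache))  # 2
```

The size cap bounds memory even when keys never repeat, while the TTL keeps stale entries from being served. Expired entries are only removed when they are read or evicted, so the cap is what guarantees the memory bound.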

## Capacity Planning

### Growth & Scaling

```
CAPACITY PLANNING:
══════════════════

CURRENT CAPACITY:
  Compute:
    EC2 instances: 120 (on-demand: 80, reserved: 30, spot: 10)
    K8s nodes: 35 (on-demand: 20, spot: 15)
    Azure VMs: 30 (on-demand: 25, reserved: 5)
    Total vCPUs: 480 (avg. utilization: 42%)
    Total RAM: 1.9 TB (avg. utilization: 55%)
  
  Storage:
    EBS: 45 TB (avg. utilization: 60%)
    EFS: 15 TB (avg. utilization: 40%)
    Azure Disk: 12 TB (avg. utilization: 50%)
    Azure Blob: 8 TB (avg. utilization: 35%)
  
  Network:
    Bandwidth: 1 Gbps (avg. utilization: 55%, peak: 78%)
    API rate limit: 10,000 req/sec (avg. usage: 680 req/sec)
    Load balancer: 20,000 conn/sec (avg. usage: 4,500)
  
  Database:
    RDS: 12 instances (avg. CPU: 38%, IOPS: 42%)
    ElastiCache (Redis): 4 instances (avg. memory: 55%)
    Azure SQL: 6 instances (avg. DTU: 45%)

GROWTH PROJECTION:
  ┌──────────────────────────┬──────────┬──────────┬──────────┐
  │ Metric                   │ Current  │ 6 months │ 12 months│
  ├──────────────────────────┼──────────┼──────────┼──────────┤
  │ Users                    │ 485      │ 580      │ 700      │
  │ API calls/day            │ 8.5M     │ 12.75M   │ 17M      │
  │ Storage                  │ 72 TB    │ 90 TB    │ 110 TB   │
  │ Compute (vCPU)           │ 480      │ 600      │ 750      │
  │ Database (IOPS)          │ 8,500    │ 12,750   │ 17,000   │
  │ Bandwidth                │ 550 Mbps │ 700 Mbps │ 900 Mbps │
  ├──────────────────────────┼──────────┼──────────┼──────────┤
  │ User growth (cumulative) │          │ +20%     │ +45%     │
  └──────────────────────────┴──────────┴──────────┴──────────┘

  Scaling triggers:
    Compute: CPU >70% (15 min sustained) → auto-scale (add node)
    Storage: >80% (alert) → >85% (expand) → >90% (emergency)
    Database: CPU >70% → read replica; IOPS >80% → resize
    Bandwidth: >75% (alert) → >85% (expand) → >90% (emergency)
    Memory: >75% (alert) → >85% (scale/expand)
  
  Auto-scaling configuration:
    K8s HPA: CPU >60% (scale up), CPU <30% (scale down)
    AWS ASG: CPU >65% (scale up), CPU <35% (scale down)
    Azure VMScaleSet: CPU >60% (scale up), CPU <30% (scale down)
    Scale-up time: <2 minutes (K8s), <3 minutes (EC2 ASG)
    Scale-down time: <5 minutes (cool-down: 10 minutes)
    Min instances: 2 (HA), Max instances: 20 (per service)
  
  Capacity budget (2025):
    Compute: $300K (projected — auto-scale may increase 10-20%)
    Storage: $60K (projected — growth 45%, lifecycle to reduce)
    Network: $48K (projected — expansion Q2: +$24K)
    Database: $120K (projected — resize at trigger)
    Total: $528K (projected — vs. budget $540K — under)

PERFORMANCE REVIEW:
  Cadence: Monthly (team), Quarterly (executive)
  
  Metrics reviewed:
    SLA compliance (uptime, latency, error rate)
    Performance budget burn (remaining budget)
    Capacity utilization (current + projected)
    Cost per transaction (efficiency)
    User experience (RUM score, Apdex)
    Incident trends (frequency, MTTR, root cause)
  
  Last review (January 2025):
    All SLAs met ✓
    Performance budget: 70% remaining ✓
    Capacity: 42% CPU, 55% memory, 60% storage (healthy)
    Cost/transaction: $0.002 (improving — 15% reduction QoQ)
    User experience: Apdex 0.94 (good)
    Incidents: 5 (January) — all resolved, 0 repeat
```
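
The scaling rules above can be sketched as follows. The replica calculation uses the core formula documented for the Kubernetes HPA, `ceil(currentReplicas × currentMetric / targetMetric)`, and assumes the 60% figure is configured as the HPA target utilization. The real controller also applies a tolerance band and stabilization windows, which this sketch omits. The function names and the 1 Gbps link capacity argument are illustrative.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct=60,
                     min_replicas=2, max_replicas=20):
    """Core HPA formula, clamped to the min/max instance counts listed above."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


def utilization_level(used, capacity):
    """Classify utilization with the storage/bandwidth triggers listed above."""
    pct = 100 * used / capacity
    if pct > 90:
        return "emergency"
    if pct > 85:
        return "expand"
    if pct > 75:
        return "alert"
    return "ok"


print(desired_replicas(4, 90))    # 6: scale up from 4 pods at 90% CPU
print(desired_replicas(4, 20))    # 2: formula gives 2, which is also the minimum
print(desired_replicas(10, 150))  # 20: formula gives 25, capped at the maximum

# Projected bandwidth from the growth table against the current 1 Gbps link.
for label, mbps in [("current", 550), ("6 months", 700), ("12 months", 900)]:
    print(label, utilization_level(mbps, 1000))
```

On these numbers the 12-month projection (900 Mbps) lands in the "expand" band, and the current 78% peak already sits in the "alert" band, which is consistent with the planned Q2 bandwidth expansion.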

## Output

### Performance Tuning Dashboard

```
PERFORMANCE DASHBOARD — Jan 2025
══════════════════════════════

Application Performance:
  SLA compliance: 99.92% (target: 99.9%) ✓
  Avg: 118ms | P95: 220ms | P99: 310ms | Error rate: 0.03%
  Apdex: 0.94 (satisfied)
  Throughput: 1,680 req/sec (improving +40% since Q3)
  Trend: Improving (Q3→Q4→Q1: all metrics better)

Bottleneck Resolution:
  Incidents: 5 (January) — all resolved
  MTTD: 8 min | MTTR: 45 min
  Repeat: 0 (root cause fully addressed)
  Top causes: Database (40%), Application (30%), Infra (20%)

Capacity:
  Compute: 42% CPU, 55% memory (healthy)
  Storage: 60% EBS utilization (45 TB provisioned); total 72 TB, projected: 110 TB by 12mo
  Database: 38% CPU, 42% IOPS (healthy)
  Bandwidth: 55% avg., 78% peak (expand Q2)
  User growth: +20% (6mo), +45% (12mo); scaling triggers configured

Auto-Scaling:
  K8s: HPA active (min: 2, max: 20)
  EC2: ASG active (scale-up <3 min)
  Azure: VMScaleSet (scale-up <2 min)
  Cool-down: 10 min (prevent flapping)

Cost:
  Cost/transaction: $0.002 (15% reduction QoQ)
  Annual budget: $540K (projected: $528K — under)
  Reserved instances: 35 (30 EC2 + 5 Azure; savings: ~30%)

Actions:
  1. Quarterly load test (March — baseline)
  2. Bandwidth expansion (1→2 Gbps — Q2)
  3. Reserved instance purchase (Q2 — 3-year)
  4. Performance review (monthly — February)
  5. Stress test (annual — November)
```

## Integration Points

- APM (Datadog, New Relic, AppDynamics): Tracing, metrics, profiling
- Infrastructure monitoring (Datadog, Prometheus, Grafana): Host metrics
- Log management (Datadog, ELK, Splunk): Error analysis, correlation
- Synthetic monitoring (Datadog, Pingdom, Uptime Robot): 24/7 check
- Load testing (k6, Locust, JMeter): Performance validation
- Auto-scaling (K8s HPA, AWS ASG, Azure VMScaleSet): Dynamic scaling
- CDN (CloudFront, Cloudflare): Edge caching, performance
- Database (RDS, ElastiCache, Azure SQL): Query optimization
- CI/CD (GitHub Actions, GitLab CI): Performance gate, pre-release test
- Incident management (PagerDuty, Opsgenie): Alert, escalation
- CMDB (ServiceNow): Asset inventory, capacity baseline
- Communication (Slack, Teams): Alert notification, status update
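
A CI/CD performance gate of the kind listed above could look roughly like this. It is a sketch under assumptions: the application keys, the `gate` function, and the 90% headroom factor are hypothetical (the factor mirrors the "P95 >180ms: Critical" threshold for a 200ms SLA), and the measured values would come from whatever load-test summary the pipeline produces.

```python
# SLA P95 targets in milliseconds, copied from the application inventory table.
# The dictionary keys are hypothetical service identifiers.
SLA_P95_MS = {
    "customer-api": 200,
    "web-portal": 500,
    "payment-service": 150,
    "auth-service": 100,
}


def gate(measured_p95_ms, sla_p95_ms=SLA_P95_MS, headroom=0.9):
    """Return a list of violations; an empty list means the gate passes.

    With headroom=0.9 a release fails once P95 exceeds 90% of the SLA target.
    """
    violations = []
    for app, limit in sla_p95_ms.items():
        measured = measured_p95_ms.get(app)
        if measured is None:
            violations.append(f"{app}: no measurement")
        elif measured > limit * headroom:
            violations.append(f"{app}: P95 {measured}ms > {limit * headroom:.0f}ms")
    return violations


# Hypothetical load-test result in which the Customer API regressed to 190ms.
results = {"customer-api": 190, "web-portal": 310,
           "payment-service": 95, "auth-service": 65}
for problem in gate(results):
    print(problem)  # customer-api: P95 190ms > 180ms
```

In a pipeline step, a non-empty result would typically be turned into a non-zero exit code so the release is blocked before deployment.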

## Edge Cases

- **Performance degradation (sudden)**: APM analysis; recent deploy rollback; scaling; circuit breaker; monitoring
- **Memory leak (gradual)**: Profiling; heap dump analysis; fix; redeploy; monitoring alert
- **Auto-scale failure (no scale-up)**: ASG health check; launch template; quota; manual scale; root cause
- **Database performance (slow query spike)**: Query plan; index; lock; replication lag; cache
- **External API timeout (3rd party)**: Circuit breaker; retry; fallback; vendor status; communication
- **DDoS (performance impact)**: WAF + Shield; rate limiting; CDN; IP block; provider escalation
- **Capacity exhaustion (sudden growth)**: Emergency scale; cost override; communication; planning update
- **Performance regression (post-release)**: Rollback; APM diff; code review; hotfix; monitoring
- **Certificate expiry (SSL/TLS)**: Auto-renewal fix; manual renewal; monitoring alert (30-day)
- **Cross-region latency (user complaint)**: Edge deployment; CDN; DNS routing; data locality; analysis
