IT AI Skill

Performance Engineering

Conduct performance engineering including load testing, stress testing, bottleneck analysis, capacity planning, and performance optimization for applications, databases, and infrastructure. Use when planning performance tests, analyzing bottlenecks, optimizing application performance, or conducting capacity planning. Triggers on phrases like "performance engineering", "load testing", "stress testing", "capacity planning", "bottleneck analysis", "performance optimization", "response time", "throughput", "concurrency", "APM", "profiling", "JMeter", "k6", "Gatling", "load generator", "performance baseline", "scaling test", "soak test", "spike test".

Workflow

1. Performance Testing Strategy

PERFORMANCE TESTING TYPES
═══════════════════════════════════════

Test Type          Purpose                    Duration     Load Pattern    Metrics
───────────────────────────────────────────────────────────────────────────────
Baseline           Establish reference        30 min       Single user     Response time per endpoint
Load               Expected production load   1 hour       Ramp → steady   P95/P99, throughput
Stress             Beyond expected load       2 hours      Ramp to break   Breaking point, degradation
Endurance (Soak)   Long-term stability        24-72 hours  Steady          Memory leak, resource exhaustion
Spike              Sudden load surge          30 min       Normal → spike  Recovery time, errors
Scalability        Auto-scaling validation    1 hour       Progressive     Scale-up/down time
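
For the soak test row, a steadily rising memory trend is the usual leak signal. The following is a minimal sketch, assuming memory is sampled at equal intervals during the run; it fits a least-squares line and returns the slope. The function name `memorySlope` and the sample values are illustrative and do not come from any specific tool.

```javascript
// Least-squares slope of memory readings taken at equal intervals.
// The result is in "units per sample interval" (for example MB per hour).
function memorySlope(samples) {
  const n = samples.length;
  if (n < 2) return 0; // not enough points to fit a line
  const meanX = (n - 1) / 2; // x values are the sample indexes 0..n-1
  const meanY = samples.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - meanX) * (samples[i] - meanY);
    den += (i - meanX) ** 2;
  }
  return num / den;
}

// Hypothetical hourly heap readings in MB.
const hourlyHeapMb = [512, 518, 524, 530, 536];
console.log(memorySlope(hourlyHeapMb)); // 6 (MB per hour)
```

A slope that stays positive across the whole 24-72 hour window, rather than flattening after warm-up, is worth following up with heap dumps.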

PERFORMANCE TARGETS (SLA):
═══════════════════════════════════════

Endpoint              P50     P95     P99     Throughput  Error Rate
───────────────────────────────────────────────────────────────────────
API: GET /products    50ms    150ms   250ms   1000 RPS    < 0.1%
API: POST /orders     100ms   300ms   500ms   500 RPS     < 0.1%
Web: Homepage         200ms   500ms   800ms   2000 RPS    < 0.1%
Web: Checkout         300ms   800ms   1200ms  200 RPS     < 0.05%
DB: Query             10ms    50ms    100ms   5000 QPS    < 0.01%
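
The P50/P95/P99 columns are percentiles over the measured response times. As a rough sketch of how raw samples could be checked against one row of the table, the code below uses the nearest-rank percentile definition (one of several common definitions; tools such as k6 may interpolate differently). The names `percentile` and `checkTargets` are hypothetical.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// targets: limits in milliseconds keyed by percentile, e.g. { p50: 50, p95: 150 }.
function checkTargets(samples, targets) {
  return Object.entries(targets).map(([name, limit]) => {
    const value = percentile(samples, Number(name.slice(1)));
    return { name, value, limit, pass: value <= limit };
  });
}

// Hypothetical latencies (ms) checked against the GET /products row.
const latenciesMs = [32, 41, 38, 47, 120, 44, 36, 52, 49, 40];
console.log(checkTargets(latenciesMs, { p50: 50, p95: 150, p99: 250 }));
```

Percentiles are preferred over averages here because a small number of slow requests can hide behind a healthy mean.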

LOAD MODEL:
═══════════════════════════════════════

  Production baseline:
    → 10,000 concurrent users (peak)
    → 500 requests/second (avg)
    → 1,200 requests/second (peak)
    → 200,000 daily transactions

  Test scenarios:
    → Normal: 1x production load (baseline validation)
    → Peak: 2x production load (capacity validation)
    → Black Friday: 5x production load (stress test)
    → Soak: 1x production for 72 hours (stability)
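
To turn a target request rate into a virtual-user count for a closed-model test, Little's Law gives: concurrent users = throughput × (response time + think time). The sketch below assumes the scenario multipliers apply to the 1,200 requests/second peak; the 300 ms response time and 1 s think time are hypothetical inputs, and `requiredVirtualUsers` is an illustrative name.

```javascript
// Little's Law for a closed workload: N = X * (R + Z), with R and Z in ms.
function requiredVirtualUsers(targetRps, avgResponseMs, thinkTimeMs) {
  return Math.ceil((targetRps * (avgResponseMs + thinkTimeMs)) / 1000);
}

const peakRps = 1200; // from the production baseline above
const multipliers = { normal: 1, peak: 2, blackFriday: 5 };

for (const [scenario, factor] of Object.entries(multipliers)) {
  const rps = peakRps * factor;
  // Assumed: 300 ms average response time, 1000 ms think time per iteration.
  console.log(scenario, rps, requiredVirtualUsers(rps, 300, 1000));
}
```

If response times grow under load, the same number of virtual users produces less throughput, so the estimate should be revisited after the first run.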

2. Load Testing Execution

LOAD TESTING EXECUTION (k6 / JMeter / Gatling)
═══════════════════════════════════════

Test Script (k6):
═══════════════════════════════════════

  import http from 'k6/http';
  import { check, sleep } from 'k6';

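  export const options = {
    // Ramping virtual users (VUs): each VU runs one iteration at a time.
    stages: [
      { duration: '5m', target: 100 },   // Ramp up
      { duration: '15m', target: 100 },  // Steady state
      { duration: '5m', target: 300 },   // Ramp to 3x (spike)
      { duration: '5m', target: 300 },   // Hold spike
      { duration: '5m', target: 0 },     // Ramp down
    ],
    thresholds: {
      // Latency and error limits follow the GET /products targets above.
      http_req_duration: ['p(95)<150', 'p(99)<250'],
      http_req_failed: ['rate<0.001'],   // 0.1%
      // With at most 300 VUs and sleep(1), this script cannot reach 500 RPS;
      // use an arrival-rate executor to drive a fixed request rate.
      http_reqs: ['rate>100'],
    },
  };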

  export default function () {
    const res = http.get('https://api.example.com/products');
    check(res, {
      'status is 200': (r) => r.status === 200,
      'response time < 500ms': (r) => r.timings.duration < 500,
    });
    sleep(1);
  }

TEST ENVIRONMENT:
═══════════════════════════════════════

  → Isolated from production (no customer impact)
  → Mirrored infrastructure (same spec, less data)
  → Data: Anonymized production data (10% sample)
  → Network: Same architecture (LB, CDN, WAF)
  → Monitoring: Full APM (Datadog, New Relic, Prometheus)

  Load generators:
    → Cloud-based (k6 Cloud, BlazeMeter)
    → Multiple regions (geographic distribution)
    → 10+ generators (parallel execution)

3. Bottleneck Analysis

BOTTLENECK ANALYSIS FRAMEWORK
═══════════════════════════════════════

Layer-by-Layer Analysis:
═══════════════════════════════════════

1. Network Layer:
   → Bandwidth utilization (>80% = bottleneck)
   → Latency between tiers
   → DNS resolution time
   → SSL/TLS handshake time
   → CDN hit ratio

2. Application Server Layer:
   → CPU utilization (>70% sustained = bottleneck)
   → Memory usage (leaks, GC pressure)
   → Thread pool exhaustion
   → Connection pool exhaustion
   → Disk I/O (slow disk, queue depth)

3. Database Layer:
   → CPU (>70% = query optimization needed)
   → IOPS (>80% of limit = storage upgrade)
   → Connection pool (maxed out = increase pool)
   → Slow queries (>100ms = add index/rewrite)
   → Lock contention (deadlocks, blocking)
   → Buffer pool hit ratio (<99% = increase memory)

4. External Services:
   → Third-party API latency
   → Payment gateway timeout
   → Email service queue depth
   → Search engine response time
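
The numeric thresholds in the layer list can be encoded as simple rules and applied to a metrics snapshot. This is a minimal sketch: the metric key names (`networkBandwidthPct`, `appCpuPct`, and so on) and the `findBottlenecks` helper are hypothetical, and a real check would also require the condition to be sustained rather than a single reading.

```javascript
// One rule per numeric threshold named in the layer-by-layer list.
const RULES = [
  { layer: 'Network', metric: 'networkBandwidthPct', breached: (v) => v > 80 },
  { layer: 'Application', metric: 'appCpuPct', breached: (v) => v > 70 },
  { layer: 'Database', metric: 'dbCpuPct', breached: (v) => v > 70 },
  { layer: 'Database', metric: 'dbIopsPctOfLimit', breached: (v) => v > 80 },
  { layer: 'Database', metric: 'dbSlowestQueryMs', breached: (v) => v > 100 },
  { layer: 'Database', metric: 'dbBufferPoolHitPct', breached: (v) => v < 99 },
];

// Returns the rules that the snapshot violates; missing metrics are skipped.
function findBottlenecks(snapshot) {
  return RULES
    .filter((rule) => rule.metric in snapshot && rule.breached(snapshot[rule.metric]))
    .map((rule) => ({ layer: rule.layer, metric: rule.metric, value: snapshot[rule.metric] }));
}

// Hypothetical snapshot taken during a load test.
const snapshot = { networkBandwidthPct: 45, appCpuPct: 68, dbCpuPct: 82, dbBufferPoolHitPct: 97.5 };
console.log(findBottlenecks(snapshot));
```

Working through the layers in order helps avoid tuning the application when the database or network is the real constraint.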

PERFORMANCE PROFILING:
═══════════════════════════════════════

  Application Profiling:
    → Flame graphs (CPU profiling)
    → Memory profiling (heap dumps, GC logs)
    → APM traces (distributed tracing)
    → Database query profiling (EXPLAIN plans)
    → Thread dumps (active threads analysis)

  Tools:
    → APM: Datadog APM, New Relic, Dynatrace
    → Profiling: pprof, async-profiler, VisualVM
    → DB: pg_stat_statements, MySQL EXPLAIN, SQL Profiler
    → Network: tcpdump, Wireshark, eBPF

4. Capacity Planning

CAPACITY PLANNING
═══════════════════════════════════════

Current vs Projected:
═══════════════════════════════════════

  Metric              Current    Growth Rate   6-Month Projection  Threshold
  ─────────────────────────────────────────────────────────────────────────────
  Users               50K        +20%/quarter  72K                 200K (scale)
  RPS                 500        +20%/quarter  720                 2,000 (scale)
  Storage             10 TB      +15%/month    23 TB               50 TB (expand)
  Database rows       100M       +5%/month     134M                500M (shard)
  Bandwidth           50 Mbps    +10%/month    89 Mbps             1 Gbps (upgrade)
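
Each projection compounds the growth rate over the number of periods in the horizon: projected = current × (1 + rate)^periods, where six months is two quarters or six monthly periods. The sketch below applies that formula to the current values and growth rates in the table and also estimates how many periods remain before a threshold is reached. The function names are illustrative.

```javascript
// Compound growth: value after `periods` periods at `ratePerPeriod` (0.2 = 20%).
function project(current, ratePerPeriod, periods) {
  return current * Math.pow(1 + ratePerPeriod, periods);
}

// Number of whole periods until `threshold` is reached or exceeded.
function periodsUntil(current, ratePerPeriod, threshold) {
  if (current >= threshold) return 0;
  if (ratePerPeriod <= 0) return Infinity; // never reached without growth
  return Math.ceil(Math.log(threshold / current) / Math.log(1 + ratePerPeriod));
}

console.log(Math.round(project(50000, 0.2, 2)));  // users after 2 quarters
console.log(project(10, 0.15, 6).toFixed(1));     // storage in TB after 6 months
console.log(periodsUntil(50000, 0.2, 200000));    // quarters until 200K users
console.log(periodsUntil(10, 0.15, 50));          // months until 50 TB
```

Comparing the periods-until-threshold figure with procurement lead times shows which resource needs action first.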

Scaling Triggers:
═══════════════════════════════════════

  Horizontal (add instances):
    → CPU > 70% sustained for 5 minutes
    → Memory > 80% sustained for 5 minutes
    → Request queue > 100
    → Auto-scaling: Min 3, Max 20, Target 60% CPU

  Vertical (upgrade instance):
    → CPU consistently at 90%+
    → Memory insufficient (cannot scale out enough)
    → Disk IOPS at limit
    → Network bandwidth saturated

  Database Scaling:
    → Read replicas (read-heavy)
    → Sharding (data volume)
    → Partitioning (query performance)
    → Caching (Redis/Memcached, hot data)
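
The horizontal triggers combine a sustained-breach check with a target-tracking rule. A minimal sketch follows, assuming one CPU sample per minute and the Min 3 / Max 20 / Target 60% CPU settings listed above. The proportional formula resembles the one documented for target-tracking autoscalers such as the Kubernetes Horizontal Pod Autoscaler, but the function names and sample data here are hypothetical.

```javascript
// True when the most recent `windowSize` samples all exceed `threshold`.
function sustainedAbove(samples, threshold, windowSize) {
  if (samples.length < windowSize) return false;
  return samples.slice(-windowSize).every((v) => v > threshold);
}

// Proportional scaling toward the CPU target, clamped to the min/max bounds.
function desiredInstances(current, cpuPct, { targetCpuPct = 60, min = 3, max = 20 } = {}) {
  const proportional = Math.ceil((current * cpuPct) / targetCpuPct);
  return Math.min(max, Math.max(min, proportional));
}

// Hypothetical 1-minute CPU averages; the last five are all above 70%.
const cpuPerMinute = [62, 74, 78, 81, 85, 90];
if (sustainedAbove(cpuPerMinute, 70, 5)) {
  console.log(desiredInstances(5, 90)); // 8
}
```

Requiring the full window to breach avoids scaling on a single noisy sample, at the cost of reacting a few minutes later.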

CAPACITY REPORT:
═══════════════════════════════════════

  → Monthly capacity review
  → Budget forecast (next 6-12 months)
  → Procurement timeline (lead times)
  → Cost optimization opportunities
  → Right-sizing recommendations

5. Performance Optimization

PERFORMANCE OPTIMIZATION CHECKLIST
═══════════════════════════════════════

Frontend:
═══════════════════════════════════════

  → CDN for static assets (images, CSS, JS)
  → Lazy loading (below-the-fold content)
  → Image optimization (WebP, responsive sizes)
  → Code splitting (Webpack chunks)
  → Caching strategy (HTTP headers, service worker)
  → DNS prefetch + preconnect
  → Minify + compress (Gzip/Brotli)

Backend:
═══════════════════════════════════════

  → Connection pooling (DB, Redis, HTTP)
  → Async processing (message queue for heavy tasks)
  → Caching (Redis, Memcached, CDN)
  → Database indexing (covering indexes)
  → Query optimization (EXPLAIN, batch operations)
  → N+1 query elimination
  → Pagination (cursor-based for large datasets)
  → Compression (response, inter-service)
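
Cursor-based pagination, listed above, avoids the growing cost of large offsets by asking for "rows after the last id seen". The sketch below models this over an in-memory array that is assumed to be sorted by a unique, increasing `id`; against a database the equivalent is typically a `WHERE id > cursor ORDER BY id LIMIT n` query. `fetchPage` is an illustrative name.

```javascript
// Returns up to `limit` rows with id greater than `afterId`, plus the cursor
// for the next page (null when there are no more rows).
// Assumes `rows` is sorted ascending by a unique numeric id and limit >= 1.
function fetchPage(rows, afterId, limit) {
  const candidates = rows
    .filter((row) => afterId === null || row.id > afterId)
    .slice(0, limit + 1); // fetch one extra row to detect a further page
  const items = candidates.slice(0, limit);
  const nextCursor = candidates.length > limit ? items[items.length - 1].id : null;
  return { items, nextCursor };
}

// Walk all pages of a small hypothetical data set.
const rows = [1, 2, 3, 4, 5].map((id) => ({ id, name: `product-${id}` }));
let cursor = null;
do {
  const page = fetchPage(rows, cursor, 2);
  console.log(page.items.map((r) => r.id), page.nextCursor);
  cursor = page.nextCursor;
} while (cursor !== null);
```

Because each page is located by key rather than by position, the cost per page stays roughly constant as the data set grows.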

Infrastructure:
═══════════════════════════════════════

  → Auto-scaling (CPU/memory/request-based)
  → Load balancing (least connections, round robin)
  → Health checks (remove unhealthy instances)
  → Keep-alive connections (HTTP, DB)
  → CDN edge caching (dynamic content)
  → Service mesh (retry, circuit breaker, timeout)

BEFORE vs AFTER OPTIMIZATION:
═══════════════════════════════════════

  Metric             Before      After       Improvement
  ────────────────────────────────────────────────────────────────
  P95 response time  850ms       180ms       79% lower
  P99 response time  1500ms      350ms       77% lower
  Throughput         500 RPS     2,000 RPS   4x increase
  Error rate         0.5%        0.02%       96% reduction
  CPU utilization    85%         35%         50 points lower (59% relative)
  Database queries   150ms avg   25ms avg    83% lower
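
The improvement column mixes three kinds of figures: relative reductions (latency, error rate), a multiplier (throughput), and a change between two percentages (CPU). A short sketch of how each could be computed from the before and after values follows; the function names are illustrative.

```javascript
// Relative reduction in percent, for metrics where lower is better.
function relativeReductionPct(before, after) {
  return ((before - after) / before) * 100;
}

// Multiplier, for metrics where higher is better (for example throughput).
function increaseFactor(before, after) {
  return after / before;
}

// Change in percentage points between two percentages (for example CPU).
function pointChange(beforePct, afterPct) {
  return afterPct - beforePct;
}

console.log(relativeReductionPct(850, 180).toFixed(1)); // "78.8" (P95)
console.log(increaseFactor(500, 2000));                 // 4 (throughput)
console.log(pointChange(85, 35));                       // -50 (CPU, in points)
console.log(Math.round(relativeReductionPct(85, 35)));  // 59 (CPU, relative)
```

Stating whether a CPU figure is in percentage points or relative terms avoids reading a drop from 85% to 35% as "half".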

Edge Cases

Geo-distributed: Multi-region latency, DNS routing
Bursty traffic: Social media viral spikes
Long-running requests: Streaming, WebSocket, SSE
Large payloads: File upload/download, batch processing
Third-party dependencies: External API SLA constraints

Integration Points

Testing tools: k6, JMeter, Gatling, Locust, Artillery
APM: Datadog, New Relic, Dynatrace, AppDynamics
Monitoring: Prometheus, Grafana, CloudWatch
Profiling: pprof, async-profiler, VisualVM
CI/CD: GitHub Actions, GitLab CI (performance gates)
Infrastructure: Kubernetes, AWS, Azure, GCP
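
A CI/CD performance gate typically compares the latest test run with a stored baseline and fails the pipeline on a regression. The sketch below is one way this could look, assuming a 10% tolerance and the metric names `p95Ms`, `p99Ms`, `errorRatePct`, and `rps`; all of these are hypothetical choices rather than a convention of any particular CI system.

```javascript
// Metrics where a higher value is a regression; all others are treated as
// "higher is better" (for example rps).
const LOWER_IS_BETTER = new Set(['p95Ms', 'p99Ms', 'errorRatePct']);

// Compares `current` with `baseline` and reports metrics that regressed by
// more than `tolerancePct` percent.
function performanceGate(baseline, current, tolerancePct = 10) {
  const failures = [];
  for (const [metric, base] of Object.entries(baseline)) {
    const now = current[metric];
    if (now === undefined) continue; // metric not measured in this run
    const lowerIsBetter = LOWER_IS_BETTER.has(metric);
    const limit = lowerIsBetter
      ? base * (1 + tolerancePct / 100)
      : base * (1 - tolerancePct / 100);
    const regressed = lowerIsBetter ? now > limit : now < limit;
    if (regressed) failures.push({ metric, baseline: base, current: now, limit });
  }
  return { pass: failures.length === 0, failures };
}

// Hypothetical baseline and latest run; a CI step would exit non-zero on failure.
const baseline = { p95Ms: 180, p99Ms: 350, rps: 2000 };
const latest = { p95Ms: 250, p99Ms: 360, rps: 1950 };
console.log(performanceGate(baseline, latest));
```

A tolerance band keeps normal run-to-run noise from failing builds while still catching meaningful regressions.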

Output

Performance Engineering Status

PERFORMANCE STATUS — Q4 2024
═══════════════════════════════════════

SLA compliance: 99.5% (target: 99.9%) ✗
P95 response time: 180ms (target: <300ms) ✓
Throughput: 2,000 RPS (target: >1,000 RPS) ✓
Last load test: Passed (2x production)
Next stress test: Q1 2025 (5x production)
Capacity: 6 months headroom (all resources)
Optimization wins: P95 response time 79% lower (850ms → 180ms)
Open issues: 2 (DB slow query, cache invalidation)

Disclaimer: All rights reserved by Circulos AI. These skills are specifically designed for Claude Code, Claude Cowork, Codex, and OpenClaw. When using or referencing any skill, please provide proper attribution to Circulos AI.