IT AI Skill
Performance Engineering
Conduct performance engineering including load testing, stress testing, bottleneck analysis, capacity planning, and performance optimization for applications, databases, and infrastructure. Use when planning performance tests, analyzing bottlenecks, optimiz...
Workflow
1. Performance Testing Strategy
PERFORMANCE TESTING TYPES
═══════════════════════════════════════
Test Type         Purpose                   Duration     Load Pattern    Metrics
────────────────────────────────────────────────────────────────────────────────────────────────
Baseline          Establish reference       30 min       Single user     Response time per endpoint
Load              Expected production load  1 hour       Ramp → steady   P95/P99, throughput
Stress            Beyond expected load      2 hours      Ramp to break   Breaking point, degradation
Endurance (Soak)  Long-term stability       24-72 hours  Steady          Memory leaks, resource exhaustion
Spike             Sudden load surge         30 min       Normal → spike  Recovery time, errors
Scalability       Auto-scaling validation   1 hour       Progressive     Scale-up/down time
PERFORMANCE TARGETS (SLA):
═══════════════════════════════════════
Endpoint            P50    P95    P99     Throughput  Error Rate
───────────────────────────────────────────────────────────────────────
API: GET /products  50ms   150ms  250ms   1000 RPS    < 0.1%
API: POST /orders   100ms  300ms  500ms   500 RPS     < 0.1%
Web: Homepage       200ms  500ms  800ms   2000 RPS    < 0.1%
Web: Checkout       300ms  800ms  1200ms  200 RPS     < 0.05%
DB: Query           10ms   50ms   100ms   5000 QPS    < 0.01%
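As a rough illustration of how raw latency samples might be checked against targets like these, the sketch below computes nearest-rank percentiles and compares them with the GET /products row. The function names (`percentile`, `checkSla`) and the shape of the target object are assumptions made for this example, not part of any tool mentioned here.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples <= it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Targets for GET /products from the table (milliseconds; 0.1% = 0.001).
const productsTarget = { p50: 50, p95: 150, p99: 250, maxErrorRate: 0.001 };

// Assumes latenciesMs holds one entry per request, including failed ones.
function checkSla(latenciesMs, errorCount, target) {
  const observed = {
    p50: percentile(latenciesMs, 50),
    p95: percentile(latenciesMs, 95),
    p99: percentile(latenciesMs, 99),
    errorRate: errorCount / latenciesMs.length,
  };
  const violations = [];
  if (observed.p50 > target.p50) violations.push('p50');
  if (observed.p95 > target.p95) violations.push('p95');
  if (observed.p99 > target.p99) violations.push('p99');
  if (observed.errorRate >= target.maxErrorRate) violations.push('errorRate');
  return { observed, violations };
}
```

Most load tools report percentiles directly; a helper like this is mainly useful when post-processing exported raw samples.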
LOAD MODEL:
═══════════════════════════════════════
Production baseline:
→ 10,000 concurrent users (peak)
→ 500 requests/second (avg)
→ 1,200 requests/second (peak)
→ 200,000 daily transactions
Test scenarios:
→ Normal: 1x production load (baseline validation)
→ Peak: 2x production load (capacity validation)
→ Black Friday: 5x production load (stress test)
→ Soak: 1x production for 72 hours (stability)
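The multipliers above can be turned into concrete request-rate and virtual-user targets. The sketch below is one way to do that, under two stated assumptions: "1x production load" is taken to mean the peak request rate (1,200 RPS), and each virtual user sends one request per iteration. The virtual-user estimate uses Little's Law (concurrency = arrival rate × time per iteration); the 200 ms response time and 1 s think time in the loop are illustrative values only.

```javascript
// Production baseline and scenario multipliers taken from the load model above.
const productionBaseline = { avgRps: 500, peakRps: 1200, peakConcurrentUsers: 10000 };
const scenarioMultipliers = { normal: 1, peak: 2, blackFriday: 5, soak: 1 };

// Assumption: "1x production load" means the peak request rate, not the average.
function scenarioRps(name, baseRps = productionBaseline.peakRps) {
  const multiplier = scenarioMultipliers[name];
  if (multiplier === undefined) throw new Error(`unknown scenario: ${name}`);
  return baseRps * multiplier;
}

// Little's Law: VUs needed = request rate x seconds each VU spends per iteration.
// Assumes one request per iteration.
function requiredVus(targetRps, responseTimeMs, thinkTimeMs) {
  return Math.ceil((targetRps * (responseTimeMs + thinkTimeMs)) / 1000);
}

for (const name of Object.keys(scenarioMultipliers)) {
  const rps = scenarioRps(name);
  console.log(name, rps, 'RPS needs about', requiredVus(rps, 200, 1000), 'VUs');
}
```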
2. Load Testing Execution
LOAD TESTING EXECUTION (k6 / JMeter / Gatling)
═══════════════════════════════════════
Test Script (k6):
═══════════════════════════════════════
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '5m', target: 100 },  // Ramp up to 100 VUs
    { duration: '15m', target: 100 }, // Steady state
    { duration: '5m', target: 300 },  // Spike to 300 VUs
    { duration: '5m', target: 300 },  // Hold spike
    { duration: '5m', target: 0 },    // Ramp down
  ],
  thresholds: {
    // GET /products targets from the SLA table: P95 150ms, P99 250ms, errors < 0.1%
    http_req_duration: ['p(95)<150', 'p(99)<250'],
    http_req_failed: ['rate<0.001'],
    // With sleep(1), each VU sends at most about 1 request/s, so this profile
    // averages roughly 130 RPS; raise the VU targets to validate 500+ RPS.
    http_reqs: ['rate>100'],
  },
};

export default function () {
  const res = http.get('https://api.example.com/products');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(1); // Think time between iterations
}
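Before running a staged profile, it can help to estimate how much traffic it is able to generate, so that any throughput threshold is actually reachable. The sketch below is plain JavaScript (not a k6 script) and assumes linear ramps between stage targets and one request per iteration. The helper names are hypothetical, and the duration parser handles only single-unit values such as '30s', '5m', or '2h'.

```javascript
// Parse single-unit durations such as '30s', '5m', '2h' into seconds.
function durationSeconds(d) {
  const m = /^(\d+)(s|m|h)$/.exec(d);
  if (!m) throw new Error(`unsupported duration: ${d}`);
  return Number(m[1]) * { s: 1, m: 60, h: 3600 }[m[2]];
}

// Time-weighted average VU count, ramping linearly from the previous target.
function averageVus(stages, startVus = 0) {
  let previous = startVus;
  let vuSeconds = 0;
  let totalSeconds = 0;
  for (const stage of stages) {
    const seconds = durationSeconds(stage.duration);
    vuSeconds += (seconds * (previous + stage.target)) / 2;
    totalSeconds += seconds;
    previous = stage.target;
  }
  return vuSeconds / totalSeconds;
}

// Requests per second that a given VU count can sustain, where iterationMs
// is response time plus think time for one iteration.
function estimatedRps(vus, iterationMs) {
  return (vus * 1000) / iterationMs;
}

// The stage profile from the script above.
const stages = [
  { duration: '5m', target: 100 },
  { duration: '15m', target: 100 },
  { duration: '5m', target: 300 },
  { duration: '5m', target: 300 },
  { duration: '5m', target: 0 },
];
console.log('average VUs:', averageVus(stages).toFixed(1));
console.log('peak RPS at 300 VUs, 1.1s iterations:', estimatedRps(300, 1100).toFixed(0));
```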
TEST ENVIRONMENT:
═══════════════════════════════════════
→ Isolated from production (no customer impact)
→ Mirrored infrastructure (same spec, less data)
→ Data: Anonymized production data (10% sample)
→ Network: Same architecture (LB, CDN, WAF)
→ Monitoring: Full APM (Datadog, New Relic, Prometheus)
Load generators:
→ Cloud-based (k6 Cloud, BlazeMeter)
→ Multiple regions (geographic distribution)
→ 10+ generators (parallel execution)
3. Bottleneck Analysis
BOTTLENECK ANALYSIS FRAMEWORK
═══════════════════════════════════════
Layer-by-Layer Analysis:
═══════════════════════════════════════
1. Network Layer:
→ Bandwidth utilization (>80% = bottleneck)
→ Latency between tiers
→ DNS resolution time
→ SSL/TLS handshake time
→ CDN hit ratio
2. Application Server Layer:
→ CPU utilization (>70% sustained = bottleneck)
→ Memory usage (leaks, GC pressure)
→ Thread pool exhaustion
→ Connection pool exhaustion
→ Disk I/O (slow disk, queue depth)
3. Database Layer:
→ CPU (>70% = query optimization needed)
→ IOPS (>80% of limit = storage upgrade)
→ Connection pool (maxed out = increase pool)
→ Slow queries (>100ms = add index/rewrite)
→ Lock contention (deadlocks, blocking)
→ Buffer pool hit ratio (<99% = increase memory)
4. External Services:
→ Third-party API latency
→ Payment gateway timeout
→ Email service queue depth
→ Search engine response time
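The numeric thresholds in the layers above lend themselves to a simple rule table. The sketch below encodes them as data and flags any metric that crosses its limit. The metric names (`bandwidthPct`, `appCpuPct`, and so on) are hypothetical keys chosen for this example; in practice they would map to whatever the monitoring stack exports.

```javascript
// Thresholds copied from the layer-by-layer list; metric keys are hypothetical.
const bottleneckRules = [
  { layer: 'network', metric: 'bandwidthPct', breached: (v) => v > 80, hint: 'bandwidth saturated' },
  { layer: 'application', metric: 'appCpuPct', breached: (v) => v > 70, hint: 'sustained CPU pressure' },
  { layer: 'database', metric: 'dbCpuPct', breached: (v) => v > 70, hint: 'query optimization needed' },
  { layer: 'database', metric: 'dbIopsPctOfLimit', breached: (v) => v > 80, hint: 'storage upgrade' },
  { layer: 'database', metric: 'dbSlowQueryMs', breached: (v) => v > 100, hint: 'add index or rewrite query' },
  { layer: 'database', metric: 'dbBufferPoolHitPct', breached: (v) => v < 99, hint: 'increase memory' },
];

// Returns one finding per breached rule; metrics missing from the input are skipped.
function findBottlenecks(metrics) {
  return bottleneckRules
    .filter((rule) => rule.metric in metrics && rule.breached(metrics[rule.metric]))
    .map((rule) => ({
      layer: rule.layer,
      metric: rule.metric,
      value: metrics[rule.metric],
      hint: rule.hint,
    }));
}

console.log(findBottlenecks({ appCpuPct: 85, dbBufferPoolHitPct: 97.5, bandwidthPct: 40 }));
```

A single snapshot only shows where to look first; "sustained" conditions still need to be confirmed over a time window.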
PERFORMANCE PROFILING:
═══════════════════════════════════════
Application Profiling:
→ Flame graphs (CPU profiling)
→ Memory profiling (heap dumps, GC logs)
→ APM traces (distributed tracing)
→ Database query profiling (EXPLAIN plans)
→ Thread dumps (active threads analysis)
Tools:
→ APM: Datadog APM, New Relic, Dynatrace
→ Profiling: pprof, async-profiler, VisualVM
→ DB: pg_stat_statements, MySQL EXPLAIN, SQL Profiler
→ Network: tcpdump, Wireshark, eBPF
4. Capacity Planning
CAPACITY PLANNING
═══════════════════════════════════════
Current vs Projected:
═══════════════════════════════════════
Metric         Current   Growth Rate    6-Month Projection  Threshold
─────────────────────────────────────────────────────────────────────────────
Users          50K       +20%/quarter   72K                 200K (scale)
RPS            500       +20%/quarter   720                 2,000 (scale)
Storage        10 TB     +15%/month     23 TB               50 TB (expand)
Database rows  100M      +5%/month      134M                500M (shard)
Bandwidth      50 Mbps   +10%/month     89 Mbps             1 Gbps (upgrade)
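Projections of this kind follow the compound growth formula, projected = current × (1 + rate)^periods, where the number of periods depends on the unit of the growth rate: a six-month horizon is two periods for a quarterly rate and six for a monthly one. The sketch below applies that formula and also estimates how many periods remain before a threshold is crossed. The function names are illustrative.

```javascript
// Compound growth: value after `periods` periods at `ratePerPeriod` (0.15 = 15%).
function project(current, ratePerPeriod, periods) {
  return current * Math.pow(1 + ratePerPeriod, periods);
}

// Whole periods until `current` first reaches `threshold` under compound growth.
function periodsUntil(current, threshold, ratePerPeriod) {
  if (current >= threshold) return 0;
  if (ratePerPeriod <= 0) return Infinity;
  return Math.ceil(Math.log(threshold / current) / Math.log(1 + ratePerPeriod));
}

// Users: 50K growing 20% per quarter, six months = 2 quarters.
console.log('users in 6 months:', Math.round(project(50000, 0.2, 2)));
// Storage: 10 TB growing 15% per month, threshold 50 TB.
console.log('months until storage threshold:', periodsUntil(10, 50, 0.15));
```

Constant percentage growth is itself an assumption; real growth is often seasonal, so these figures are best treated as planning estimates.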
Scaling Triggers:
═══════════════════════════════════════
Horizontal (add instances):
→ CPU > 70% sustained for 5 minutes
→ Memory > 80% sustained for 5 minutes
→ Request queue > 100
→ Auto-scaling: Min 3, Max 20, Target 60% CPU
Vertical (upgrade instance):
→ CPU consistently at 90%+
→ Memory insufficient (cannot scale out enough)
→ Disk IOPS at limit
→ Network bandwidth saturated
Database Scaling:
→ Read replicas (read-heavy)
→ Sharding (data volume)
→ Partitioning (query performance)
→ Caching (Redis/Memcached, hot data)
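The horizontal triggers above combine two ideas: a condition that must hold for a sustained window, and a target-tracking calculation for how many instances to run. The sketch below illustrates both. The instance calculation uses the proportional rule that the Kubernetes Horizontal Pod Autoscaler documents (desired = ceil(current × observed / target)), clamped to the Min 3 / Max 20 / Target 60% CPU values listed above. The sample format `{ t, value }` with `t` in seconds is an assumption for this example.

```javascript
// True when every sample in the last `windowSec` seconds exceeds `threshold`.
// Note: this does not verify that samples cover the whole window.
function sustainedAbove(samples, threshold, windowSec, nowSec) {
  const recent = samples.filter((s) => s.t > nowSec - windowSec && s.t <= nowSec);
  return recent.length > 0 && recent.every((s) => s.value > threshold);
}

// Proportional target tracking, clamped to the configured bounds.
function desiredInstances(current, observedCpuPct, limits = {}) {
  const { min = 3, max = 20, targetCpuPct = 60 } = limits;
  const raw = Math.ceil((current * observedCpuPct) / targetCpuPct);
  return Math.min(max, Math.max(min, raw));
}

// One CPU sample per minute over the last five minutes, all at 75%.
const cpuSamples = [60, 120, 180, 240, 300].map((t) => ({ t, value: 75 }));
if (sustainedAbove(cpuSamples, 70, 300, 300)) {
  console.log('scale out to', desiredInstances(4, 75), 'instances');
}
```

Real autoscalers add cooldown periods and stabilization windows to avoid flapping; those are omitted here.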
CAPACITY REPORT:
═══════════════════════════════════════
→ Monthly capacity review
→ Budget forecast (next 6-12 months)
→ Procurement timeline (lead times)
→ Cost optimization opportunities
→ Right-sizing recommendations
5. Performance Optimization
PERFORMANCE OPTIMIZATION CHECKLIST
═══════════════════════════════════════
Frontend:
═══════════════════════════════════════
→ CDN for static assets (images, CSS, JS)
→ Lazy loading (below-the-fold content)
→ Image optimization (WebP, responsive sizes)
→ Code splitting (Webpack chunks)
→ Caching strategy (HTTP headers, service worker)
→ DNS prefetch + preconnect
→ Minify + compress (Gzip/Brotli)
Backend:
═══════════════════════════════════════
→ Connection pooling (DB, Redis, HTTP)
→ Async processing (message queue for heavy tasks)
→ Caching (Redis, Memcached, CDN)
→ Database indexing (covering indexes)
→ Query optimization (EXPLAIN, batch operations)
→ N+1 query elimination
→ Pagination (cursor-based for large datasets)
→ Compression (response, inter-service)
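To make the N+1 item concrete, the sketch below contrasts a per-row lookup with a single batched lookup. It uses a hypothetical in-memory stand-in for a database client (`makeFakeDb`) that only counts round trips; the point is the number of queries issued, not the data access itself.

```javascript
// Hypothetical in-memory stand-in for a database client that counts round trips.
function makeFakeDb(customers) {
  let queries = 0;
  return {
    findCustomer(id) {
      queries += 1;
      return customers.find((c) => c.id === id) ?? null;
    },
    findCustomers(ids) {
      queries += 1;
      const wanted = new Set(ids);
      return customers.filter((c) => wanted.has(c.id));
    },
    get queryCount() {
      return queries;
    },
  };
}

// N+1 pattern: one customer lookup per order.
function attachCustomersNaive(db, orders) {
  return orders.map((o) => ({ ...o, customer: db.findCustomer(o.customerId) }));
}

// Batched: one lookup for all distinct customer ids, then an in-memory join.
function attachCustomersBatched(db, orders) {
  const ids = [...new Set(orders.map((o) => o.customerId))];
  const byId = new Map(db.findCustomers(ids).map((c) => [c.id, c]));
  return orders.map((o) => ({ ...o, customer: byId.get(o.customerId) ?? null }));
}

const customers = [{ id: 1, name: 'Ada' }, { id: 2, name: 'Lin' }];
const orders = [
  { id: 10, customerId: 1 },
  { id: 11, customerId: 2 },
  { id: 12, customerId: 1 },
];
const naiveDb = makeFakeDb(customers);
const batchedDb = makeFakeDb(customers);
attachCustomersNaive(naiveDb, orders);
attachCustomersBatched(batchedDb, orders);
console.log('naive queries:', naiveDb.queryCount, 'batched queries:', batchedDb.queryCount);
```

With a real database, the batched form typically corresponds to a single `IN (...)` query or a join.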
Infrastructure:
═══════════════════════════════════════
→ Auto-scaling (CPU/memory/request-based)
→ Load balancing (least connections, round robin)
→ Health checks (remove unhealthy instances)
→ Keep-alive connections (HTTP, DB)
→ CDN edge caching (dynamic content)
→ Service mesh (retry, circuit breaker, timeout)
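In a service mesh, retries, circuit breaking, and timeouts are normally configured rather than coded. Purely to illustrate the state machine behind the circuit breaker item, here is a minimal synchronous sketch. The class, its option names, and the default values are assumptions for this example, and the injectable `now` clock exists only to make the behavior easy to test.

```javascript
// Minimal circuit breaker: closed -> open after N consecutive failures,
// open -> half-open after a reset timeout, half-open -> closed on one success.
class CircuitBreaker {
  constructor({ failureThreshold = 5, resetTimeoutMs = 30000, now = () => Date.now() } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
    this.state = 'closed';
    this.failures = 0;
    this.openedAt = 0;
  }

  call(fn) {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error('circuit open'); // Fail fast without calling the dependency
      }
      this.state = 'half-open'; // Allow one trial call
    }
    try {
      const result = fn();
      this.failures = 0;
      this.state = 'closed';
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

Failing fast while the circuit is open keeps threads and connections from piling up behind a dependency that is already struggling.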
BEFORE vs AFTER OPTIMIZATION:
═══════════════════════════════════════
Metric             Before     After      Improvement
────────────────────────────────────────────────────────────────
P95 response time  850ms      180ms      79% lower
P99 response time  1500ms     350ms      77% lower
Throughput         500 RPS    2,000 RPS  4x increase
Error rate         0.5%       0.02%      96% reduction
CPU utilization    85%        35%        50 points lower (59% relative)
Database queries   150ms avg  25ms avg   83% lower
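The improvement figures can be reproduced with two small formulas: relative reduction for "lower is better" metrics and a ratio for throughput. One detail is worth making explicit: when the metric is itself a percentage, the percentage-point change and the relative change differ, so CPU utilization going from 85% to 35% is a 50-point drop but a 59% relative reduction. The helper names below are illustrative.

```javascript
// Relative reduction in percent for metrics where lower is better.
function reductionPct(before, after) {
  return ((before - after) / before) * 100;
}

// Multiplicative gain for metrics where higher is better (e.g. throughput).
function gainFactor(before, after) {
  return after / before;
}

// Before/after values taken from the table above.
console.log('P95:', reductionPct(850, 180).toFixed(0) + '% lower');
console.log('P99:', reductionPct(1500, 350).toFixed(0) + '% lower');
console.log('Throughput:', gainFactor(500, 2000) + 'x');
console.log('CPU:', 85 - 35, 'points,', reductionPct(85, 35).toFixed(0) + '% relative');
```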
Edge Cases
- Geo-distributed: Multi-region latency, DNS routing
- Bursty traffic: Social media viral spikes
- Long-running requests: Streaming, WebSocket, SSE
- Large payloads: File upload/download, batch processing
- Third-party dependencies: External API SLA constraints
Integration Points
- Testing tools: k6, JMeter, Gatling, Locust, Artillery
- APM: Datadog, New Relic, Dynatrace, AppDynamics
- Monitoring: Prometheus, Grafana, CloudWatch
- Profiling: pprof, async-profiler, VisualVM
- CI/CD: GitHub Actions, GitLab CI (performance gates)
- Infrastructure: Kubernetes, AWS, Azure, GCP
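A performance gate in CI usually comes down to comparing the current run against a stored baseline and failing the build on regression. The sketch below shows one possible shape for that comparison. It assumes every metric is "lower is better" (latency, error rate) and uses a hypothetical 10% tolerance; the metric names are placeholders for whatever the test tool exports.

```javascript
// Compare "lower is better" metrics against a baseline with a percentage tolerance.
function evaluateGate(baseline, current, tolerancePct = 10) {
  const regressions = [];
  for (const [metric, base] of Object.entries(baseline)) {
    const value = current[metric];
    if (value === undefined) {
      regressions.push({ metric, reason: 'missing from current run' });
      continue;
    }
    const limit = base * (1 + tolerancePct / 100);
    if (value > limit) {
      regressions.push({ metric, base, value, limit, reason: 'exceeds tolerance' });
    }
  }
  return { passed: regressions.length === 0, regressions };
}

// Hypothetical baseline and current results in milliseconds.
const gate = evaluateGate({ p95Ms: 180, p99Ms: 350 }, { p95Ms: 190, p99Ms: 400 });
console.log(gate.passed ? 'gate passed' : 'gate failed', gate.regressions);
```

In a pipeline, a failed gate would typically translate into a non-zero exit code so that the job is marked as failed.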
Output
Performance Engineering Status
PERFORMANCE STATUS — Q4 2024
═══════════════════════════════════════
SLA compliance: 99.5% (target: 99.9%) ✗
P95 response time: 180ms (target: <300ms) ✓
Throughput: 2,000 RPS (target: >1,000 RPS) ✓
Last load test: Passed (2x production)
Next stress test: Q1 2025 (5x production)
Capacity: 6 months headroom (all resources)
Optimization wins: P95 response time 79% lower (850ms → 180ms)
Open issues: 2 (DB slow query, cache invalidation)