IT AI Skill
Database Performance Monitoring
Monitor database health, query performance, resource utilization, replication, and capacity. Use when diagnosing slow queries, monitoring database availability, tracking replication lag, optimizing database performance, planning database capacity, managing...
Database Performance Monitoring
Comprehensive monitoring and optimization of database infrastructure across all environments.
Workflow
- Deploy database monitoring agents on all production, staging, and development databases.
- Configure health checks: availability, response time, connection pool status, replication health.
- Enable slow query logging with threshold configured per database type and workload.
- Build real-time dashboards: query performance, resource utilization, wait events, I/O metrics.
- Implement automated query analysis: execution plan review, index recommendations, query optimization suggestions.
- Configure alerts: slow queries, connection exhaustion, replication lag, disk space, deadlock detection.
- Conduct weekly performance reviews: identify trending issues before they become incidents.
- Perform monthly capacity planning: storage growth, connection trends, throughput projections.
- Execute quarterly optimization: index analysis, statistics updates, table maintenance, configuration tuning.
- Document performance baselines and optimization changes for all databases.
Database Health Monitoring
DATABASE HEALTH CHECK FRAMEWORK
================================
Availability Monitoring:
Check method: Frequency Alert Threshold Response
────────────────────── ──────────── ─────────────── ──────────
TCP port connectivity Every 10 sec Down > 30 sec Auto-failover + page
Database ping/query Every 15 sec Response > 5 sec Alert DBA team
Application connection Every 30 sec Failures > 10% Page on-call
Replication status Every 30 sec Lag > 60 sec Alert DBA
Read replica health Every 60 sec Any replica down Alert DBA
Database-specific health queries:
PostgreSQL:
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
SELECT * FROM pg_stat_replication; -- replication lag
SELECT datname, age(datfrozenxid) FROM pg_database; -- transaction ID wraparound risk
MySQL:
SHOW PROCESSLIST; -- active connections
SHOW SLAVE STATUS\G -- replication status
SHOW ENGINE INNODB STATUS\G -- InnoDB status
SQL Server:
SELECT * FROM sys.dm_os_ring_buffers; -- error ring buffer
SELECT * FROM sys.dm_os_wait_stats; -- wait statistics
SELECT * FROM sys.dm_os_performance_counters; -- performance counters
MongoDB:
db.serverStatus() -- comprehensive server status
db.currentOp() -- current operations
rs.status() -- replica set status
Resource Utilization Monitoring:
CPU:
Database CPU vs. Host CPU (identify if bottleneck is database or other processes)
Alert: database CPU > 80% sustained for 5 minutes
Target: < 70% average, < 90% peak
Common causes: unoptimized queries, missing indexes, excessive sorting, full table scans
Memory:
Buffer pool / cache hit ratio (target: > 99% for InnoDB, > 95% for PostgreSQL shared buffers)
Working set size vs. available memory
Alert: buffer pool hit ratio < 95% (indicates insufficient memory or query pattern issues)
Swap usage: any swap = performance degradation (alert immediately)
Connection memory: per-connection memory × active connections
Disk I/O:
IOPS (Input/Output Operations Per Second):
SSD: 3,000–50,000 IOPS typical (NVMe: 100,000+)
HDD: 100–200 IOPS typical
Cloud (AWS gp3): 3,000 baseline, scale to 16,000
Throughput:
SSD: 500 MB/s–3.5 GB/s (NVMe)
HDD: 100–200 MB/s
Cloud (AWS gp3): 125 MB/s baseline, scale to 1,000 MB/s
I/O wait:
Alert: iowait > 20% (disk cannot keep up with requests)
Target: < 10% iowait
Alert: disk latency > 10ms for SSD, > 20ms for HDD
Disk Space:
Tablespaces / data files growth rate
Index size vs. data size (ratio should be 20–40%; > 50% indicates over-indexing)
Temporary tablespace usage (spike indicates complex queries)
Undo / rollback segment usage
Alert: disk usage > 80% (warning), > 90% (critical — database can stop)
Auto-expand: ensure data files have auto-expand enabled (but monitor)
Network:
Database network throughput
Connection count and connection rate
Replication network bandwidth (for master-slave setups)
Latency between application server and database
Alert: network latency > 5ms for co-located, > 20ms for cross-AZ
Query Performance Analysis
SLOW QUERY ANALYSIS FRAMEWORK
===============================
Slow Query Log Configuration:
PostgreSQL (log_min_duration_statement):
# Log queries taking longer than 1 second
log_min_duration_statement = 1000 # milliseconds
# For high-performance databases, lower threshold:
log_min_duration_statement = 100 # for APIs requiring < 100ms response
MySQL (slow_query_log):
slow_query_log = ON
long_query_time = 1.0 # seconds
log_queries_not_using_indexes = ON # critical for optimization
slow_query_log_file = /var/log/mysql/slow-query.log
SQL Server (Extended Events / Query Store):
-- Enable Query Store
ALTER DATABASE [dbname] SET QUERY_STORE = ON;
-- Capture queries with duration > 1 second
execution_plan_stats = ON
MongoDB (Profiling):
db.setProfilingLevel(1, { slowms: 100 }) # profile queries > 100ms
Slow Query Analysis Process:
Step 1: Identify slow queries
- Review slow query log (aggregate by query pattern, not individual query)
- Group by query fingerprint (normalize parameters)
- Sort by: total time, average time, frequency
- Top 20 slow queries typically account for 80% of query time
Step 2: Analyze execution plan
- PostgreSQL: EXPLAIN ANALYZE <query>
- MySQL: EXPLAIN <query>
- SQL Server: Execution Plan (Actual or Estimated)
- MongoDB: db.collection.find().explain("executionStats")
Key execution plan indicators:
FULL TABLE SCAN: ⚠️ Scanning all rows — index needed
INDEX SCAN: ✅ Using index — good
INDEX SEEK: ✅ Best — direct row access
NESTED LOOP JOIN: ⚠️ May be slow for large tables
HASH JOIN: ✅ Efficient for large datasets
SORT: ⚠️ Sorting in memory — may spill to disk
TEMP TABLE: ⚠️ Creating temporary table — memory issue
FILESORT: ⚠️ Sorting on disk — performance hit
Step 3: Optimization strategies
Missing indexes:
- Add index on WHERE clause columns
- Add composite index for multi-column WHERE
- Covering index (include SELECT columns to avoid table lookup)
- Example: CREATE INDEX idx_users_email_status ON users(email, status);
- Impact: 10x–1000x improvement for indexed queries
Query rewriting:
- Replace SELECT * with specific columns
- Avoid subqueries; use JOINs instead
- Use LIMIT for pagination (avoid OFFSET for deep pagination)
- Use EXISTS instead of IN for subqueries
- Batch operations instead of row-by-row
- Impact: 2x–10x improvement
Schema optimization:
- Normalize/denormalize based on access patterns
- Partition large tables (> 10M rows)
- Archive historical data to separate table
- Use appropriate data types (INT vs BIGINT, VARCHAR length)
- Impact: 2x–5x improvement
Connection optimization:
- Use connection pooling (PgBouncer, HikariCP, pgpool)
- Set appropriate pool size: (CPU cores × 2) + disk drives (formula)
- Connection timeout: 30 seconds idle
- Max connections per database server: 100–500 (depends on memory)
- Impact: reduces connection overhead by 50–90%
Step 4: Verification
- Re-run EXPLAIN ANALYZE after optimization
- Compare execution time before/after
- Run load test to validate under production load
- Monitor for 1 week to ensure no regression
Replication and High Availability
REPLICATION MONITORING FRAMEWORK
==================================
Replication types and monitoring:
PostgreSQL (Streaming Replication):
Monitor query:
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
replay_lag, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
Key metrics:
replay_lag: Target < 1 second; Alert > 10 seconds; Critical > 60 seconds
state: Must be "streaming"; alert on "catchup" or disconnected
WAL sender process: Must be running on primary
WAL receiver process: Must be running on replica
Failure modes:
- Network partition: replica stops receiving WAL
- Disk full on replica: WAL cannot be written
- Version mismatch: primary upgraded, replica not yet
- Max WAL senders: all sender connections used
MySQL (Binlog Replication / Group Replication):
Monitor query:
SHOW SLAVE STATUS\G -- traditional replication
SELECT * FROM performance_schema.replication_group_members; -- group replication
Key metrics:
Seconds_Behind_Master: Target < 1 second; Alert > 30 seconds; Critical > 120 seconds
Slave_IO_Running: Must be "Yes"
Slave_SQL_Running: Must be "Yes"
Last_Errno: Must be 0 (no errors)
Relay_Log_Space: Track for growing gap
Group Replication (MySQL 8.0+):
- Multi-primary or single-primary mode
- Automatic conflict detection and resolution
- Monitor group membership and view changes
- Alert on member leaving group
SQL Server (AlwaysOn / Log Shipping):
Monitor:
SELECT * FROM sys.dm_hadr_availability_replica_states;
SELECT * FROM sys.dm_hadr_database_replica_states;
Key metrics:
Synchronization state: Must be "SYNCHRONIZED" or "SYNCHRONIZING"
Log send rate: Track for throughput issues
Redo queue size: Target < 1 MB; Alert > 10 MB
Redo rate: Must be > 0 (applying logs)
MongoDB (Replica Set):
Monitor:
rs.status()
Key metrics:
stateStr: Primary = "PRIMARY"; Secondaries = "SECONDARY"
health: Must be 1 (healthy)
optimeDate: Compare primary vs secondary for replication lag
electionDate: Track elections (frequent elections = stability issue)
heartbeatAvgMillis: Network latency between members
High Availability Architecture:
Active-Passive (most common):
- Primary handles all writes; standby on hot standby
- Failover time: 30 seconds–5 minutes (automatic with Patroni/PgBouncer for PostgreSQL)
- RPO: near-zero (streaming replication); some seconds of data loss possible
- Tools: Patroni (PostgreSQL), Orchestrator (MySQL), AlwaysOn (SQL Server)
Active-Active (advanced):
- Multiple writable nodes (requires conflict resolution)
- PostgreSQL: Citus, BDR, or logical replication
- MySQL: Group Replication, Galera Cluster
- MongoDB: sharded cluster (different pattern)
- Complexity: high; only use when write scalability required
Read Replicas (scaling reads):
- Primary handles writes; replicas handle reads
- Scale: 1 primary + 3–10 replicas (typical)
- Read/write split in application layer or proxy
- Replication lag: application must handle stale reads
- Tools: PgBouncer (connection pool), ProxySQL (MySQL routing), Pgpool-II
Cloud-managed HA:
- AWS RDS: Multi-AZ (automatic failover, 60–120 seconds)
- AWS Aurora: Multi-AZ with 5 copies across AZs (failover < 30 seconds)
- Azure SQL: Active Geo-Replica (regional redundancy)
- GCP Cloud SQL: Read replicas with automatic failover
- Managed databases typically cost 20–50% premium over standalone
Database Capacity Planning
DATABASE CAPACITY PLANNING
============================
Storage growth tracking:
Monthly measurement:
Database size: [X] GB (growth: [Y] GB/month, [Z]%/month)
Index size: [X] GB (ratio to data: [Y]%)
Temporary storage: [X] GB (peak during ETL/reporting)
Transaction logs: [X] GB/day
Backup storage: [X] GB (typically 1.5–3x database size)
Archive storage: [X] GB (historical data, compliance)
Growth projection (3 years):
Month Database Size Index Size Total Storage Cost/Month
────── ────────────── ────────── ────────────── ──────────
0 (now) 500 GB 150 GB 800 GB $800
6 600 GB 180 GB 960 GB $960
12 720 GB 216 GB 1,152 GB $1,152
24 1,000 GB 300 GB 1,600 GB $1,600
36 1,380 GB 414 GB 2,200 GB $2,200
Action triggers:
- At 70% capacity: begin planning for scaling
- At 80% capacity: execute scaling plan
- At 90% capacity: emergency scaling (add storage immediately)
Scaling strategies:
Vertical scaling (bigger server):
- Increase CPU, memory, storage on existing server
- Pros: simplest, no application changes
- Cons: expensive, downtime for some operations, ceiling on size
- Cost: 2x CPU = 2–3x cost; 2x memory = 1.5–2x cost
Horizontal scaling (more servers):
- Add read replicas for read scaling
- Sharding for write scaling
- Database federation for domain-based splitting
- Pros: virtually unlimited scale
- Cons: complex, application changes needed, data consistency challenges
- Cost: linear scaling, but 30–50% overhead for complexity
Storage optimization:
- Partition tables (by date, region, tenant)
- Archive old data (move to cold storage)
- Compress tables (PostgreSQL: ALTER TABLE ... SET COMPRESSION)
- Table partitioning: split by month/quarter for time-series data
- Impact: 30–70% storage reduction with proper archival
Connection capacity planning:
Current state:
Max connections configured: [X]
Average active connections: [Y]
Peak active connections: [Z]
Connection pool utilization: [Y/Z] × 100%
Connection sizing formula:
Optimal pool size = (CPU cores × 2) + disk drives
Example: 8-core server with SSD = (8 × 2) + 1 = 17 connections per pool
With 4 application servers = 17 × 4 = 68 total connections
Alert thresholds:
- Active connections > 80% of max: WARNING
- Active connections > 90% of max: CRITICAL (new connections will be rejected)
- Connection wait time > 1 second: PERFORMANCE ISSUE
Throughput capacity planning:
Current throughput:
Queries per second (QPS): [X] average, [Y] peak
Transactions per second (TPS): [Z] average, [W] peak
Data ingestion rate: [A] MB/hour
Scaling thresholds:
- QPS approaching 80% of tested capacity: plan scaling
- Response time (p95) > 2× baseline: investigate immediately
- IOPS at 80% of provisioned: increase IOPS or optimize queries
Integration Points
- Prometheus + Grafana: Open-source monitoring stack; MySQL/PostgreSQL exporters; custom dashboards; alerting via Alertmanager
- Datadog Database Monitoring: Agent-based monitoring for PostgreSQL, MySQL, SQL Server, MongoDB; automated dashboards; APM integration
- Percona Monitoring and Management (PMM): Open-source database monitoring; Query Analytics; Performance Schema integration
- pgAdmin / DBeaver: Database management tools with built-in monitoring; query execution plans; server status
- Cloud-native monitoring (AWS CloudWatch RDS, Azure Monitor SQL, GCP Cloud SQL): Managed database metrics; automated scaling; cost tracking
- New Relic Database Performance: Deep database monitoring; query tracing; integration with application performance data
- SolarWinds Database Performance Analyzer: Query performance; wait statistics; capacity planning; trend analysis
- Site24x7 Database Monitoring: Multi-database support; automated alerts; SLA reporting
Edge Cases
- Transaction ID wraparound (PostgreSQL specific): PostgreSQL uses 32-bit transaction IDs; wraparound risk when database age approaches 2 billion; vacuum cannot keep up; database will refuse connections to prevent data corruption
- Monitor: SELECT datname, age(datfrozenxid) FROM pg_database;
- Safe threshold: age < 1 billion; Alert: > 1.5 billion; Critical: > 2 billion
- Fix: VACUUM FREEZE on affected databases; ensure autovacuum is properly configured
- Prevention: autovacuum_freeze_max_age = 150,000,000 (default); monitor on high-write databases
- Database deadlocks (concurrent transactions blocking each other): Deadlocks cause transaction failures and application errors; monitor deadlock frequency; analyze deadlock graphs to identify root cause; implement application-level deadlock prevention (consistent lock ordering, shorter transactions)
- PostgreSQL: SELECT * FROM pg_locks WHERE NOT granted; -- identify blocking chains
- MySQL: SHOW ENGINE INNODB STATUS\G -- view latest deadlock
- SQL Server: sys.dm_tran_locks; extended events for deadlock graph
- Target: < 1 deadlock per day; > 5/day requires code review
- Cross-region database replication (global applications): Replication lag increases with geographic distance; typical lag: 100–500ms per continent hop; application must handle stale reads; consider multi-master with conflict resolution; use read-your-writes consistency for critical paths
- AWS DMS/Aurora Global Database: < 1 second cross-region lag
- Application pattern: route writes to primary region; reads can go to any replica
- Data consistency: eventual consistency acceptable for most reads; strong consistency for financial/auth operations
- NoSQL database monitoring (MongoDB, DynamoDB, Cassandra, Redis): Different monitoring model than RDBMS; monitor cluster health, shard distribution, replication factor, TTL expiry; DynamoDB: monitor read/write capacity utilization, throttling events; Redis: monitor memory usage, eviction rate, hit/miss ratio
- MongoDB: db.serverStatus().ops counts; connection pool metrics; shard balancer activity
- DynamoDB: ConsumedCapacity, ReadCapacityUnits, WriteCapacityUnits, ThrottledRequests
- Redis: INFO command; memory usage > 80% triggers eviction; monitor key expiration patterns
- Database performance under cloud burst (serverless/spike workloads): AWS Aurora Serverless, Azure SQL Serverless, Google Cloud Spanner scale automatically; monitor scale-up/scale-down frequency (cost implication); set min/max capacity to prevent runaway costs; cold start latency on scale-down (5–30 seconds)
- Aurora Serverless v2: min 0.5 ACU, max 128 ACU; scale in 5 seconds
- Cost monitoring: track ACU-hours; alert on unexpected scale-up events
- Cold start mitigation: keep minimum capacity at expected baseline load
- Legacy database migration performance (moving from on-prem to cloud or upgrading version): Performance often degrades post-migration due to different execution plans, statistics, or configuration; baseline performance before migration; compare after migration; adjust configuration (memory allocation, I/O settings); update statistics; review execution plans for major differences
- Pre-migration: capture current performance baselines (top 20 slow queries, throughput, latency)
- Post-migration: compare within 24 hours; adjust within 1 week; full optimization within 1 month
- Common issues: different optimizer behavior, missing indexes, configuration defaults, network latency differences