IT AI Skill

Database Performance Monitoring

Monitor database health, query performance, resource utilization, replication, and capacity. Use when diagnosing slow queries, monitoring database availability, tracking replication lag, optimizing database performance, planning database capacity, managing connection pools, or investigating database incidents. Triggers on phrases like "database performance", "slow query", "query optimization", "database monitoring", "replication lag", "connection pool", "database capacity", "database health", "query plan", "database alerting".

Database Performance Monitoring

Comprehensive monitoring and optimization of database infrastructure across all environments.

Workflow

Deploy database monitoring agents on all production, staging, and development databases.
Configure health checks: availability, response time, connection pool status, replication health.
Enable slow query logging with threshold configured per database type and workload.
Build real-time dashboards: query performance, resource utilization, wait events, I/O metrics.
Implement automated query analysis: execution plan review, index recommendations, query optimization suggestions.
Configure alerts: slow queries, connection exhaustion, replication lag, disk space, deadlock detection.
Conduct weekly performance reviews: identify trending issues before they become incidents.
Perform monthly capacity planning: storage growth, connection trends, throughput projections.
Execute quarterly optimization: index analysis, statistics updates, table maintenance, configuration tuning.
Document performance baselines and optimization changes for all databases.

Database Health Monitoring

DATABASE HEALTH CHECK FRAMEWORK
================================

Availability Monitoring:

  Check method:              Frequency      Alert Threshold     Response
  ──────────────────────    ────────────    ───────────────     ──────────
  TCP port connectivity     Every 10 sec    Down > 30 sec       Auto-failover + page
  Database ping/query       Every 15 sec    Response > 5 sec    Alert DBA team
  Application connection    Every 30 sec    Failures > 10%      Page on-call
  Replication status        Every 30 sec    Lag > 60 sec        Alert DBA
  Read replica health       Every 60 sec    Any replica down    Alert DBA

  Database-specific health queries:

  PostgreSQL:
    SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
    SELECT * FROM pg_stat_replication;  -- replication lag
    SELECT datname, age(datfrozenxid) FROM pg_database;  -- transaction ID wraparound risk

  MySQL:
    SHOW PROCESSLIST;  -- active connections
    SHOW SLAVE STATUS\G  -- replication status
    SHOW ENGINE INNODB STATUS\G  -- InnoDB status

  SQL Server:
    SELECT * FROM sys.dm_os_ring_buffers;  -- error ring buffer
    SELECT * FROM sys.dm_os_wait_stats;  -- wait statistics
    SELECT * FROM sys.dm_os_performance_counters;  -- performance counters

  MongoDB:
    db.serverStatus()  -- comprehensive server status
    db.currentOp()     -- current operations
    rs.status()        -- replica set status

Resource Utilization Monitoring:

  CPU:
    Database CPU vs. Host CPU (identify if bottleneck is database or other processes)
    Alert: database CPU > 80% sustained for 5 minutes
    Target: < 70% average, < 90% peak
    Common causes: unoptimized queries, missing indexes, excessive sorting, full table scans

  Memory:
    Buffer pool / cache hit ratio (target: > 99% for InnoDB, > 95% for PostgreSQL shared buffers)
    Working set size vs. available memory
    Alert: buffer pool hit ratio < 95% (indicates insufficient memory or query pattern issues)
    Swap usage: any swap = performance degradation (alert immediately)
    Connection memory: per-connection memory × active connections

  Disk I/O:
    IOPS (Input/Output Operations Per Second):
      SSD: 3,000–50,000 IOPS typical (NVMe: 100,000+)
      HDD: 100–200 IOPS typical
      Cloud (AWS gp3): 3,000 baseline, scale to 16,000
    Throughput:
      SSD: 500 MB/s–3.5 GB/s (NVMe)
      HDD: 100–200 MB/s
      Cloud (AWS gp3): 125 MB/s baseline, scale to 1,000 MB/s
    I/O wait:
      Alert: iowait > 20% (disk cannot keep up with requests)
      Target: < 10% iowait
    Alert: disk latency > 10ms for SSD, > 20ms for HDD

  Disk Space:
    Tablespaces / data files growth rate
    Index size vs. data size (ratio should be 20–40%; > 50% indicates over-indexing)
    Temporary tablespace usage (spike indicates complex queries)
    Undo / rollback segment usage
    Alert: disk usage > 80% (warning), > 90% (critical — database can stop)
    Auto-expand: ensure data files have auto-expand enabled (but monitor)

  Network:
    Database network throughput
    Connection count and connection rate
    Replication network bandwidth (for master-slave setups)
    Latency between application server and database
    Alert: network latency > 5ms for co-located, > 20ms for cross-AZ

Query Performance Analysis

SLOW QUERY ANALYSIS FRAMEWORK
===============================

Slow Query Log Configuration:

  PostgreSQL (log_min_duration_statement):
    # Log queries taking longer than 1 second
    log_min_duration_statement = 1000  # milliseconds
    # For high-performance databases, lower threshold:
    log_min_duration_statement = 100   # for APIs requiring < 100ms response

  MySQL (slow_query_log):
    slow_query_log = ON
    long_query_time = 1.0    # seconds
    log_queries_not_using_indexes = ON   # critical for optimization
    slow_query_log_file = /var/log/mysql/slow-query.log

  SQL Server (Extended Events / Query Store):
    -- Enable Query Store
    ALTER DATABASE [dbname] SET QUERY_STORE = ON;
    -- Capture queries with duration > 1 second
    execution_plan_stats = ON

  MongoDB (Profiling):
    db.setProfilingLevel(1, { slowms: 100 })  # profile queries > 100ms

Slow Query Analysis Process:

  Step 1: Identify slow queries
    - Review slow query log (aggregate by query pattern, not individual query)
    - Group by query fingerprint (normalize parameters)
    - Sort by: total time, average time, frequency
    - Top 20 slow queries typically account for 80% of query time

  Step 2: Analyze execution plan
    - PostgreSQL: EXPLAIN ANALYZE <query>
    - MySQL: EXPLAIN <query>
    - SQL Server: Execution Plan (Actual or Estimated)
    - MongoDB: db.collection.find().explain("executionStats")

  Key execution plan indicators:
    FULL TABLE SCAN:    ⚠️  Scanning all rows — index needed
    INDEX SCAN:         ✅  Using index — good
    INDEX SEEK:         ✅  Best — direct row access
    NESTED LOOP JOIN:   ⚠️  May be slow for large tables
    HASH JOIN:          ✅  Efficient for large datasets
    SORT:               ⚠️  Sorting in memory — may spill to disk
    TEMP TABLE:         ⚠️  Creating temporary table — memory issue
    FILESORT:           ⚠️  Sorting on disk — performance hit

  Step 3: Optimization strategies

    Missing indexes:
      - Add index on WHERE clause columns
      - Add composite index for multi-column WHERE
      - Covering index (include SELECT columns to avoid table lookup)
      - Example: CREATE INDEX idx_users_email_status ON users(email, status);
      - Impact: 10x–1000x improvement for indexed queries

    Query rewriting:
      - Replace SELECT * with specific columns
      - Avoid subqueries; use JOINs instead
      - Use LIMIT for pagination (avoid OFFSET for deep pagination)
      - Use EXISTS instead of IN for subqueries
      - Batch operations instead of row-by-row
      - Impact: 2x–10x improvement

    Schema optimization:
      - Normalize/denormalize based on access patterns
      - Partition large tables (> 10M rows)
      - Archive historical data to separate table
      - Use appropriate data types (INT vs BIGINT, VARCHAR length)
      - Impact: 2x–5x improvement

    Connection optimization:
      - Use connection pooling (PgBouncer, HikariCP, pgpool)
      - Set appropriate pool size: (CPU cores × 2) + disk drives (formula)
      - Connection timeout: 30 seconds idle
      - Max connections per database server: 100–500 (depends on memory)
      - Impact: reduces connection overhead by 50–90%

  Step 4: Verification
    - Re-run EXPLAIN ANALYZE after optimization
    - Compare execution time before/after
    - Run load test to validate under production load
    - Monitor for 1 week to ensure no regression

Replication and High Availability

REPLICATION MONITORING FRAMEWORK
==================================

Replication types and monitoring:

  PostgreSQL (Streaming Replication):
    Monitor query:
      SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
             replay_lag, write_lag, flush_lag, replay_lag
      FROM pg_stat_replication;

    Key metrics:
      replay_lag:            Target < 1 second; Alert > 10 seconds; Critical > 60 seconds
      state:                 Must be "streaming"; alert on "catchup" or disconnected
      WAL sender process:    Must be running on primary
      WAL receiver process:  Must be running on replica

    Failure modes:
      - Network partition: replica stops receiving WAL
      - Disk full on replica: WAL cannot be written
      - Version mismatch: primary upgraded, replica not yet
      - Max WAL senders: all sender connections used

  MySQL (Binlog Replication / Group Replication):
    Monitor query:
      SHOW SLAVE STATUS\G  -- traditional replication
      SELECT * FROM performance_schema.replication_group_members;  -- group replication

    Key metrics:
      Seconds_Behind_Master: Target < 1 second; Alert > 30 seconds; Critical > 120 seconds
      Slave_IO_Running:      Must be "Yes"
      Slave_SQL_Running:     Must be "Yes"
      Last_Errno:            Must be 0 (no errors)
      Relay_Log_Space:       Track for growing gap

    Group Replication (MySQL 8.0+):
      - Multi-primary or single-primary mode
      - Automatic conflict detection and resolution
      - Monitor group membership and view changes
      - Alert on member leaving group

  SQL Server (AlwaysOn / Log Shipping):
    Monitor:
      SELECT * FROM sys.dm_hadr_availability_replica_states;
      SELECT * FROM sys.dm_hadr_database_replica_states;

    Key metrics:
      Synchronization state:  Must be "SYNCHRONIZED" or "SYNCHRONIZING"
      Log send rate:          Track for throughput issues
      Redo queue size:        Target < 1 MB; Alert > 10 MB
      Redo rate:             Must be > 0 (applying logs)

  MongoDB (Replica Set):
    Monitor:
      rs.status()

    Key metrics:
      stateStr:               Primary = "PRIMARY"; Secondaries = "SECONDARY"
      health:                 Must be 1 (healthy)
      optimeDate:             Compare primary vs secondary for replication lag
      electionDate:           Track elections (frequent elections = stability issue)
      heartbeatAvgMillis:     Network latency between members

High Availability Architecture:

  Active-Passive (most common):
    - Primary handles all writes; standby on hot standby
    - Failover time: 30 seconds–5 minutes (automatic with Patroni/PgBouncer for PostgreSQL)
    - RPO: near-zero (streaming replication); some seconds of data loss possible
    - Tools: Patroni (PostgreSQL), Orchestrator (MySQL), AlwaysOn (SQL Server)

  Active-Active (advanced):
    - Multiple writable nodes (requires conflict resolution)
    - PostgreSQL: Citus, BDR, or logical replication
    - MySQL: Group Replication, Galera Cluster
    - MongoDB: sharded cluster (different pattern)
    - Complexity: high; only use when write scalability required

  Read Replicas (scaling reads):
    - Primary handles writes; replicas handle reads
    - Scale: 1 primary + 3–10 replicas (typical)
    - Read/write split in application layer or proxy
    - Replication lag: application must handle stale reads
    - Tools: PgBouncer (connection pool), ProxySQL (MySQL routing), Pgpool-II

  Cloud-managed HA:
    - AWS RDS: Multi-AZ (automatic failover, 60–120 seconds)
    - AWS Aurora: Multi-AZ with 5 copies across AZs (failover < 30 seconds)
    - Azure SQL: Active Geo-Replica (regional redundancy)
    - GCP Cloud SQL: Read replicas with automatic failover
    - Managed databases typically cost 20–50% premium over standalone

Database Capacity Planning

DATABASE CAPACITY PLANNING
============================

Storage growth tracking:

  Monthly measurement:
    Database size:           [X] GB (growth: [Y] GB/month, [Z]%/month)
    Index size:              [X] GB (ratio to data: [Y]%)
    Temporary storage:       [X] GB (peak during ETL/reporting)
    Transaction logs:        [X] GB/day
    Backup storage:          [X] GB (typically 1.5–3x database size)
    Archive storage:         [X] GB (historical data, compliance)

  Growth projection (3 years):

    Month    Database Size    Index Size    Total Storage    Cost/Month
    ──────   ──────────────   ──────────   ──────────────   ──────────
    0 (now)  500 GB           150 GB       800 GB           $800
    6        600 GB           180 GB       960 GB           $960
    12       720 GB           216 GB       1,152 GB         $1,152
    24       1,000 GB         300 GB       1,600 GB         $1,600
    36       1,380 GB         414 GB       2,200 GB         $2,200

  Action triggers:
    - At 70% capacity: begin planning for scaling
    - At 80% capacity: execute scaling plan
    - At 90% capacity: emergency scaling (add storage immediately)

  Scaling strategies:

    Vertical scaling (bigger server):
      - Increase CPU, memory, storage on existing server
      - Pros: simplest, no application changes
      - Cons: expensive, downtime for some operations, ceiling on size
      - Cost: 2x CPU = 2–3x cost; 2x memory = 1.5–2x cost

    Horizontal scaling (more servers):
      - Add read replicas for read scaling
      - Sharding for write scaling
      - Database federation for domain-based splitting
      - Pros: virtually unlimited scale
      - Cons: complex, application changes needed, data consistency challenges
      - Cost: linear scaling, but 30–50% overhead for complexity

    Storage optimization:
      - Partition tables (by date, region, tenant)
      - Archive old data (move to cold storage)
      - Compress tables (PostgreSQL: ALTER TABLE ... SET COMPRESSION)
      - Table partitioning: split by month/quarter for time-series data
      - Impact: 30–70% storage reduction with proper archival

Connection capacity planning:

  Current state:
    Max connections configured: [X]
    Average active connections: [Y]
    Peak active connections:    [Z]
    Connection pool utilization: [Y/Z] × 100%

  Connection sizing formula:
    Optimal pool size = (CPU cores × 2) + disk drives
    Example: 8-core server with SSD = (8 × 2) + 1 = 17 connections per pool
    With 4 application servers = 17 × 4 = 68 total connections

  Alert thresholds:
    - Active connections > 80% of max: WARNING
    - Active connections > 90% of max: CRITICAL (new connections will be rejected)
    - Connection wait time > 1 second: PERFORMANCE ISSUE

Throughput capacity planning:

  Current throughput:
    Queries per second (QPS):     [X] average, [Y] peak
    Transactions per second (TPS): [Z] average, [W] peak
    Data ingestion rate:           [A] MB/hour

  Scaling thresholds:
    - QPS approaching 80% of tested capacity: plan scaling
    - Response time (p95) > 2× baseline: investigate immediately
    - IOPS at 80% of provisioned: increase IOPS or optimize queries

Integration Points

Prometheus + Grafana: Open-source monitoring stack; MySQL/PostgreSQL exporters; custom dashboards; alerting via Alertmanager
Datadog Database Monitoring: Agent-based monitoring for PostgreSQL, MySQL, SQL Server, MongoDB; automated dashboards; APM integration
Percona Monitoring and Management (PMM): Open-source database monitoring; Query Analytics; Performance Schema integration
pgAdmin / DBeaver: Database management tools with built-in monitoring; query execution plans; server status
Cloud-native monitoring (AWS CloudWatch RDS, Azure Monitor SQL, GCP Cloud SQL): Managed database metrics; automated scaling; cost tracking
New Relic Database Performance: Deep database monitoring; query tracing; integration with application performance data
SolarWinds Database Performance Analyzer: Query performance; wait statistics; capacity planning; trend analysis
Site24x7 Database Monitoring: Multi-database support; automated alerts; SLA reporting

Edge Cases

Transaction ID wraparound (PostgreSQL specific): PostgreSQL uses 32-bit transaction IDs; wraparound risk when database age approaches 2 billion; vacuum cannot keep up; database will refuse connections to prevent data corruption
Monitor: SELECT datname, age(datfrozenxid) FROM pg_database;
Safe threshold: age < 1 billion; Alert: > 1.5 billion; Critical: > 2 billion
Fix: VACUUM FREEZE on affected databases; ensure autovacuum is properly configured
Prevention: autovacuum_freeze_max_age = 150,000,000 (default); monitor on high-write databases

Database deadlocks (concurrent transactions blocking each other): Deadlocks cause transaction failures and application errors; monitor deadlock frequency; analyze deadlock graphs to identify root cause; implement application-level deadlock prevention (consistent lock ordering, shorter transactions)
PostgreSQL: SELECT * FROM pg_locks WHERE NOT granted; -- identify blocking chains
MySQL: SHOW ENGINE INNODB STATUS\G -- view latest deadlock
SQL Server: sys.dm_tran_locks; extended events for deadlock graph
Target: < 1 deadlock per day; > 5/day requires code review

Cross-region database replication (global applications): Replication lag increases with geographic distance; typical lag: 100–500ms per continent hop; application must handle stale reads; consider multi-master with conflict resolution; use read-your-writes consistency for critical paths
AWS DMS/Aurora Global Database: < 1 second cross-region lag
Application pattern: route writes to primary region; reads can go to any replica
Data consistency: eventual consistency acceptable for most reads; strong consistency for financial/auth operations

NoSQL database monitoring (MongoDB, DynamoDB, Cassandra, Redis): Different monitoring model than RDBMS; monitor cluster health, shard distribution, replication factor, TTL expiry; DynamoDB: monitor read/write capacity utilization, throttling events; Redis: monitor memory usage, eviction rate, hit/miss ratio
MongoDB: db.serverStatus().ops counts; connection pool metrics; shard balancer activity
DynamoDB: ConsumedCapacity, ReadCapacityUnits, WriteCapacityUnits, ThrottledRequests
Redis: INFO command; memory usage > 80% triggers eviction; monitor key expiration patterns

Database performance under cloud burst (serverless/spike workloads): AWS Aurora Serverless, Azure SQL Serverless, Google Cloud Spanner scale automatically; monitor scale-up/scale-down frequency (cost implication); set min/max capacity to prevent runaway costs; cold start latency on scale-down (5–30 seconds)
Aurora Serverless v2: min 0.5 ACU, max 128 ACU; scale in 5 seconds
Cost monitoring: track ACU-hours; alert on unexpected scale-up events
Cold start mitigation: keep minimum capacity at expected baseline load

Legacy database migration performance (moving from on-prem to cloud or upgrading version): Performance often degrades post-migration due to different execution plans, statistics, or configuration; baseline performance before migration; compare after migration; adjust configuration (memory allocation, I/O settings); update statistics; review execution plans for major differences
Pre-migration: capture current performance baselines (top 20 slow queries, throughput, latency)
Post-migration: compare within 24 hours; adjust within 1 week; full optimization within 1 month
Common issues: different optimizer behavior, missing indexes, configuration defaults, network latency differences

Disclaimer: All rights reserved by Circulos AI. These skills are specifically designed for Claude Code, Claude Cowork, Codex, and OpenClaw. When using or referencing any skill, please provide proper attribution to Circulos AI.