IT AI Skill

Database Administration

Administer databases including performance tuning, backup/recovery, replication, high availability, schema management, and capacity planning. Use when optimizing database performance, managing backups, configuring replication, planning capacity, or handling...

Database Administration

Administer databases including performance tuning, backup/recovery, replication, high availability, and capacity planning.

Workflow

1. Database Performance Tuning

PERFORMANCE TUNING METHODOLOGY
═══════════════════════════════════════

DIAGNOSTIC PROCESS:
═══════════════════════════════════════

Step 1: Identify Slow Queries
  → Enable slow query log (threshold: 1 second)
  → Use EXPLAIN ANALYZE on top queries
  → Check query cache hit ratio
  → Monitor wait events

Step 2: Analyze Execution Plan
  → Full table scan? → Add index
  → Filesort? → Optimize ORDER BY
  → Temporary table? → Rewrite query
  → Lock wait? → Optimize transactions
  → Network wait? → Check connection pool

Step 3: Optimize
  → Add/modify indexes
  → Rewrite query (avoid SELECT *, subqueries)
  → Partition large tables
  → Materialize frequent aggregations
  → Tune connection pool

EXAMPLE: Slow Query Optimization
═══════════════════════════════════════

Before (3.2 seconds):
═══════════════════════════════════════

  SELECT * FROM orders o
  JOIN customers c ON o.customer_id = c.id
  WHERE o.created_at > '2024-01-01'
  AND c.country = 'US'
  ORDER BY o.total_amount DESC
  LIMIT 100;

EXPLAIN ANALYZE shows:
  → Sequential scan on orders (2.8M rows): 2,800ms
  → Sequential scan on customers (500K rows): 300ms
  → Sort operation: 100ms

After (45ms):
═══════════════════════════════════════

  -- Add composite index
  CREATE INDEX idx_orders_customer_created
    ON orders (customer_id, created_at, total_amount DESC);

  CREATE INDEX idx_customers_country
    ON customers (country);

  -- Optimized query
  SELECT o.id, o.total_amount, o.created_at, c.name
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE o.created_at > '2024-01-01'
    AND c.country = 'US'
    ORDER BY o.total_amount DESC
    LIMIT 100;

EXPLAIN ANALYZE shows:
  → Index scan on orders: 35ms
  → Index lookup on customers: 5ms
  → No sort (index already ordered): 5ms
  → Total: 45ms (71x improvement)

DATABASE HEALTH CHECK:
═══════════════════════════════════════

  Metric                              Current    Threshold    Status
  ──────────────────────────────────────────────────────────────────
  CPU utilization                     45%        <70%          ✓ OK
  Memory utilization                  62%        <80%          ✓ OK
  Disk IOPS                           800        <2000         ✓ OK
  Connection pool usage               65/100     <80           ✓ OK
  Replication lag                     200ms      <1s           ✓ OK
  Deadlocks (last 24h)                0          0             ✓ OK
  Slow queries (last 24h)             12         <50           ⚠ Review
  Table bloat                         8%         <15%          ✓ OK
  Index hit ratio                     99.2%      >95%          ✓ OK
  Buffer cache hit ratio              98.5%      >95%          ✓ OK

2. Backup & Recovery Strategy

BACKUP STRATEGY
═══════════════════════════════════════

Backup Type          Frequency    Retention    RPO      RTO
────────────────────────────────────────────────────────────────
Full backup          Weekly       4 weeks      24 hours  1 hour
Incremental          Daily        14 days      24 hours  30 min
WAL/Transaction log  Continuous   7 days       5 min     5 min
Point-in-time        Continuous   30 days      1 min     15 min

BACKUP SCHEDULE:
═══════════════════════════════════════

  Sunday 2:00 AM: Full backup (all databases)
  Monday-Saturday 2:00 AM: Incremental backup
  Continuous: WAL archiving to S3
  Every 15 minutes: Snapshot (automated)

RECOVERY PROCEDURES:
═══════════════════════════════════════

Scenario 1: Accidental Data Deletion
  → Restore from PITR to point before deletion
  → Extract specific tables/rows from restore
  → Apply to production (point-in-time recovery)
  → RTO: 15-30 minutes

Scenario 2: Corrupted Table
  → Restore table from latest full + incrementals
  → Verify data integrity
  → Swap tables (minimal downtime)
  → RTO: 1-2 hours

Scenario 3: Full Database Failure
  → Failover to replica (automatic if configured)
  → If replica unavailable: Restore from latest backup
  → Replay WAL to recovery point
  → RTO: 30 minutes (with replica), 2 hours (without)

Scenario 4: Disaster Recovery
  → Failover to DR region
  → Restore from cross-region backup
  → Update DNS routing
  → RTO: 4 hours

3. High Availability & Replication

HIGH AVAILABILITY ARCHITECTURE
═══════════════════════════════════════

PostgreSQL HA Setup:
═══════════════════════════════════════

Primary (Writer):
  → Instance: db.r6i.xlarge (4 vCPU, 32 GB)
  → Storage: 500 GB gp3 (3000 IOPS)
  → Role: Accepts read + write

Read Replicas (Readers):
  → 2x db.r6i.large (2 vCPU, 16 GB)
  → Storage: 500 GB gp3
  → Role: Read-only (auto-balanced)
  → Replication lag: <500ms

Connection Routing:
  → PgBouncer (connection pooler)
  → Write connections → Primary
  → Read connections → Replicas (round-robin)
  → Session persistence for transactions

Failover:
  → Patroni + etcd (automatic leader election)
  → Failover time: <30 seconds
  → Automatic DNS update (Route53)
  → Application retry logic (exponential backoff)

MongoDB Replica Set:
═══════════════════════════════════════

  Primary: 1x mongos + config servers
  Secondaries: 2x (synchronous replication)
  Arbiter: 1x (for odd-number voting)
  Write concern: majority (durability)
  Read preference: secondaryPreferred (scale reads)

4. Schema Management & Migration

SCHEMA VERSIONING STRATEGY
═══════════════════════════════════════

Tools: Flyway / Liquibase / Alembic

Migration Process:
═══════════════════════════════════════

  1. Developer creates migration script
     → Up: Apply change (add column, create table)
     → Down: Revert change (drop column, drop table)

  2. CI pipeline validates
     → Syntax check
     → Test against staging database
     → Performance impact assessment

  3. Deploy to staging
     → Apply migration
     → Run integration tests
     → Verify no regressions

  4. Deploy to production
     → Apply during maintenance window (if downtime)
     → Use zero-downtime migration patterns (expand/contract)
     → Monitor for errors

ZERO-DOWNTIME MIGRATION PATTERNS:
═══════════════════════════════════════

Pattern 1: Add Column (backward compatible)
  → Add new column (nullable)
  → Deploy code that reads both old and new
  → Backfill new column from old data
  → Deploy code using only new column
  → Remove old column

Pattern 2: Change Column Type
  → Add new column with new type
  → Dual-write (write to both columns)
  → Backfill new column
  → Switch reads to new column
  → Stop dual-write
  → Drop old column

Pattern 3: Rename Table
  → Create new table with new name
  → Dual-write to both tables
  → Backfill new table
  → Switch reads to new table
  → Drop old table

5. Capacity Planning

CAPACITY PLANNING
═══════════════════════════════════════

GROWTH ANALYSIS:
═══════════════════════════════════════

Metric                Current     Growth Rate    6-Month    12-Month   Threshold
                    (Monthly)   (Monthly)     Projection  Projection
────────────────────────────────────────────────────────────────────────────────
Database size       250 GB      5%             336 GB      438 GB     500 GB ⚠
Daily transactions  2.5M        8%             4.6M        8.4M       10M
Peak connections    65/100      6%             93/100      165/100    100 🔴
Avg query time      45ms        —              55ms        75ms       100ms
Storage IOPS        800/3000    7%             1,500       2,700      3,000 ⚠

CAPACITY ACTIONS:
═══════════════════════════════════════

Immediate (this quarter):
  → Scale up connection pool (100 → 200)
  → Add read replica (offload reporting queries)
  → Archive data older than 2 years (reduce size by 40%)

6-Month:
  → Upgrade to larger instance (r6i.xlarge → r6i.2xlarge)
  → Implement table partitioning (orders table by date)
  → Optimize top 10 slow queries

12-Month:
  → Evaluate database sharding (if single DB becomes bottleneck)
  → Migrate to cloud-native database (Aurora/Cloud Spanner)
  → Implement query result caching (Redis)

Edge Cases

Integration Points

Output

Database Status

DATABASE STATUS — Production
═══════════════════════════════════════

Health: GOOD (all metrics within thresholds)
Performance: Avg 45ms query time, 99.2% index hit ratio
HA: Primary + 2 replicas, failover <30s
Backup: Full weekly + continuous WAL archiving
Capacity: Connection pool at 65% — scale by Q2

Actions:
  → Optimize 12 slow queries (identified)
  → Archive 2+ year data (free 100GB)
  → Scale connection pool (100 → 200)