---
name: database-administration
description: Administer databases including performance tuning, backup/recovery, replication, high availability, schema management, and capacity planning. Use when optimizing database performance, managing backups, configuring replication, planning capacity, or handling database incidents. Triggers on phrases like "database administration", "DBA", "performance tuning", "query optimization", "slow query", "index optimization", "backup strategy", "replication", "high availability", "database migration", "capacity planning", "connection pooling", "deadlock", "table partitioning", "vacuum", "reindex".
---

# Database Administration

Administer databases including performance tuning, backup/recovery, replication, high availability, and capacity planning.

## Workflow

### 1. Database Performance Tuning

```
PERFORMANCE TUNING METHODOLOGY
═══════════════════════════════════════

DIAGNOSTIC PROCESS:
═══════════════════════════════════════

Step 1: Identify Slow Queries
  → Enable slow query log (threshold: 1 second)
  → Use EXPLAIN ANALYZE on top queries
  → Check query cache hit ratio
  → Monitor wait events

Step 2: Analyze Execution Plan
  → Full table scan? → Add index
  → Filesort? → Optimize ORDER BY
  → Temporary table? → Rewrite query
  → Lock wait? → Optimize transactions
  → Network wait? → Check connection pool

Step 3: Optimize
  → Add/modify indexes
  → Rewrite query (avoid SELECT *, subqueries)
  → Partition large tables
  → Materialize frequent aggregations
  → Tune connection pool

EXAMPLE: Slow Query Optimization
═══════════════════════════════════════

Before (3.2 seconds):
═══════════════════════════════════════

  SELECT * FROM orders o
  JOIN customers c ON o.customer_id = c.id
  WHERE o.created_at > '2024-01-01'
  AND c.country = 'US'
  ORDER BY o.total_amount DESC
  LIMIT 100;

EXPLAIN ANALYZE shows:
  → Sequential scan on orders (2.8M rows): 2,800ms
  → Sequential scan on customers (500K rows): 300ms
  → Sort operation: 100ms

After (45ms):
═══════════════════════════════════════

  -- Add composite index
  CREATE INDEX idx_orders_customer_created
    ON orders (customer_id, created_at, total_amount DESC);

  CREATE INDEX idx_customers_country
    ON customers (country);

  -- Optimized query
  SELECT o.id, o.total_amount, o.created_at, c.name
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE o.created_at > '2024-01-01'
    AND c.country = 'US'
    ORDER BY o.total_amount DESC
    LIMIT 100;

EXPLAIN ANALYZE shows:
  → Index scan on orders: 35ms
  → Index lookup on customers: 5ms
  → No sort (index already ordered): 5ms
  → Total: 45ms (71x improvement)

DATABASE HEALTH CHECK:
═══════════════════════════════════════

  Metric                              Current    Threshold    Status
  ──────────────────────────────────────────────────────────────────
  CPU utilization                     45%        <70%          ✓ OK
  Memory utilization                  62%        <80%          ✓ OK
  Disk IOPS                           800        <2000         ✓ OK
  Connection pool usage               65/100     <80           ✓ OK
  Replication lag                     200ms      <1s           ✓ OK
  Deadlocks (last 24h)                0          0             ✓ OK
  Slow queries (last 24h)             12         <50           ⚠ Review
  Table bloat                         8%         <15%          ✓ OK
  Index hit ratio                     99.2%      >95%          ✓ OK
  Buffer cache hit ratio              98.5%      >95%          ✓ OK
```

### 2. Backup & Recovery Strategy

```
BACKUP STRATEGY
═══════════════════════════════════════

Backup Type          Frequency    Retention    RPO      RTO
────────────────────────────────────────────────────────────────
Full backup          Weekly       4 weeks      24 hours  1 hour
Incremental          Daily        14 days      24 hours  30 min
WAL/Transaction log  Continuous   7 days       5 min     5 min
Point-in-time        Continuous   30 days      1 min     15 min

BACKUP SCHEDULE:
═══════════════════════════════════════

  Sunday 2:00 AM: Full backup (all databases)
  Monday-Saturday 2:00 AM: Incremental backup
  Continuous: WAL archiving to S3
  Every 15 minutes: Snapshot (automated)

RECOVERY PROCEDURES:
═══════════════════════════════════════

Scenario 1: Accidental Data Deletion
  → Restore from PITR to point before deletion
  → Extract specific tables/rows from restore
  → Apply to production (point-in-time recovery)
  → RTO: 15-30 minutes

Scenario 2: Corrupted Table
  → Restore table from latest full + incrementals
  → Verify data integrity
  → Swap tables (minimal downtime)
  → RTO: 1-2 hours

Scenario 3: Full Database Failure
  → Failover to replica (automatic if configured)
  → If replica unavailable: Restore from latest backup
  → Replay WAL to recovery point
  → RTO: 30 minutes (with replica), 2 hours (without)

Scenario 4: Disaster Recovery
  → Failover to DR region
  → Restore from cross-region backup
  → Update DNS routing
  → RTO: 4 hours
```

### 3. High Availability & Replication

```
HIGH AVAILABILITY ARCHITECTURE
═══════════════════════════════════════

PostgreSQL HA Setup:
═══════════════════════════════════════

Primary (Writer):
  → Instance: db.r6i.xlarge (4 vCPU, 32 GB)
  → Storage: 500 GB gp3 (3000 IOPS)
  → Role: Accepts read + write

Read Replicas (Readers):
  → 2x db.r6i.large (2 vCPU, 16 GB)
  → Storage: 500 GB gp3
  → Role: Read-only (auto-balanced)
  → Replication lag: <500ms

Connection Routing:
  → PgBouncer (connection pooler)
  → Write connections → Primary
  → Read connections → Replicas (round-robin)
  → Session persistence for transactions

Failover:
  → Patroni + etcd (automatic leader election)
  → Failover time: <30 seconds
  → Automatic DNS update (Route53)
  → Application retry logic (exponential backoff)

MongoDB Replica Set:
═══════════════════════════════════════

  Primary: 1x mongos + config servers
  Secondaries: 2x (synchronous replication)
  Arbiter: 1x (for odd-number voting)
  Write concern: majority (durability)
  Read preference: secondaryPreferred (scale reads)
```

### 4. Schema Management & Migration

```
SCHEMA VERSIONING STRATEGY
═══════════════════════════════════════

Tools: Flyway / Liquibase / Alembic

Migration Process:
═══════════════════════════════════════

  1. Developer creates migration script
     → Up: Apply change (add column, create table)
     → Down: Revert change (drop column, drop table)

  2. CI pipeline validates
     → Syntax check
     → Test against staging database
     → Performance impact assessment

  3. Deploy to staging
     → Apply migration
     → Run integration tests
     → Verify no regressions

  4. Deploy to production
     → Apply during maintenance window (if downtime)
     → Use zero-downtime migration patterns (expand/contract)
     → Monitor for errors

ZERO-DOWNTIME MIGRATION PATTERNS:
═══════════════════════════════════════

Pattern 1: Add Column (backward compatible)
  → Add new column (nullable)
  → Deploy code that reads both old and new
  → Backfill new column from old data
  → Deploy code using only new column
  → Remove old column

Pattern 2: Change Column Type
  → Add new column with new type
  → Dual-write (write to both columns)
  → Backfill new column
  → Switch reads to new column
  → Stop dual-write
  → Drop old column

Pattern 3: Rename Table
  → Create new table with new name
  → Dual-write to both tables
  → Backfill new table
  → Switch reads to new table
  → Drop old table
```

### 5. Capacity Planning

```
CAPACITY PLANNING
═══════════════════════════════════════

GROWTH ANALYSIS:
═══════════════════════════════════════

Metric                Current     Growth Rate    6-Month    12-Month   Threshold
                    (Monthly)   (Monthly)     Projection  Projection
────────────────────────────────────────────────────────────────────────────────
Database size       250 GB      5%             336 GB      438 GB     500 GB ⚠
Daily transactions  2.5M        8%             4.6M        8.4M       10M
Peak connections    65/100      6%             93/100      165/100    100 🔴
Avg query time      45ms        —              55ms        75ms       100ms
Storage IOPS        800/3000    7%             1,500       2,700      3,000 ⚠

CAPACITY ACTIONS:
═══════════════════════════════════════

Immediate (this quarter):
  → Scale up connection pool (100 → 200)
  → Add read replica (offload reporting queries)
  → Archive data older than 2 years (reduce size by 40%)

6-Month:
  → Upgrade to larger instance (r6i.xlarge → r6i.2xlarge)
  → Implement table partitioning (orders table by date)
  → Optimize top 10 slow queries

12-Month:
  → Evaluate database sharding (if single DB becomes bottleneck)
  → Migrate to cloud-native database (Aurora/Cloud Spanner)
  → Implement query result caching (Redis)
```

## Edge Cases

- **Multi-region**: Cross-region replication, conflict resolution
- **Data residency**: GDPR, data localization requirements
- **High write volume**: Write-ahead log optimization, batching
- **Large tables**: Partitioning, archiving, materialized views
- **Mixed workloads**: OLTP vs OLAP separation

## Integration Points

- **Databases**: PostgreSQL, MySQL, MongoDB, Redis, Elasticsearch
- **Monitoring**: Prometheus, Grafana, Datadog, pg_stat_statements
- **Backup**: pgBackRest, mongodump, AWS RDS snapshots
- **Orchestration**: Patroni, Kubernetes Operators
- **CI/CD**: Flyway, Liquibase, GitHub Actions
- **Cloud**: AWS RDS, Azure SQL, GCP Cloud SQL

## Output

### Database Status

```
DATABASE STATUS — Production
═══════════════════════════════════════

Health: GOOD (all metrics within thresholds)
Performance: Avg 45ms query time, 99.2% index hit ratio
HA: Primary + 2 replicas, failover <30s
Backup: Full weekly + continuous WAL archiving
Capacity: Connection pool at 65% — scale by Q2

Actions:
  → Optimize 12 slow queries (identified)
  → Archive 2+ year data (free 100GB)
  → Scale connection pool (100 → 200)
```