---
name: database-operations
description: Manage database operations including performance tuning, backup and recovery, high availability, replication, schema management, database migrations, capacity planning, and database security. Use when optimizing database performance, managing database backups, configuring high availability, planning database migrations, or ensuring database security. Triggers on phrases like "database operations", "DBA", "database performance", "query optimization", "database backup", "database recovery", "high availability", "replication", "database migration", "schema change", "database scaling", "database security", "PostgreSQL", "MySQL", "MongoDB", "Redis", "database monitoring", "connection pool", "deadlock", "slow query".
---

# Database Operations

Manage database performance, reliability, security, and capacity across all database platforms.

## Database Infrastructure

### Database Inventory & Architecture

```
DATABASE INFRASTRUCTURE:
════════════════════════

DATABASE INVENTORY:
  ┌──────────────────────────┬──────────┬──────────┬──────────────────┐
  │ Database                 │ Count    │ Type     │ Primary Use      │
  ├──────────────────────────┼──────────┼──────────┼──────────────────┤
  │ PostgreSQL               │ 12       │ RDBMS    │ Applications, CRM│
  │ MySQL                    │ 6        │ RDBMS    │ Web apps, CMS    │
  │ MongoDB                  │ 4        │ NoSQL    │ Document store   │
  │ Redis                    │ 4        │ Cache    │ Session, cache   │
  │ ────────────────────── │ ────── │ ────── │ ─────────────── │
  │ TOTAL                  │ 26     │ Mixed    │                  │
  └──────────────────────────┴──────────┴──────────┴──────────────────┘

  Deployment model:
    Managed cloud: 20 databases (AWS RDS, Azure SQL, Azure Cosmos DB)
    Self-managed: 4 databases (K8s pods — legacy, migration planned)
    Serverless: 2 databases (AWS Aurora Serverless, Azure Cosmos)
  
  Data classification:
    Production: 18 databases (critical — HA enabled)
    Staging: 4 databases (mirrors production — anonymized data)
    Development: 4 databases (local/dev — test data)

HIGH AVAILABILITY:
  Production databases (18):
    ┌──────────────────────────┬──────────┬──────────────────┬────────────┐
    │ HA Strategy              │ Count    │ RTO            │ RPO        │
    ├──────────────────────────┼──────────┼──────────────────┼────────────┤
    │ Multi-AZ (auto-failover) │ 12       │ <60 seconds      │ <5 sec     │
    │ Read replicas (3-5)      │ 12       │ N/A (read-only)  │ <30 sec    │
    │ Cross-region replication │ 8        │ <5 minutes       │ <1 min     │
    │ Point-in-time recovery   │ 18       │ <1 minute        │ <5 sec     │
    │ Automated backup         │ 18       │ <1 hour          │ <1 hour    │
    └──────────────────────────┴──────────┴──────────────────┴────────────┘

  HA testing:
    Chaos engineering: Quarterly (failover test — maintenance window)
    Last test: Q4 2024 (January 2025)
    Result: All 18 databases passed (failover <60 seconds, RPO met)
    Next test: Q1 2025 (April)
  
  Failover history (January 2025):
    Planned: 2 (maintenance — zero downtime)
    Unplanned: 0 (no incidents)
    Total failover time: 45 seconds (avg., planned)

CAPACITY OVERVIEW:
  ┌──────────────────────────┬──────────┬──────────┬──────────┐
  │ Metric                   │ Current  │ Max      │ Util %   │
  ├──────────────────────────┼──────────┼──────────┼──────────┤
  │ Total storage            │ 12 TB    │ 20 TB    │ 60%      │
  │ CPU (avg.)               │ 42%      │ 80%      │ 42%      │
  │ Memory (avg.)            │ 55%      │ 85%      │ 55%      │
  │ Connections (avg.)       │ 480      │ 1,000    │ 48%      │
  │ IOPS (avg.)              │ 8,500    │ 20,000   │ 42.5%    │
  │ ────────────────────── │ ────── │ ────── │ ────── │
  │ Status                 │ Healthy│ Healthy  │ Healthy  │
  └──────────────────────────┴──────────┴──────────┴──────────┘

  12-month projection:
    Storage: 12 TB → 18 TB (50% growth — add capacity at 15 TB)
    Connections: 480 → 720 (50% growth — monitor at 700)
    IOPS: 8,500 → 12,750 (50% growth — monitor at 12K)
  
  Scaling triggers:
    Storage: Alert at 80% (16 TB), scale at 85% (17 TB)
    CPU: Alert at 70%, scale at 80%
    Memory: Alert at 75%, scale at 85%
    Connections: Alert at 70%, scale at 80%
```

## Performance Management

### Query Optimization & Tuning

```
PERFORMANCE MANAGEMENT:
══════════════════════

MONITORING METRICS (Real-time):
  ┌──────────────────────────┬──────────┬──────────┬──────────┐
  │ Metric                   │ Warning  │ Critical │ Current  │
  ├──────────────────────────┼──────────┼──────────┼──────────┤
  │ CPU utilization          │ >70%     │ >85%     │ 42%      │
  │ Memory utilization       │ >75%     │ >90%     │ 55%      │
  │ Disk I/O latency         │ >10ms    │ >50ms    │ 4.2ms    │
  │ Connections used         │ >70%     │ >85%     │ 48%      │
  │ Replication lag          │ >10 sec  │ >60 sec  │ 0.8 sec  │
  │ Slow queries (>1s)       │ >5/hr    │ >20/hr   │ 2/hr     │
  │ Deadlocks                │ >0       │ >3       │ 0        │
  │ Cache hit rate           │ <95%     │ <90%     │ 98.5%    │
  │ Table bloat              │ >20%     │ >40%     │ 8%       │
  │ ────────────────────── │ ────── │ ────── │ ────── │
  │ Status                 │ —      │ —      │ Healthy  │
  └──────────────────────────┴──────────┴──────────┴──────────┘

  Monitoring tools:
    PostgreSQL: pg_stat_statements + Datadog DB monitor
    MySQL: Performance Schema + Datadog DB monitor
    MongoDB: DB Profiler + Datadog DB monitor
    Redis: INFO command + Datadog Redis monitor

QUERY OPTIMIZATION:
  Slow query threshold: >1 second (alert), >5 seconds (critical)
  
  Slow query analysis (January 2025):
    Total slow queries: 58 (2/hr avg.)
    ┌──────────────────────────┬──────────┬──────────┐
    │ Root Cause               │ Count    │ Status   │
    ├──────────────────────────┼──────────┼──────────┤
    │ Missing index            │ 18       │ 15 fixed │
    │ Suboptimal query plan    │ 12       │ 10 fixed │
    │ N+1 queries              │ 8        │ 6 fixed  │
    │ Large table scan         │ 10       │ 8 fixed  │
    │ Lock contention          │ 5        │ 5 fixed  │
    │ Application issue        │ 5        │ 4 fixed  │
    │ ────────────────────── │ ────── │ ────── │
    │ TOTAL                  │ 58     │ 48 fixed │
    └──────────────────────────┴──────────┴──────────┘

  Optimization techniques:
    1. Index optimization (add, remove, composite, partial)
    2. Query rewriting (JOIN vs. subquery, CTE optimization)
    3. Query plan analysis (EXPLAIN ANALYZE, force index)
    4. Table partitioning (large tables — by date, range)
    5. Materialized views (frequent aggregation, summary)
    6. Connection pooling (PgBouncer, ProxySQL — reduce overhead)
    7. Caching layer (Redis — reduce DB load)
    8. Read replicas (offload read queries)
  
  Performance improvement (quarterly trend):
    Q4 2024: Avg. query time 45ms → Q1 2025: 32ms (29% improvement)
    Slow queries: 12/hr → 2/hr (83% reduction)
    P99 latency: 250ms → 120ms (52% improvement)

CONNECTION MANAGEMENT:
  Connection pools:
    PgBouncer (PostgreSQL): 12 instances
    ProxySQL (MySQL): 6 instances
    Max connections per pool: 50
    Idle timeout: 300 seconds
    Health check: Every 30 seconds
  
  Connection statistics:
    Avg. active connections: 480 (of 1,000 max)
    Peak connections: 620 (during business hours)
    Connection wait time: 2.5ms (avg.)
    Connection errors: 0 (January 2025)
  
  Connection best practices:
    - Connection pooling (all applications)
    - Max connection limit (per app, per DB)
    - Idle connection timeout (prevent leak)
    - Connection validation (health check)
    - Retry logic (transient error handling)

SCHEMA MANAGEMENT:
  Schema versioning: Flyway (PostgreSQL/MySQL) + Mongomigrator (MongoDB)
  Migration workflow:
    1. Developer writes migration (up + down)
    2. CI pipeline: Validate migration (syntax, conflict check)
    3. Staging: Apply migration (test)
    4. Production: Apply migration (during maintenance window)
    5. Rollback: Down migration (if needed — tested)
  
  Migration policy:
    - Backward-compatible (zero downtime — expand-contract)
    - Data migration: Batch (large tables) + monitoring
    - DDL lock avoidance: Online DDL (PostgreSQL online schema change)
    - Review: DBA + lead engineer (every migration PR)
  
  Schema change statistics (January):
    Migrations applied: 28 (staging: 28, production: 25)
    Rollbacks: 0 (all successful)
    Avg. migration time: 45 seconds (production)
    Zero-downtime: 100% (backward-compatible)
```

## Backup & Recovery

### Data Protection

```
BACKUP & RECOVERY:
══════════════════

BACKUP STRATEGY:
  ┌──────────────────────────┬──────────┬──────────┬──────────┐
  │ Backup Type              │ Freq     │ Retention│ RPO      │
  ├──────────────────────────┼──────────┼──────────┼──────────┤
  │ Full backup              │ Daily    │ 30 days  │ 24 hrs   │
  │ Incremental              │ Hourly   │ 7 days   │ 1 hr     │
  │ WAL/transaction log      │ Continuous│ 7 days  │ <5 sec   │
  │ Point-in-time (PITR)     │ Continuous│ 35 days │ <5 sec   │
  │ Cross-region             │ Daily    │ 14 days  │ 24 hrs   │
  │ Snapshot (storage-level) │ Daily    │ 7 days   │ 1 hr     │
  └──────────────────────────┴──────────┴──────────┴──────────┘

  Backup coverage: 100% (all 26 databases)
  Backup success rate: 99.8% (target: >99%) ✓
  Backup verification: Weekly (test restore — staging)
  Backup encryption: AES-256 (at rest) + TLS (in transit)

BACKUP TESTING:
  Monthly restore test (all production databases):
    Test database: Restore to isolated environment
    Validation: Data integrity + application functionality
    Duration: 2-4 hours (per database, parallel)
    Result (January): 18/18 passed (100%)
  
  Recovery Time Objective (RTO):
    Full database restore: <1 hour (target: <2 hours) ✓
    Point-in-time recovery: <5 minutes (target: <10 min) ✓
    Cross-region recovery: <5 minutes (failover) ✓
    Table-level recovery: <10 minutes (from backup) ✓

DATA RETENTION:
  Production data: 7 years (compliance — financial, HR)
  Application data: 3 years (business requirement)
  Log data: 1 year (audit, troubleshooting)
  Temporary data: 90 days (auto-cleanup)
  
  Compliance requirements:
    Financial: 7 years (SOX, audit)
    HR: 7 years (employment law)
    Customer data: 3 years (privacy — GDPR)
    Application data: 3 years (business need)
  
  Data classification:
    PII: 12 databases (encrypted at rest + in transit)
    Financial: 6 databases (encrypted + audit logging)
    Operational: 8 databases (encrypted at rest)
```

## Database Security

### Access & Compliance

```
DATABASE SECURITY:
══════════════════

ACCESS CONTROL:
  Principle: Least privilege (per application, per user)
  
  Database users: 85 (application accounts) + 12 (DBA accounts)
  ┌──────────────────────────┬──────────┬──────────┐
  │ Access Type              │ Count    │ Review   │
  ├──────────────────────────┼──────────┼──────────┤
  │ Application accounts     │ 85       │ Quarterly│
  │ DBA accounts             │ 12       │ Monthly  │
  │ Read-only (analytics)    │ 18       │ Quarterly│
  │ Service accounts (ETL)   │ 8        │ Monthly  │
  │ ────────────────────── │ ────── │ ────── │
  │ TOTAL                  │ 123    │ Variable │
  └──────────────────────────┴──────────┴──────────┘

  Password policy:
    Length: 20+ characters
    Complexity: Upper, lower, number, special
    Rotation: 90 days (application), 30 days (DBA)
    Storage: Vault (HashiCorp) — no plaintext
  
  Network security:
    - Private subnets (no public access — all databases)
    - Security groups: Application-specific (restrict source)
    - VPC peering: Cross-account access (controlled)
    - SSL/TLS: Required (all connections encrypted)
    - SSH tunnel: DBA access (bastion host, 2FA)

ENCRYPTION:
  At rest:
    PostgreSQL: TDE (Transparent Data Encryption) — 100%
    MySQL: InnoDB encryption — 100%
    MongoDB: Encryption at rest — 100%
    Redis: AES-256 — 100%
    Key management: AWS KMS / Azure Key Vault
  
  In transit:
    All connections: TLS 1.2+ (required, no fallback)
    Certificate: Self-managed (auto-renewal, 90-day rotation)
    Certificate validation: Strict (server + client cert)
  
  Field-level encryption:
    PII columns: AES-256-GCM (application-level)
    Financial data: AES-256-GCM (application-level)
    Key rotation: Annual (automated)

AUDIT LOGGING:
  Audit scope: 100% (all production databases)
  Events logged:
    - All DDL changes (schema modification)
    - All DML changes (data modification — production)
    - All authentication attempts (success + failure)
    - All privilege changes (GRANT/REVOKE)
    - All backup/restore operations
    - All configuration changes
  
  Log retention: 1 year (hot) + 7 years (cold, compliance)
  Log analysis: Automated (anomaly detection — SIEM integration)
  Audit compliance: 100% (SOC 2, ISO 27001, PCI DSS)

VULNERABILITY MANAGEMENT:
  Patch management:
    - OS patches: Monthly (coordinated with IT patch cycle)
    - DB patches: Quarterly (tested in staging, then production)
    - Extension/library patches: As needed (vendor advisory)
  
  Vulnerability scanning:
    - Quarterly scan (commercial scanner — Qualys DB module)
    - Configuration audit (CIS benchmark — automated, weekly)
    - Compliance check (automated — daily)
  
  Security posture (January 2025):
    Critical vulnerabilities: 0 ✓
    High vulnerabilities: 0 ✓
    Medium vulnerabilities: 2 (patch scheduled — Feb)
    Configuration non-compliant: 0 ✓
    Audit findings: 0 ✓
```

## Output

### Database Operations Dashboard

```
DATABASE OPERATIONS DASHBOARD — Jan 2025
══════════════════════════════════════

Infrastructure:
  Total databases: 26 (12 PostgreSQL, 6 MySQL, 4 MongoDB, 4 Redis)
  HA coverage: 100% (production: 18)
  Failover test: Passed (Q4 2024 — all 18)
  Capacity: Storage 60%, CPU 42%, Memory 55% — all healthy

Performance:
  Avg. query time: 32ms (improving — 29% better)
  P99 latency: 120ms
  Slow queries: 2/hr (83% reduction)
  Cache hit rate: 98.5%
  Deadlocks: 0
  Replication lag: 0.8 sec

Backup & Recovery:
  Coverage: 100% (all 26 databases)
  Success rate: 99.8%
  Last restore test: 18/18 passed (100%)
  RTO: <1 hour (target: <2 hours) ✓
  RPO: <5 seconds (PITR)

Security:
  Least privilege: 123 accounts (reviewed)
  Encryption: 100% (at rest + in transit)
  Audit logging: 100% (production)
  Vulnerabilities: 0 critical, 0 high, 2 medium
  Compliance: 100% (SOC 2, ISO 27001, PCI DSS)

Schema:
  Migrations (Jan): 25 (production), 0 rollbacks
  Zero-downtime: 100%
  Avg. migration time: 45 seconds
  Tool: Flyway (RDBMS) + Mongomigrator (MongoDB)

Actions:
  1. Patch 2 medium vulnerabilities (February)
  2. Schema review (quarterly — February)
  3. Capacity planning (storage: 12TB, project 18TB)
  4. HA failover test (Q1 — April)
  5. Performance review (quarterly — March)
```

## Integration Points

- Database engines (PostgreSQL, MySQL, MongoDB, Redis): Native monitoring
- Cloud providers (AWS RDS, Azure SQL, Cosmos DB): Managed DB services
- Monitoring (Datadog, Prometheus, Grafana): Database metrics
- Backup (AWS Backup, Azure Backup, pgBackRest): Automated backup
- Connection pooling (PgBouncer, ProxySQL, HikariCP): Connection management
- Migration tools (Flyway, Liquibase, Mongomigrator): Schema management
- Secret management (HashiCorp Vault, AWS Secrets Manager): Credentials
- Orchestration (Kubernetes, Docker): Self-managed DB deployment
- ETL tools (Airflow, dbt, Talend): Data pipeline
- BI/Analytics (Tableau, Power BI, Metabase): Read replicas, data warehouse
- SIEM (Sentinel, Splunk): Audit log analysis
- ITSM (ServiceNow): Change management, incident tracking
- CMDB: Database inventory, relationship mapping

## Edge Cases

- **Database corruption**: Emergency restore; backup verification; root cause; prevention (checksum, replication)
- **Replication failure**: Lag alert; failover; data loss assessment; replication rebuild
- **Connection exhaustion**: Pool resize; app fix (leak); connection limit; monitoring
- **Performance degradation (slow queries)**: Query analysis; index optimization; cache; scaling
- **Schema migration failure**: Rollback (down migration); data integrity check; retry
- **Data breach (unauthorized access)**: Immediate lockdown; forensic; access review; compliance notification
- **Backup failure**: Alert; alternative backup; root cause; verification
- **Disk space exhaustion**: Auto-alert; cleanup; expansion; monitoring threshold
- **Cross-region disaster**: Failover activation; DNS update; data consistency; recovery
- **Encryption key compromise**: Key rotation; re-encryption; access review; incident response
