IT AI Skill
Database Operations
Manage database operations including performance tuning, backup and recovery, high availability, replication, schema management, database migrations, capacity planning, and database security. Use when optimizing database performance, managing database backu...
Database Operations
Manage database performance, reliability, security, and capacity across all database platforms.
Database Infrastructure
Database Inventory & Architecture
DATABASE INFRASTRUCTURE:
════════════════════════
DATABASE INVENTORY:
┌──────────────────────────┬──────────┬──────────┬──────────────────┐
│ Database │ Count │ Type │ Primary Use │
├──────────────────────────┼──────────┼──────────┼──────────────────┤
│ PostgreSQL │ 12 │ RDBMS │ Applications, CRM│
│ MySQL │ 6 │ RDBMS │ Web apps, CMS │
│ MongoDB │ 4 │ NoSQL │ Document store │
│ Redis │ 4 │ Cache │ Session, cache │
│ ────────────────────── │ ────── │ ────── │ ─────────────── │
│ TOTAL │ 26 │ Mixed │ │
└──────────────────────────┴──────────┴──────────┴──────────────────┘
Deployment model:
Managed cloud: 20 databases (AWS RDS, Azure SQL, Azure Cosmos DB)
Self-managed: 4 databases (K8s pods — legacy, migration planned)
Serverless: 2 databases (AWS Aurora Serverless, Azure Cosmos)
Data classification:
Production: 18 databases (critical — HA enabled)
Staging: 4 databases (mirrors production — anonymized data)
Development: 4 databases (local/dev — test data)
HIGH AVAILABILITY:
Production databases (18):
┌──────────────────────────┬──────────┬──────────────────┬────────────┐
│ HA Strategy │ Count │ RTO │ RPO │
├──────────────────────────┼──────────┼──────────────────┼────────────┤
│ Multi-AZ (auto-failover) │ 12 │ <60 seconds │ <5 sec │
│ Read replicas (3-5) │ 12 │ N/A (read-only) │ <30 sec │
│ Cross-region replication │ 8 │ <5 minutes │ <1 min │
│ Point-in-time recovery │ 18 │ <1 minute │ <5 sec │
│ Automated backup │ 18 │ <1 hour │ <1 hour │
└──────────────────────────┴──────────┴──────────────────┴────────────┘
HA testing:
Chaos engineering: Quarterly (failover test — maintenance window)
Last test: Q4 2024 (January 2025)
Result: All 18 databases passed (failover <60 seconds, RPO met)
Next test: Q1 2025 (April)
Failover history (January 2025):
Planned: 2 (maintenance — zero downtime)
Unplanned: 0 (no incidents)
Total failover time: 45 seconds (avg., planned)
CAPACITY OVERVIEW:
┌──────────────────────────┬──────────┬──────────┬──────────┐
│ Metric │ Current │ Max │ Util % │
├──────────────────────────┼──────────┼──────────┼──────────┤
│ Total storage │ 12 TB │ 20 TB │ 60% │
│ CPU (avg.) │ 42% │ 80% │ 42% │
│ Memory (avg.) │ 55% │ 85% │ 55% │
│ Connections (avg.) │ 480 │ 1,000 │ 48% │
│ IOPS (avg.) │ 8,500 │ 20,000 │ 42.5% │
│ ────────────────────── │ ────── │ ────── │ ────── │
│ Status │ Healthy│ Healthy │ Healthy │
└──────────────────────────┴──────────┴──────────┴──────────┘
12-month projection:
Storage: 12 TB → 18 TB (50% growth — add capacity at 15 TB)
Connections: 480 → 720 (50% growth — monitor at 700)
IOPS: 8,500 → 12,750 (50% growth — monitor at 12K)
Scaling triggers:
Storage: Alert at 80% (16 TB), scale at 85% (17 TB)
CPU: Alert at 70%, scale at 80%
Memory: Alert at 75%, scale at 85%
Connections: Alert at 70%, scale at 80%
Performance Management
Query Optimization & Tuning
PERFORMANCE MANAGEMENT:
══════════════════════
MONITORING METRICS (Real-time):
┌──────────────────────────┬──────────┬──────────┬──────────┐
│ Metric │ Warning │ Critical │ Current │
├──────────────────────────┼──────────┼──────────┼──────────┤
│ CPU utilization │ >70% │ >85% │ 42% │
│ Memory utilization │ >75% │ >90% │ 55% │
│ Disk I/O latency │ >10ms │ >50ms │ 4.2ms │
│ Connections used │ >70% │ >85% │ 48% │
│ Replication lag │ >10 sec │ >60 sec │ 0.8 sec │
│ Slow queries (>1s) │ >5/hr │ >20/hr │ 2/hr │
│ Deadlocks │ >0 │ >3 │ 0 │
│ Cache hit rate │ <95% │ <90% │ 98.5% │
│ Table bloat │ >20% │ >40% │ 8% │
│ ────────────────────── │ ────── │ ────── │ ────── │
│ Status │ — │ — │ Healthy │
└──────────────────────────┴──────────┴──────────┴──────────┘
Monitoring tools:
PostgreSQL: pg_stat_statements + Datadog DB monitor
MySQL: Performance Schema + Datadog DB monitor
MongoDB: DB Profiler + Datadog DB monitor
Redis: INFO command + Datadog Redis monitor
QUERY OPTIMIZATION:
Slow query threshold: >1 second (alert), >5 seconds (critical)
Slow query analysis (January 2025):
Total slow queries: 58 (2/hr avg.)
┌──────────────────────────┬──────────┬──────────┐
│ Root Cause │ Count │ Status │
├──────────────────────────┼──────────┼──────────┤
│ Missing index │ 18 │ 15 fixed │
│ Suboptimal query plan │ 12 │ 10 fixed │
│ N+1 queries │ 8 │ 6 fixed │
│ Large table scan │ 10 │ 8 fixed │
│ Lock contention │ 5 │ 5 fixed │
│ Application issue │ 5 │ 4 fixed │
│ ────────────────────── │ ────── │ ────── │
│ TOTAL │ 58 │ 48 fixed │
└──────────────────────────┴──────────┴──────────┘
Optimization techniques:
1. Index optimization (add, remove, composite, partial)
2. Query rewriting (JOIN vs. subquery, CTE optimization)
3. Query plan analysis (EXPLAIN ANALYZE, force index)
4. Table partitioning (large tables — by date, range)
5. Materialized views (frequent aggregation, summary)
6. Connection pooling (PgBouncer, ProxySQL — reduce overhead)
7. Caching layer (Redis — reduce DB load)
8. Read replicas (offload read queries)
Performance improvement (quarterly trend):
Q4 2024: Avg. query time 45ms → Q1 2025: 32ms (29% improvement)
Slow queries: 12/hr → 2/hr (83% reduction)
P99 latency: 250ms → 120ms (52% improvement)
CONNECTION MANAGEMENT:
Connection pools:
PgBouncer (PostgreSQL): 12 instances
ProxySQL (MySQL): 6 instances
Max connections per pool: 50
Idle timeout: 300 seconds
Health check: Every 30 seconds
Connection statistics:
Avg. active connections: 480 (of 1,000 max)
Peak connections: 620 (during business hours)
Connection wait time: 2.5ms (avg.)
Connection errors: 0 (January 2025)
Connection best practices:
- Connection pooling (all applications)
- Max connection limit (per app, per DB)
- Idle connection timeout (prevent leak)
- Connection validation (health check)
- Retry logic (transient error handling)
SCHEMA MANAGEMENT:
Schema versioning: Flyway (PostgreSQL/MySQL) + Mongomigrator (MongoDB)
Migration workflow:
1. Developer writes migration (up + down)
2. CI pipeline: Validate migration (syntax, conflict check)
3. Staging: Apply migration (test)
4. Production: Apply migration (during maintenance window)
5. Rollback: Down migration (if needed — tested)
Migration policy:
- Backward-compatible (zero downtime — expand-contract)
- Data migration: Batch (large tables) + monitoring
- DDL lock avoidance: Online DDL (PostgreSQL online schema change)
- Review: DBA + lead engineer (every migration PR)
Schema change statistics (January):
Migrations applied: 28 (staging: 28, production: 25)
Rollbacks: 0 (all successful)
Avg. migration time: 45 seconds (production)
Zero-downtime: 100% (backward-compatible)
Backup & Recovery
Data Protection
BACKUP & RECOVERY:
══════════════════
BACKUP STRATEGY:
┌──────────────────────────┬──────────┬──────────┬──────────┐
│ Backup Type │ Freq │ Retention│ RPO │
├──────────────────────────┼──────────┼──────────┼──────────┤
│ Full backup │ Daily │ 30 days │ 24 hrs │
│ Incremental │ Hourly │ 7 days │ 1 hr │
│ WAL/transaction log │ Continuous│ 7 days │ <5 sec │
│ Point-in-time (PITR) │ Continuous│ 35 days │ <5 sec │
│ Cross-region │ Daily │ 14 days │ 24 hrs │
│ Snapshot (storage-level) │ Daily │ 7 days │ 1 hr │
└──────────────────────────┴──────────┴──────────┴──────────┘
Backup coverage: 100% (all 26 databases)
Backup success rate: 99.8% (target: >99%) ✓
Backup verification: Weekly (test restore — staging)
Backup encryption: AES-256 (at rest) + TLS (in transit)
BACKUP TESTING:
Monthly restore test (all production databases):
Test database: Restore to isolated environment
Validation: Data integrity + application functionality
Duration: 2-4 hours (per database, parallel)
Result (January): 18/18 passed (100%)
Recovery Time Objective (RTO):
Full database restore: <1 hour (target: <2 hours) ✓
Point-in-time recovery: <5 minutes (target: <10 min) ✓
Cross-region recovery: <5 minutes (failover) ✓
Table-level recovery: <10 minutes (from backup) ✓
DATA RETENTION:
Production data: 7 years (compliance — financial, HR)
Application data: 3 years (business requirement)
Log data: 1 year (audit, troubleshooting)
Temporary data: 90 days (auto-cleanup)
Compliance requirements:
Financial: 7 years (SOX, audit)
HR: 7 years (employment law)
Customer data: 3 years (privacy — GDPR)
Application data: 3 years (business need)
Data classification:
PII: 12 databases (encrypted at rest + in transit)
Financial: 6 databases (encrypted + audit logging)
Operational: 8 databases (encrypted at rest)
Database Security
Access & Compliance
DATABASE SECURITY:
══════════════════
ACCESS CONTROL:
Principle: Least privilege (per application, per user)
Database users: 85 (application accounts) + 12 (DBA accounts)
┌──────────────────────────┬──────────┬──────────┐
│ Access Type │ Count │ Review │
├──────────────────────────┼──────────┼──────────┤
│ Application accounts │ 85 │ Quarterly│
│ DBA accounts │ 12 │ Monthly │
│ Read-only (analytics) │ 18 │ Quarterly│
│ Service accounts (ETL) │ 8 │ Monthly │
│ ────────────────────── │ ────── │ ────── │
│ TOTAL │ 123 │ Variable │
└──────────────────────────┴──────────┴──────────┘
Password policy:
Length: 20+ characters
Complexity: Upper, lower, number, special
Rotation: 90 days (application), 30 days (DBA)
Storage: Vault (HashiCorp) — no plaintext
Network security:
- Private subnets (no public access — all databases)
- Security groups: Application-specific (restrict source)
- VPC peering: Cross-account access (controlled)
- SSL/TLS: Required (all connections encrypted)
- SSH tunnel: DBA access (bastion host, 2FA)
ENCRYPTION:
At rest:
PostgreSQL: TDE (Transparent Data Encryption) — 100%
MySQL: InnoDB encryption — 100%
MongoDB: Encryption at rest — 100%
Redis: AES-256 — 100%
Key management: AWS KMS / Azure Key Vault
In transit:
All connections: TLS 1.2+ (required, no fallback)
Certificate: Self-managed (auto-renewal, 90-day rotation)
Certificate validation: Strict (server + client cert)
Field-level encryption:
PII columns: AES-256-GCM (application-level)
Financial data: AES-256-GCM (application-level)
Key rotation: Annual (automated)
AUDIT LOGGING:
Audit scope: 100% (all production databases)
Events logged:
- All DDL changes (schema modification)
- All DML changes (data modification — production)
- All authentication attempts (success + failure)
- All privilege changes (GRANT/REVOKE)
- All backup/restore operations
- All configuration changes
Log retention: 1 year (hot) + 7 years (cold, compliance)
Log analysis: Automated (anomaly detection — SIEM integration)
Audit compliance: 100% (SOC 2, ISO 27001, PCI DSS)
VULNERABILITY MANAGEMENT:
Patch management:
- OS patches: Monthly (coordinated with IT patch cycle)
- DB patches: Quarterly (tested in staging, then production)
- Extension/library patches: As needed (vendor advisory)
Vulnerability scanning:
- Quarterly scan (commercial scanner — Qualys DB module)
- Configuration audit (CIS benchmark — automated, weekly)
- Compliance check (automated — daily)
Security posture (January 2025):
Critical vulnerabilities: 0 ✓
High vulnerabilities: 0 ✓
Medium vulnerabilities: 2 (patch scheduled — Feb)
Configuration non-compliant: 0 ✓
Audit findings: 0 ✓
Output
Database Operations Dashboard
DATABASE OPERATIONS DASHBOARD — Jan 2025
══════════════════════════════════════
Infrastructure:
Total databases: 26 (12 PostgreSQL, 6 MySQL, 4 MongoDB, 4 Redis)
HA coverage: 100% (production: 18)
Failover test: Passed (Q4 2024 — all 18)
Capacity: Storage 60%, CPU 42%, Memory 55% — all healthy
Performance:
Avg. query time: 32ms (improving — 29% better)
P99 latency: 120ms
Slow queries: 2/hr (83% reduction)
Cache hit rate: 98.5%
Deadlocks: 0
Replication lag: 0.8 sec
Backup & Recovery:
Coverage: 100% (all 26 databases)
Success rate: 99.8%
Last restore test: 18/18 passed (100%)
RTO: <1 hour (target: <2 hours) ✓
RPO: <5 seconds (PITR)
Security:
Least privilege: 123 accounts (reviewed)
Encryption: 100% (at rest + in transit)
Audit logging: 100% (production)
Vulnerabilities: 0 critical, 0 high, 2 medium
Compliance: 100% (SOC 2, ISO 27001, PCI DSS)
Schema:
Migrations (Jan): 25 (production), 0 rollbacks
Zero-downtime: 100%
Avg. migration time: 45 seconds
Tool: Flyway (RDBMS) + Mongomigrator (MongoDB)
Actions:
1. Patch 2 medium vulnerabilities (February)
2. Schema review (quarterly — February)
3. Capacity planning (storage: 12TB, project 18TB)
4. HA failover test (Q1 — April)
5. Performance review (quarterly — March)
Integration Points
- Database engines (PostgreSQL, MySQL, MongoDB, Redis): Native monitoring
- Cloud providers (AWS RDS, Azure SQL, Cosmos DB): Managed DB services
- Monitoring (Datadog, Prometheus, Grafana): Database metrics
- Backup (AWS Backup, Azure Backup, pgBackRest): Automated backup
- Connection pooling (PgBouncer, ProxySQL, HikariCP): Connection management
- Migration tools (Flyway, Liquibase, Mongomigrator): Schema management
- Secret management (HashiCorp Vault, AWS Secrets Manager): Credentials
- Orchestration (Kubernetes, Docker): Self-managed DB deployment
- ETL tools (Airflow, dbt, Talend): Data pipeline
- BI/Analytics (Tableau, Power BI, Metabase): Read replicas, data warehouse
- SIEM (Sentinel, Splunk): Audit log analysis
- ITSM (ServiceNow): Change management, incident tracking
- CMDB: Database inventory, relationship mapping
Edge Cases
- Database corruption: Emergency restore; backup verification; root cause; prevention (checksum, replication)
- Replication failure: Lag alert; failover; data loss assessment; replication rebuild
- Connection exhaustion: Pool resize; app fix (leak); connection limit; monitoring
- Performance degradation (slow queries): Query analysis; index optimization; cache; scaling
- Schema migration failure: Rollback (down migration); data integrity check; retry
- Data breach (unauthorized access): Immediate lockdown; forensic; access review; compliance notification
- Backup failure: Alert; alternative backup; root cause; verification
- Disk space exhaustion: Auto-alert; cleanup; expansion; monitoring threshold
- Cross-region disaster: Failover activation; DNS update; data consistency; recovery
- Encryption key compromise: Key rotation; re-encryption; access review; incident response