IT AI Skill
Database Administration
Administer databases including performance tuning, backup/recovery, replication, high availability, schema management, and capacity planning. Use when optimizing database performance, managing backups, configuring replication, planning capacity, or handling...
Database Administration
Administer databases including performance tuning, backup/recovery, replication, high availability, and capacity planning.
Workflow
1. Database Performance Tuning
PERFORMANCE TUNING METHODOLOGY
═══════════════════════════════════════
DIAGNOSTIC PROCESS:
═══════════════════════════════════════
Step 1: Identify Slow Queries
→ Enable slow query log (threshold: 1 second)
→ Use EXPLAIN ANALYZE on top queries
→ Check query cache hit ratio
→ Monitor wait events
Step 2: Analyze Execution Plan
→ Full table scan? → Add index
→ Filesort? → Optimize ORDER BY
→ Temporary table? → Rewrite query
→ Lock wait? → Optimize transactions
→ Network wait? → Check connection pool
Step 3: Optimize
→ Add/modify indexes
→ Rewrite query (avoid SELECT *, subqueries)
→ Partition large tables
→ Materialize frequent aggregations
→ Tune connection pool
EXAMPLE: Slow Query Optimization
═══════════════════════════════════════
Before (3.2 seconds):
═══════════════════════════════════════
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2024-01-01'
AND c.country = 'US'
ORDER BY o.total_amount DESC
LIMIT 100;
EXPLAIN ANALYZE shows:
→ Sequential scan on orders (2.8M rows): 2,800ms
→ Sequential scan on customers (500K rows): 300ms
→ Sort operation: 100ms
After (45ms):
═══════════════════════════════════════
-- Add composite index
CREATE INDEX idx_orders_customer_created
ON orders (customer_id, created_at, total_amount DESC);
CREATE INDEX idx_customers_country
ON customers (country);
-- Optimized query
SELECT o.id, o.total_amount, o.created_at, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2024-01-01'
AND c.country = 'US'
ORDER BY o.total_amount DESC
LIMIT 100;
EXPLAIN ANALYZE shows:
→ Index scan on orders: 35ms
→ Index lookup on customers: 5ms
→ No sort (index already ordered): 5ms
→ Total: 45ms (71x improvement)
DATABASE HEALTH CHECK:
═══════════════════════════════════════
Metric Current Threshold Status
──────────────────────────────────────────────────────────────────
CPU utilization 45% <70% ✓ OK
Memory utilization 62% <80% ✓ OK
Disk IOPS 800 <2000 ✓ OK
Connection pool usage 65/100 <80 ✓ OK
Replication lag 200ms <1s ✓ OK
Deadlocks (last 24h) 0 0 ✓ OK
Slow queries (last 24h) 12 <50 ⚠ Review
Table bloat 8% <15% ✓ OK
Index hit ratio 99.2% >95% ✓ OK
Buffer cache hit ratio 98.5% >95% ✓ OK
2. Backup & Recovery Strategy
BACKUP STRATEGY
═══════════════════════════════════════
Backup Type Frequency Retention RPO RTO
────────────────────────────────────────────────────────────────
Full backup Weekly 4 weeks 24 hours 1 hour
Incremental Daily 14 days 24 hours 30 min
WAL/Transaction log Continuous 7 days 5 min 5 min
Point-in-time Continuous 30 days 1 min 15 min
BACKUP SCHEDULE:
═══════════════════════════════════════
Sunday 2:00 AM: Full backup (all databases)
Monday-Saturday 2:00 AM: Incremental backup
Continuous: WAL archiving to S3
Every 15 minutes: Snapshot (automated)
RECOVERY PROCEDURES:
═══════════════════════════════════════
Scenario 1: Accidental Data Deletion
→ Restore from PITR to point before deletion
→ Extract specific tables/rows from restore
→ Apply to production (point-in-time recovery)
→ RTO: 15-30 minutes
Scenario 2: Corrupted Table
→ Restore table from latest full + incrementals
→ Verify data integrity
→ Swap tables (minimal downtime)
→ RTO: 1-2 hours
Scenario 3: Full Database Failure
→ Failover to replica (automatic if configured)
→ If replica unavailable: Restore from latest backup
→ Replay WAL to recovery point
→ RTO: 30 minutes (with replica), 2 hours (without)
Scenario 4: Disaster Recovery
→ Failover to DR region
→ Restore from cross-region backup
→ Update DNS routing
→ RTO: 4 hours
3. High Availability & Replication
HIGH AVAILABILITY ARCHITECTURE
═══════════════════════════════════════
PostgreSQL HA Setup:
═══════════════════════════════════════
Primary (Writer):
→ Instance: db.r6i.xlarge (4 vCPU, 32 GB)
→ Storage: 500 GB gp3 (3000 IOPS)
→ Role: Accepts read + write
Read Replicas (Readers):
→ 2x db.r6i.large (2 vCPU, 16 GB)
→ Storage: 500 GB gp3
→ Role: Read-only (auto-balanced)
→ Replication lag: <500ms
Connection Routing:
→ PgBouncer (connection pooler)
→ Write connections → Primary
→ Read connections → Replicas (round-robin)
→ Session persistence for transactions
Failover:
→ Patroni + etcd (automatic leader election)
→ Failover time: <30 seconds
→ Automatic DNS update (Route53)
→ Application retry logic (exponential backoff)
MongoDB Replica Set:
═══════════════════════════════════════
Primary: 1x mongos + config servers
Secondaries: 2x (synchronous replication)
Arbiter: 1x (for odd-number voting)
Write concern: majority (durability)
Read preference: secondaryPreferred (scale reads)
4. Schema Management & Migration
SCHEMA VERSIONING STRATEGY
═══════════════════════════════════════
Tools: Flyway / Liquibase / Alembic
Migration Process:
═══════════════════════════════════════
1. Developer creates migration script
→ Up: Apply change (add column, create table)
→ Down: Revert change (drop column, drop table)
2. CI pipeline validates
→ Syntax check
→ Test against staging database
→ Performance impact assessment
3. Deploy to staging
→ Apply migration
→ Run integration tests
→ Verify no regressions
4. Deploy to production
→ Apply during maintenance window (if downtime)
→ Use zero-downtime migration patterns (expand/contract)
→ Monitor for errors
ZERO-DOWNTIME MIGRATION PATTERNS:
═══════════════════════════════════════
Pattern 1: Add Column (backward compatible)
→ Add new column (nullable)
→ Deploy code that reads both old and new
→ Backfill new column from old data
→ Deploy code using only new column
→ Remove old column
Pattern 2: Change Column Type
→ Add new column with new type
→ Dual-write (write to both columns)
→ Backfill new column
→ Switch reads to new column
→ Stop dual-write
→ Drop old column
Pattern 3: Rename Table
→ Create new table with new name
→ Dual-write to both tables
→ Backfill new table
→ Switch reads to new table
→ Drop old table
5. Capacity Planning
CAPACITY PLANNING
═══════════════════════════════════════
GROWTH ANALYSIS:
═══════════════════════════════════════
Metric Current Growth Rate 6-Month 12-Month Threshold
(Monthly) (Monthly) Projection Projection
────────────────────────────────────────────────────────────────────────────────
Database size 250 GB 5% 336 GB 438 GB 500 GB ⚠
Daily transactions 2.5M 8% 4.6M 8.4M 10M
Peak connections 65/100 6% 93/100 165/100 100 🔴
Avg query time 45ms — 55ms 75ms 100ms
Storage IOPS 800/3000 7% 1,500 2,700 3,000 ⚠
CAPACITY ACTIONS:
═══════════════════════════════════════
Immediate (this quarter):
→ Scale up connection pool (100 → 200)
→ Add read replica (offload reporting queries)
→ Archive data older than 2 years (reduce size by 40%)
6-Month:
→ Upgrade to larger instance (r6i.xlarge → r6i.2xlarge)
→ Implement table partitioning (orders table by date)
→ Optimize top 10 slow queries
12-Month:
→ Evaluate database sharding (if single DB becomes bottleneck)
→ Migrate to cloud-native database (Aurora/Cloud Spanner)
→ Implement query result caching (Redis)
Edge Cases
- Multi-region: Cross-region replication, conflict resolution
- Data residency: GDPR, data localization requirements
- High write volume: Write-ahead log optimization, batching
- Large tables: Partitioning, archiving, materialized views
- Mixed workloads: OLTP vs OLAP separation
Integration Points
- Databases: PostgreSQL, MySQL, MongoDB, Redis, Elasticsearch
- Monitoring: Prometheus, Grafana, Datadog, pg_stat_statements
- Backup: pgBackRest, mongodump, AWS RDS snapshots
- Orchestration: Patroni, Kubernetes Operators
- CI/CD: Flyway, Liquibase, GitHub Actions
- Cloud: AWS RDS, Azure SQL, GCP Cloud SQL
Output
Database Status
DATABASE STATUS — Production
═══════════════════════════════════════
Health: GOOD (all metrics within thresholds)
Performance: Avg 45ms query time, 99.2% index hit ratio
HA: Primary + 2 replicas, failover <30s
Backup: Full weekly + continuous WAL archiving
Capacity: Connection pool at 65% — scale by Q2
Actions:
→ Optimize 12 slow queries (identified)
→ Archive 2+ year data (free 100GB)
→ Scale connection pool (100 → 200)