IT AI Skill
Capacity Planning Infrastructure
Plan and manage IT infrastructure capacity including compute, storage, network, and cloud resources. Use when forecasting infrastructure needs, right-sizing resources, planning data center expansion, optimizing cloud capacity, conducting capacity reviews, o...
IT Infrastructure Capacity Planning
Forecast, plan, and optimize IT infrastructure capacity to support current and future business needs.
Workflow
- Establish capacity baselines: current utilization for compute, storage, network, and applications.
- Collect historical data: 12-month utilization trends, growth rates, seasonal patterns.
- Forecast demand: business-driven (user growth, transaction volume) and technology-driven (new initiatives).
- Analyze capacity gaps: compare projected demand against current and planned capacity.
- Develop capacity plan: procurement timeline, budget, implementation schedule, risk mitigation.
- Implement right-sizing: optimize current resources before expanding (eliminate waste first).
- Monitor capacity thresholds: automated alerts at 70%, 80%, 90% utilization.
- Conduct quarterly capacity reviews: update forecasts, validate assumptions, adjust plans.
- Report capacity posture: executive dashboard, procurement recommendations, risk assessment.
- Execute capacity improvements: procurement, deployment, configuration, validation.
Compute Capacity Planning
COMPUTE CAPACITY FRAMEWORK
============================
Current State Assessment:
On-Premises Servers:
Total physical servers: [X]
Average utilization: CPU [Y]%, Memory [Z]%
Server density: [X] VMs per physical host (average)
Host capacity headroom: [X] additional VMs per host
Servers at > 80% utilization: [X] (need attention)
Idle/underutilized servers: [X] (candidates for consolidation)
End-of-life within 12 months: [X] (need replacement budget)
Virtual Machines:
Total VMs: [X] production, [Y] non-production
Average VM specs: [X] vCPU, [Y] GB RAM
Over-provisioned VMs: [X] (CPU < 10% for 30 days)
Under-provisioned VMs: [Y] (CPU > 80% for 30 days)
Snapshot count: [X] (clean up old snapshots)
Unused VMs: [X] (powered off > 30 days)
Cloud Compute:
Total instances: [X] (AWS EC2 + Azure VMs + GCP GCE)
Average utilization: CPU [Y]%, Memory [Z]%
Reserved instances: [X]% of total (target: > 70% for stable workloads)
Spot instances: [X]% of total (for fault-tolerant workloads)
Idle instances: [X] (no network traffic for 7+ days)
Zombie instances: [X] (running but no attached EBS/network)
Compute Growth Projection:
Demand drivers:
User growth: [X]% per quarter → [Y] additional VMs/instances
Application growth: [X] new services planned → [Y] additional compute
Batch/ETL growth: [X]% data growth → [Y] additional compute
Peak factor: [X]x average (handle 95th percentile, not average)
12-month projection:
Quarter     Current VMs   Projected   Growth   Headroom   Action Required
────────    ───────────   ─────────   ──────   ────────   ───────────────────
Q1 (now)    500           500         —        15%        None
Q2          500           550         +10%     10%        Monitor
Q3          550           610         +11%     4%         ⚠️ Plan expansion
Q4          610           670         +10%     -2%        🔴 Execute expansion
Action: Procure 3 new hosts (Q2 delivery) or increase cloud budget by 20%
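A projection like the one above can be sketched as compound growth checked against a capacity ceiling. This is a minimal sketch, not the method behind the table: the function names, the 10% quarterly growth rate, and the 650-VM capacity figure are all assumptions chosen for illustration, and headroom is defined here as spare capacity divided by total capacity.

```python
def project_vms(current: int, quarterly_growth: float, quarters: int) -> list[int]:
    """Compound the VM count forward one quarter at a time."""
    counts = [current]
    for _ in range(quarters):
        counts.append(round(counts[-1] * (1 + quarterly_growth)))
    return counts


def headroom_pct(used: int, capacity: int) -> float:
    """Spare capacity as a percentage of total capacity (negative = over capacity)."""
    return (capacity - used) / capacity * 100


def first_quarter_below(counts: list[int], capacity: int, min_headroom: float):
    """Index of the first quarter whose headroom drops below min_headroom, else None."""
    for quarter, used in enumerate(counts):
        if headroom_pct(used, capacity) < min_headroom:
            return quarter
    return None


# Hypothetical inputs: 500 VMs today, 10% growth per quarter, room for 650 VMs.
counts = project_vms(500, 0.10, 3)
breach = first_quarter_below(counts, capacity=650, min_headroom=10.0)
for quarter, used in enumerate(counts):
    print(f"Q{quarter + 1}: {used} VMs, headroom {headroom_pct(used, 650):.1f}%")
```

The quarter returned by `first_quarter_below` is a reasonable trigger for starting procurement, since hardware lead time usually has to fit before headroom goes negative.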
Right-Sizing Recommendations:
Over-provisioned (downsize):
VM-001: 8 vCPU, 32 GB RAM → 4 vCPU, 16 GB RAM (CPU avg: 8%, Mem avg: 15%)
VM-002: 4 vCPU, 16 GB RAM → 2 vCPU, 8 GB RAM (CPU avg: 5%, Mem avg: 20%)
VM-003: 4 vCPU, 8 GB RAM → 2 vCPU, 4 GB RAM (CPU avg: 12%, Mem avg: 30%)
Estimated savings: 8 vCPU, 28 GB RAM → redeploy to other workloads
Under-provisioned (upsize):
VM-010: 2 vCPU, 4 GB RAM → 4 vCPU, 8 GB RAM (CPU avg: 85%, Mem avg: 90%)
VM-011: 2 vCPU, 4 GB RAM → 4 vCPU, 8 GB RAM (CPU avg: 78%, Mem avg: 85%)
Action: immediate resize to prevent performance degradation
Consolidation opportunities:
10 small VMs (1 vCPU each, < 10% CPU) → 1 medium VM (4 vCPU) with containers
Savings: 9 VMs retired; reduced management overhead
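The reclaimed capacity from a downsizing list is the sum of the per-VM differences between current and target specs. The sketch below totals the three downsizing recommendations listed above; the tuple layout and the `reclaimed` helper are illustrative assumptions, not part of any tool.

```python
# (name, current vCPU, current RAM GB, target vCPU, target RAM GB)
downsizes = [
    ("VM-001", 8, 32, 4, 16),
    ("VM-002", 4, 16, 2, 8),
    ("VM-003", 4, 8, 2, 4),
]


def reclaimed(recommendations):
    """Total vCPU and RAM (GB) freed by applying every downsize recommendation."""
    vcpu = sum(cur_v - new_v for _, cur_v, _, new_v, _ in recommendations)
    ram_gb = sum(cur_r - new_r for _, _, cur_r, _, new_r in recommendations)
    return vcpu, ram_gb


freed_vcpu, freed_ram_gb = reclaimed(downsizes)
print(f"Reclaimed: {freed_vcpu} vCPU, {freed_ram_gb} GB RAM")
```

Recomputing the total from the list each review cycle keeps the savings figure consistent with the individual recommendations.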
Storage Capacity Planning
STORAGE CAPACITY FRAMEWORK
============================
Current Storage Inventory:
On-Premises Storage:
SAN arrays: [X] arrays, [Y] TB total raw, [Z] TB usable
NAS/file servers: [X] TB total
Tape library: [X] TB (for backup archive)
Average utilization: [X]%
Growth rate: [Y] TB per month
Cloud Storage:
Block storage (EBS/Managed): [X] TB, $[Y]/month
Object storage (S3/Blob): [X] TB, $[Y]/month
File storage (EFS/DFS): [X] TB, $[Y]/month
Database storage: [X] TB
Backup storage: [X] TB
Archive storage (Glacier): [X] TB
Total cloud storage: [X] TB, $[Y]/month
Storage by tier:
Hot (frequent access): [X] TB — $[Y]/TB/month
Warm (occasional access): [X] TB — $[Y]/TB/month
Cool (rare access): [X] TB — $[Y]/TB/month
Archive (annual access): [X] TB — $[Y]/TB/month
Storage Growth Analysis:
Historical growth (last 12 months):
Month    Total TB   Growth TB   Growth %   Cost/Month   Trend
──────   ────────   ─────────   ────────   ──────────   ─────
Jan      500        +8          +1.6%      $5,000       ↑
Feb      508        +10         +2.0%      $5,080       ↑
Mar      518        +12         +2.3%      $5,180       ↑
Apr      530        +15         +2.9%      $5,300       ↑↑
...
Dec      650        +20         +3.2%      $6,500       ↑↑
Growth rate: accelerating (monthly growth doubled from +1.6% in Jan to +3.2% in Dec)
Projected 12-month: 650 TB → 950 TB (+46% growth, assuming ~3.2% per month compounded)
Alert: 80% threshold reached in [X] months
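The 12-month figure and the time-to-threshold alert can both be derived from a compound monthly growth rate. This is a minimal sketch assuming growth holds at the December rate of 3.2% per month; the 1,000 TB usable capacity and both function names are hypothetical.

```python
import math


def project_storage(current_tb: float, monthly_growth: float, months: int) -> float:
    """Storage footprint after compounding monthly_growth for the given months."""
    return current_tb * (1 + monthly_growth) ** months


def months_until_threshold(current_tb: float, capacity_tb: float,
                           monthly_growth: float, threshold: float = 0.80) -> int:
    """Whole months until usage reaches threshold * capacity (0 if already there)."""
    target_tb = capacity_tb * threshold
    if current_tb >= target_tb:
        return 0
    return math.ceil(math.log(target_tb / current_tb) / math.log(1 + monthly_growth))


projected = project_storage(650, 0.032, 12)
# Hypothetical array: 1,000 TB usable, alert at 80% utilization.
months_left = months_until_threshold(650, 1000, 0.032)
print(f"12-month projection: {projected:.0f} TB; 80% threshold in {months_left} months")
```

If growth keeps accelerating rather than holding steady, the threshold arrives sooner than this estimate, so the result is better treated as an upper bound on available planning time.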
Storage Optimization:
Data lifecycle policies:
0–90 days: Hot storage (SSD/NVMe) — active data
90–180 days: Warm storage (S3 Standard-IA) — reduced access
180–365 days: Cool storage (S3 Glacier Flexible Retrieval) — archive
365+ days: Deep archive (S3 Glacier Deep Archive) — compliance
Deduplication and compression:
Backups: 3:1 to 10:1 dedup ratio typical
VM images: 2:1 to 5:1 dedup ratio
Logs: 5:1 to 10:1 compression ratio
Estimated savings: 40–60% with proper lifecycle management
Wasted storage:
Unattached volumes: [X] TB ($[Y]/month wasted)
Old snapshots: [X] TB ($[Y]/month wasted)
Empty buckets/containers: [X] buckets
Duplicate files: [X] TB (identify with dedup tools)
Oversized disks: [X] disks with > 80% free space
Total reclaimable: [X] TB ($[Y]/month savings)
Storage alerts:
70% utilization: WARNING — begin planning expansion
80% utilization: PLANNING — submit procurement request
90% utilization: CRITICAL — immediate action required
95% utilization: EMERGENCY — emergency procurement; may impact operations
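The alert tiers above map directly onto a threshold lookup. A minimal sketch follows; the `alert_level` name and the "OK" label for utilization below 70% are assumptions added for illustration.

```python
# Thresholds from the storage alert tiers, checked from highest to lowest.
STORAGE_ALERTS = [
    (95, "EMERGENCY"),
    (90, "CRITICAL"),
    (80, "PLANNING"),
    (70, "WARNING"),
]


def alert_level(utilization_pct: float) -> str:
    """Return the highest alert tier whose threshold the utilization has reached."""
    for threshold, level in STORAGE_ALERTS:
        if utilization_pct >= threshold:
            return level
    return "OK"  # assumed label for utilization below the first threshold


for pct in (65, 72, 85, 92, 97):
    print(f"{pct}% -> {alert_level(pct)}")
```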
Network Capacity Planning
NETWORK CAPACITY FRAMEWORK
============================
WAN Capacity:
Current WAN links:
Primary ISP: [X] Gbps, average [Y]%, peak [Z]%
Secondary ISP: [X] Gbps, average [Y]%, peak [Z]%
Backup (4G/5G): [X] Mbps (emergency only)
Direct Connect: [X] Gbps (cloud interconnect)
Utilization trend:
Month    Avg Util   Peak (95th)   Growth        Status
──────   ────────   ───────────   ───────────   ────────
Jan      45%        65%           baseline      🟢
Apr      52%        72%           +2.3 pts/mo   🟡 Watch
Jul      60%        80%           +2.7 pts/mo   🟡 Plan
Oct      68%        88%           +2.7 pts/mo   🔴 Act
Projected: 80% average utilization in [X] months
Recommendation: upgrade to [Y] Gbps or add secondary [Z] Gbps link
Budget: $[X]/month for upgraded link
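One way to fill in the projected month count is a straight-line fit through the average utilization samples. The sketch below uses the four samples from the trend table and ordinary least squares; the function names are illustrative, and a linear trend is itself an assumption that may understate accelerating growth.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_until_utilization(xs, ys, target_pct):
    """Months from the last sample until the fitted line reaches target_pct."""
    slope, intercept = linear_fit(xs, ys)
    if slope <= 0:
        return None  # a flat or falling trend never reaches the target
    return max(0.0, (target_pct - intercept) / slope - xs[-1])


months = [0, 3, 6, 9]        # Jan, Apr, Jul, Oct as months since January
avg_util = [45, 52, 60, 68]  # average utilization (%) from the trend table
slope, _ = linear_fit(months, avg_util)
remaining = months_until_utilization(months, avg_util, target_pct=80)
print(f"Trend: {slope:.2f} points/month; 80% average in about {remaining:.1f} months")
```

Running the same fit on the 95th-percentile column gives an earlier date, which is usually the more relevant one for link upgrades because peak traffic saturates first.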
LAN/Datacenter Network:
Core switch capacity:
Switching capacity: [X] Tbps
Port density: [X] ports (1G/10G/25G/40G/100G)
Utilization: [X]% of switching capacity
Available ports: [X] (for new servers)
Top-of-rack (ToR) switches:
Switches: [X] × 48-port 10G/25G switches
Uplinks: [X] × 40G/100G to core
Uplink utilization: [X]% (alert > 70%)
Available ports: [X] per switch average
Network growth drivers:
New servers: [X] planned → [Y] additional switch ports needed
Bandwidth growth: [X]% per quarter (video, backups, replication)
Cloud traffic: [X] Gbps to cloud (growing [Y]% per quarter)
Wireless Capacity:
Access points: [X] APs covering [Y] sq ft
Clients per AP: Average [Z] (target: < 30 for performance)
Channel utilization: Average [X]% (target: < 70%)
Growth: [X] new clients per quarter
Upgrade plan: Wi-Fi 6/6E/7 upgrade in [X] months
Cloud Capacity Planning
CLOUD CAPACITY AND COST FORECASTING
=====================================
Current Cloud Spend:
Provider    Monthly    Annual       YoY Growth   % of Spend   Trend
─────────   ────────   ──────────   ──────────   ──────────   ─────
AWS         $45,000    $540,000     +25%         53%          ↑↑
Azure       $25,000    $300,000     +15%         29%          ↑
GCP         $10,000    $120,000     +40%         12%          ↑↑↑
Other       $5,000     $60,000      +10%         6%           →
────────────────────────────────────────────────────────────────────────
Total       $85,000    $1,020,000   +22%         100%         ↑
Cost by service category:
Category              Monthly    % of Total   Growth Rate   Optimization
───────────────────   ────────   ──────────   ───────────   ─────────────────────
Compute (EC2/VMs)     $30,000    35%          +20%          Right-size, RI
Database (RDS/etc)    $15,000    18%          +25%          Read replicas
Storage (S3/Blob)     $10,000    12%          +30%          Lifecycle policies
Network (data xfer)   $8,000     9%           +35%          VPC endpoints
Container (EKS/etc)   $7,000     8%           +40%          Auto-scaling
Managed Services      $8,000     9%           +15%          Evaluate build vs buy
Other                 $7,000     8%           +10%          Review quarterly
(Percentages are rounded and sum to 99%.)
Capacity projection (12 months):
Quarter   Compute   Storage   Network   Total   Headroom (vs. cap)   Action
───────   ───────   ───────   ───────   ─────   ──────────────────   ──────
Q1        $30K      $10K      $8K       $48K    26%                  —
Q2        $33K      $11K      $9K       $53K    18%                  —
Q3        $36K      $12K      $10K      $58K    11%                  ⚠️
Q4        $40K      $14K      $12K      $66K    -2%                  🔴
Budget cap: $65K/month (current) for compute, storage, and network
Gap: Q4 projected at $66K, $1K over the budget cap
Action: implement cost optimization initiatives by Q2
Potential savings: $10K–$15K/month with RI + right-sizing + lifecycle
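Summing the component columns per quarter and comparing the total against the cap is a quick consistency check on a projection like the one above. In this sketch the component figures come from the projection table; headroom is defined as (cap - total) / cap, which is an assumption about how the column is meant to be read, and `headroom_report` is a hypothetical helper.

```python
BUDGET_CAP_K = 65  # monthly budget cap in $K

# Quarter -> (compute, storage, network) projected monthly spend in $K
projection_k = {
    "Q1": (30, 10, 8),
    "Q2": (33, 11, 9),
    "Q3": (36, 12, 10),
    "Q4": (40, 14, 12),
}


def headroom_report(projection, cap_k):
    """Per-quarter total spend and remaining headroom as a percent of the cap."""
    rows = []
    for quarter, components in projection.items():
        total = sum(components)
        headroom = (cap_k - total) / cap_k * 100
        rows.append((quarter, total, round(headroom, 1)))
    return rows


report = headroom_report(projection_k, BUDGET_CAP_K)
for quarter, total, headroom in report:
    flag = "over cap" if headroom < 0 else "ok"
    print(f"{quarter}: ${total}K, headroom {headroom}% ({flag})")
```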
Cloud capacity optimization:
Reserved Instances / Savings Plans:
Current coverage: [X]% of eligible spend
Target coverage: > 75% for stable workloads
Savings potential: $[X]/month (15–30% discount)
Implementation: commit to 1-year or 3-year terms
Auto-scaling:
Current: [X] instances running 24/7
With auto-scaling: [Y] instances during off-hours
Savings: [X-Y] instances × $[Z]/month
Right-sizing:
Over-provisioned: [X] instances (CPU < 20% for 30 days)
Right-size candidates: [X] instances → smaller types
Savings: $[Y]/month
Spot instances:
Eligible workloads: [X]% (batch, CI/CD, testing, stateless)
Savings: 60–90% vs. on-demand
Risk: instance interruption (handle gracefully)
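The auto-scaling savings line above multiplies the instance reduction by the monthly instance cost. A slightly more careful sketch scales that by the share of the month spent in off-hours; the `off_hours_fraction` parameter, the function name, and the sample figures are assumptions for illustration.

```python
def autoscaling_savings(always_on: int, off_hours: int, monthly_cost: float,
                        off_hours_fraction: float = 1.0) -> float:
    """Monthly savings from running fewer instances during off-hours.

    off_hours_fraction is the share of the month the reduced fleet runs
    (1.0 reproduces the simple (X - Y) * Z formula).
    """
    if off_hours > always_on:
        raise ValueError("off-hours count cannot exceed the always-on count")
    if not 0.0 <= off_hours_fraction <= 1.0:
        raise ValueError("off_hours_fraction must be between 0 and 1")
    return (always_on - off_hours) * monthly_cost * off_hours_fraction


# Hypothetical fleet: 20 instances at $150/month, scaled to 8 for half of each month.
simple = autoscaling_savings(20, 8, 150.0)
weighted = autoscaling_savings(20, 8, 150.0, off_hours_fraction=0.5)
print(f"Upper bound: ${simple:,.0f}/month; time-weighted: ${weighted:,.0f}/month")
```

The unweighted figure overstates savings whenever the reduced fleet runs only part of the time, so the time-weighted value is the safer number to put in a budget.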
Integration Points
- Datadog / New Relic: Real-time infrastructure monitoring; utilization trends; forecasting; anomaly detection
- CloudHealth / Cloudability: Cloud cost management; rightsizing recommendations; reserved instance optimization; anomaly detection
- ServiceNow Capacity Management: Integrated with CMDB; trend analysis; threshold management; forecasting
- VMware vRealize Operations (now VMware Aria Operations): vSphere capacity planning; predictive analytics; right-sizing; what-if scenarios
- Azure Advisor / AWS Compute Optimizer: Cloud-native right-sizing recommendations; utilization analysis
- SolarWinds Storage Resource Monitor: Storage capacity tracking; growth forecasting; threshold alerting
- NetBox / Device42: Infrastructure inventory; capacity tracking; documentation
- Custom dashboards (Grafana, Power BI): Capacity visualization; trend analysis; executive reporting
Edge Cases
- Seasonal capacity spikes (e-commerce holidays, quarterly reporting, tax season): Plan for peak capacity (not average); implement auto-scaling for cloud; maintain standby capacity for on-prem; pre-provision 2–4 weeks before peak; decommission/release after peak; monitor and right-size during peak
  - E-commerce: Black Friday/Cyber Monday = 3–10x normal traffic; pre-scale 2 weeks before; maintain for 1 week after
  - Financial: month-end close = 5–10x batch processing; schedule batch jobs to distribute load
  - Tax season: 3–5x normal for financial systems; additional capacity February–April
- Rapid growth / scaling startup (10x user growth in 12 months): Cloud-first architecture for elasticity; auto-scaling from day one; capacity planning monthly (not quarterly); budget with 50–100% growth buffer; implement FinOps practices early; hire DevOps/SRE for infrastructure automation
  - Architecture: microservices, containers, auto-scaling, serverless where possible
  - Monitoring: real-time capacity dashboards; automated scaling policies; cost anomaly detection
  - Budget: 3x projected spend as budget ceiling; monthly budget review
- Capacity constraints with budget freeze (business won't approve new spending): Optimize existing resources aggressively: consolidate servers, implement deduplication, right-size all resources, eliminate waste, negotiate cloud discounts; prioritize critical workloads; defer non-essential projects; implement strict chargeback/showback
  - Quick wins: terminate idle resources (save 10–20% cloud spend), consolidate underutilized servers (save 20–30%), implement storage lifecycle (save 30–50% storage)
  - Communication: show business impact of capacity shortage (revenue at risk, SLA breach)
- Multi-cloud capacity (workloads across AWS, Azure, GCP): Unified capacity dashboard across providers; consistent monitoring tool; cross-cloud load balancing; avoid vendor lock-in for capacity flexibility; negotiate enterprise agreements with each provider; consolidate billing and reporting
  - Tooling: CloudHealth, Cloudability, or custom Terraform-based dashboard
  - Strategy: primary cloud for core workloads; secondary for disaster recovery and cost optimization
  - Cost: multi-cloud adds 10–20% management overhead; ensure business justification
- Edge/IoT capacity (thousands of edge devices): Lightweight monitoring agents; edge aggregation (reduce cloud data transfer); predictive maintenance (replace before failure); over-the-air update management; device lifecycle automation; battery life consideration for monitoring frequency
  - Architecture: one edge gateway per 100 devices; gateway handles local processing and aggregation
  - Monitoring: heartbeat every 5–60 minutes; event-driven reporting for anomalies
  - Capacity: plan for 10–20% device replacement rate per year; maintain spare inventory
- Disaster recovery capacity (standby capacity for failover): Pilot light (minimal capacity, scale on failover) vs. warm standby (partial capacity) vs. hot standby (full capacity); cost tradeoff: hot standby = 50–100% of production cost; pilot light = 10–20%
  - RTO/RPO alignment: shorter RTO requires more standby capacity
  - Testing: quarterly failover test (validate capacity and procedures)
  - Cloud advantage: cloud DR significantly cheaper than on-prem DR (pay for what you use)
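The standby cost ranges in the disaster recovery item above can be turned into a quick budget estimate. This sketch covers only the two tiers the text gives figures for (pilot light at 10–20% and hot standby at 50–100% of production cost); the function name, tier keys, and the $100,000 production figure are hypothetical.

```python
# Standby cost as a percentage range of production cost, from the DR item above.
DR_COST_PCT = {
    "pilot_light": (10, 20),
    "hot_standby": (50, 100),
}


def standby_cost_range(production_monthly_cost: float, tier: str):
    """Low and high monthly standby cost estimates for the given DR tier."""
    if tier not in DR_COST_PCT:
        raise ValueError(f"no cost range available for tier: {tier}")
    low_pct, high_pct = DR_COST_PCT[tier]
    return (production_monthly_cost * low_pct / 100,
            production_monthly_cost * high_pct / 100)


# Hypothetical production environment costing $100,000 per month.
for tier in DR_COST_PCT:
    low, high = standby_cost_range(100_000, tier)
    print(f"{tier}: ${low:,.0f} to ${high:,.0f} per month")
```

Comparing the two ranges against the cost of downtime at the target RTO is one way to justify the tier choice to the business.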