---
name: capacity-planning-infrastructure
description: Plan and manage IT infrastructure capacity including compute, storage, network, and cloud resources. Use when forecasting infrastructure needs, right-sizing resources, planning data center expansion, optimizing cloud capacity, conducting capacity reviews, or preventing resource exhaustion. Triggers on phrases like "capacity planning", "infrastructure capacity", "resource forecasting", "right-sizing", "capacity management", "growth planning", "infrastructure scaling", "resource optimization", "capacity threshold", "infrastructure roadmap".
---

# IT Infrastructure Capacity Planning

Forecast, plan, and optimize IT infrastructure capacity to support current and future business needs.

## Workflow

1. Establish capacity baselines: current utilization for compute, storage, network, and applications.
2. Collect historical data: 12-month utilization trends, growth rates, seasonal patterns.
3. Forecast demand: business-driven (user growth, transaction volume) and technology-driven (new initiatives).
4. Analyze capacity gaps: compare projected demand against current and planned capacity.
5. Develop capacity plan: procurement timeline, budget, implementation schedule, risk mitigation.
6. Implement right-sizing: optimize current resources before expanding (eliminate waste first).
7. Monitor capacity thresholds: automated alerts at 70%, 80%, 90% utilization.
8. Conduct quarterly capacity reviews: update forecasts, validate assumptions, adjust plans.
9. Report capacity posture: executive dashboard, procurement recommendations, risk assessment.
10. Execute capacity improvements: procurement, deployment, configuration, validation.
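
Step 7 can be sketched as a small threshold classifier. This is a minimal illustration, assuming utilization is available as used/capacity pairs; the level names (`warning`, `high`, `critical`) and the `Resource` type are hypothetical, and only the 70/80/90% thresholds come from the workflow.

```python
from dataclasses import dataclass

# Thresholds from step 7 of the workflow; the level names are assumptions.
THRESHOLDS = [(90.0, "critical"), (80.0, "high"), (70.0, "warning")]


def alert_level(utilization_pct: float) -> str:
    """Return the highest alert level whose threshold has been reached."""
    for threshold, level in THRESHOLDS:
        if utilization_pct >= threshold:
            return level
    return "ok"


@dataclass
class Resource:
    """Hypothetical record for one monitored resource."""
    name: str
    used: float
    capacity: float

    @property
    def utilization_pct(self) -> float:
        return 100.0 * self.used / self.capacity


def capacity_alerts(resources):
    """Map each resource name to its alert level, skipping healthy resources."""
    alerts = {}
    for resource in resources:
        level = alert_level(resource.utilization_pct)
        if level != "ok":
            alerts[resource.name] = level
    return alerts
```

In practice the same thresholds would normally live in the monitoring tool's alert rules; the sketch only shows the classification logic.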

## Compute Capacity Planning

```
COMPUTE CAPACITY FRAMEWORK
============================

Current State Assessment:

  On-Premises Servers:
    Total physical servers:       [X]
    Average utilization:           CPU [Y]%, Memory [Z]%
    Server density:               [X] VMs per physical host (average)
    Host capacity headroom:       [X] additional VMs per host
    Servers at > 80% utilization:  [X] (need attention)
    Idle/underutilized servers:    [X] (candidates for consolidation)
    End-of-life within 12 months:  [X] (need replacement budget)

  Virtual Machines:
    Total VMs:                     [X] production, [Y] non-production
    Average VM specs:              [X] vCPU, [Y] GB RAM
    Over-provisioned VMs:          [X] (CPU < 10% for 30 days)
    Under-provisioned VMs:         [Y] (CPU > 80% for 30 days)
    Snapshot count:                [X] (clean up old snapshots)
    Unused VMs:                    [X] (powered off > 30 days)

  Cloud Compute:
    Total instances:               [X] (AWS EC2 + Azure VMs + GCP GCE)
    Average utilization:           CPU [Y]%, Memory [Z]%
    Reserved instances:            [X]% of total (target: > 70% for stable workloads)
    Spot instances:                [X]% of total (for fault-tolerant workloads)
    Idle instances:                [X] (no network traffic for 7+ days)
    Zombie instances:              [X] (running but no attached EBS/network)

Compute Growth Projection:

  Demand drivers:
    User growth:                   [X]% per quarter → [Y] additional VMs/instances
    Application growth:            [X] new services planned → [Y] additional compute
    Batch/ETL growth:              [X]% data growth → [Y] additional compute
    Peak factor:                   [X]x average (handle 95th percentile, not average)

  12-month projection:

    Quarter    Current VMs    Projected    Growth    Headroom    Action Required
    ────────   ───────────    ─────────   ────────  ──────────  ───────────────────
    Q1 (now)   500            500         —         15%         None
    Q2         500            550         +10%      10%         Monitor
    Q3         550            610         +11%      4%          ⚠️ Plan expansion
    Q4         610            670         +10%      -2%         🔴 Execute expansion

    Action: Procure 3 new hosts (Q2 delivery) or increase cloud budget by 20%

Right-Sizing Recommendations:

  Over-provisioned (downsize):
    VM-001: 8 vCPU, 32 GB RAM → 4 vCPU, 16 GB RAM (CPU avg: 8%, Mem avg: 15%)
    VM-002: 4 vCPU, 16 GB RAM → 2 vCPU, 8 GB RAM (CPU avg: 5%, Mem avg: 20%)
    VM-003: 4 vCPU, 8 GB RAM → 2 vCPU, 4 GB RAM (CPU avg: 12%, Mem avg: 30%)
    Estimated savings: 8 vCPU, 28 GB RAM → redeploy to other workloads

  Under-provisioned (upsize):
    VM-010: 2 vCPU, 4 GB RAM → 4 vCPU, 8 GB RAM (CPU avg: 85%, Mem avg: 90%)
    VM-011: 2 vCPU, 4 GB RAM → 4 vCPU, 8 GB RAM (CPU avg: 78%, Mem avg: 85%)
    Action: immediate resize to prevent performance degradation

  Consolidation opportunities:
    10 small VMs (1 vCPU each, < 10% CPU) → 1 medium VM (4 vCPU) with containers
    Savings: 9 VMs retired, 6 vCPU reclaimed; reduced management overhead
```
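
The right-sizing arithmetic in the template can be checked with a short script. This is a sketch under stated assumptions: the CPU thresholds are the 30-day values from the assessment section, the `VmSize` type and function names are hypothetical, and a real review would also look at memory, I/O, and peak behaviour before resizing.

```python
from dataclasses import dataclass

# Thresholds taken from the assessment template above (30-day CPU averages).
OVER_PROVISIONED_CPU_PCT = 10.0
UNDER_PROVISIONED_CPU_PCT = 80.0


@dataclass
class VmSize:
    """Hypothetical VM size: vCPU count and RAM in GB."""
    vcpu: int
    ram_gb: int


def classify(cpu_avg_pct: float) -> str:
    """Classify a VM by its 30-day average CPU utilization."""
    if cpu_avg_pct < OVER_PROVISIONED_CPU_PCT:
        return "over-provisioned"
    if cpu_avg_pct > UNDER_PROVISIONED_CPU_PCT:
        return "under-provisioned"
    return "ok"


def reclaimed(resizes):
    """Sum the vCPU and RAM freed by a list of (current, target) resizes."""
    vcpu = sum(current.vcpu - target.vcpu for current, target in resizes)
    ram_gb = sum(current.ram_gb - target.ram_gb for current, target in resizes)
    return vcpu, ram_gb


# The three downsizing examples listed above (VM-001 to VM-003).
downsizes = [
    (VmSize(8, 32), VmSize(4, 16)),
    (VmSize(4, 16), VmSize(2, 8)),
    (VmSize(4, 8), VmSize(2, 4)),
]
freed_vcpu, freed_ram_gb = reclaimed(downsizes)
print(f"Reclaimed: {freed_vcpu} vCPU, {freed_ram_gb} GB RAM")
```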

## Storage Capacity Planning

```
STORAGE CAPACITY FRAMEWORK
============================

Current Storage Inventory:

  On-Premises Storage:
    SAN arrays:                    [X] arrays, [Y] TB total raw, [Z] TB usable
    NAS/file servers:              [X] TB total
    Tape library:                  [X] TB (for backup archive)
    Average utilization:           [X]%
    Growth rate:                   [Y] TB per month

  Cloud Storage:
    Block storage (EBS/Managed):   [X] TB, $[Y]/month
    Object storage (S3/Blob):      [X] TB, $[Y]/month
    File storage (EFS/Azure Files): [X] TB, $[Y]/month
    Database storage:              [X] TB
    Backup storage:                [X] TB
    Archive storage (Glacier):     [X] TB
    Total cloud storage:           [X] TB, $[Y]/month

  Storage by tier:
    Hot (frequent access):         [X] TB — $[Y]/TB/month
    Warm (occasional access):      [X] TB — $[Y]/TB/month
    Cool (rare access):            [X] TB — $[Y]/TB/month
    Archive (annual access):       [X] TB — $[Y]/TB/month

Storage Growth Analysis:

  Historical growth (last 12 months):
    Month    Total TB    Growth TB    Growth %    Cost/Month    Trend
    ──────   ──────────  ──────────   ─────────   ────────────  ─────
    Jan      500         +8           +1.6%       $5,000        ↑
    Feb      508         +10          +2.0%       $5,080        ↑
    Mar      518         +12          +2.3%       $5,180        ↑
    Apr      530         +15          +2.8%       $5,300        ↑↑
    ...
    Dec      650         +20          +3.1%       $6,500        ↑↑

  Growth rate: accelerating (monthly growth roughly doubled over the year, from +1.6% to over +3%)
  Projected 12-month: 650 TB → 950 TB (+46% growth)
  Alert: 80% threshold reached in [X] months

Storage Optimization:

  Data lifecycle policies:
    0–90 days:   Hot storage (SSD/NVMe) — active data
    90–180 days: Warm storage (S3 Standard-IA) — reduced access
    180–365 days: Cool storage (S3 Glacier Flexible Retrieval) — archive
    365+ days:   Deep archive (S3 Glacier Deep Archive) — compliance

  Deduplication and compression:
    Backups: 3:1 to 10:1 dedup ratio typical
    VM images: 2:1 to 5:1 dedup ratio
    Logs: 5:1 to 10:1 compression ratio
    Estimated savings: 40–60% with proper lifecycle management

  Wasted storage:
    Unattached volumes:           [X] TB ($[Y]/month wasted)
    Old snapshots:                [X] TB ($[Y]/month wasted)
    Empty buckets/containers:     [X] buckets
    Duplicate files:              [X] TB (identify with dedup tools)
    Oversized disks:              [X] disks > 80% free space
    Total reclaimable:            [X] TB ($[Y]/month savings)

  Storage alerts:
    70% utilization:  WARNING — begin planning expansion
    80% utilization:  PLANNING — submit procurement request
    90% utilization:  CRITICAL — immediate action required
    95% utilization:  EMERGENCY — emergency procurement; may impact operations
```
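
The "80% threshold reached in [X] months" line can be estimated from the current usage, the usable capacity, and a monthly growth rate. The sketch below assumes compound growth at a constant rate, which understates the risk when growth is accelerating; the function name and the 900 TB capacity in the example are hypothetical.

```python
import math


def months_until_threshold(used_tb, capacity_tb, monthly_growth, threshold=0.80):
    """Whole months until usage reaches threshold * capacity.

    Assumes compound growth at a constant monthly rate (0.03 means 3%).
    Returns 0 when usage is already at or past the threshold.
    """
    if monthly_growth <= 0:
        raise ValueError("monthly_growth must be positive")
    target_tb = threshold * capacity_tb
    if used_tb >= target_tb:
        return 0
    months = math.log(target_tb / used_tb) / math.log(1 + monthly_growth)
    return math.ceil(months)


# Example: 650 TB used, 900 TB usable (hypothetical), 3% growth per month.
print(months_until_threshold(650, 900, 0.03))
```

Comparing the result with procurement lead time shows whether the request has to go out now or can wait for the next quarterly review.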

## Network Capacity Planning

```
NETWORK CAPACITY FRAMEWORK
============================

WAN Capacity:

  Current WAN links:
    Primary ISP:         [X] Gbps, average [Y]%, peak [Z]%
    Secondary ISP:       [X] Gbps, average [Y]%, peak [Z]%
    Backup (4G/5G):      [X] Mbps (emergency only)
    Direct Connect:      [X] Gbps (cloud interconnect)

  Utilization trend:
    Month    Avg Util    Peak (95th)   Growth     Status
    ──────   ──────────  ───────────   ─────────  ────────
    Jan      45%         65%           baseline   🟢
    Apr      52%         72%           +2.3%/mo   🟡 Watch
    Jul      60%         80%           +2.7%/mo   🟡 Plan
    Oct      68%         88%           +2.7%/mo   🔴 Act

  Projected: 80% average utilization in [X] months
  Recommendation: upgrade to [Y] Gbps or add an additional [Z] Gbps link
  Budget: $[X]/month for upgraded link

LAN/Datacenter Network:

  Core switch capacity:
    Switching capacity:   [X] Tbps
    Port density:         [X] ports (1G/10G/25G/40G/100G)
    Utilization:          [X]% of switching capacity
    Available ports:      [X] (for new servers)

  Top-of-rack (ToR) switches:
    Switches:             [X] × 48-port 10G/25G switches
    Uplinks:              [X] × 40G/100G to core
    Uplink utilization:   [X]% (alert > 70%)
    Available ports:      [X] per switch average

  Network growth drivers:
    New servers:          [X] planned → [Y] additional switch ports needed
    Bandwidth growth:     [X]% per quarter (video, backups, replication)
    Cloud traffic:        [X] Gbps to cloud (growing [Y]% per quarter)

Wireless Capacity:

  Access points:          [X] APs covering [Y] sq ft
  Clients per AP:         Average [Z] (target: < 30 for performance)
  Channel utilization:    Average [X]% (target: < 70%)
  Growth:                 [X] new clients per quarter
  Upgrade plan:           Wi-Fi 6/6E/7 upgrade in [X] months
```
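
The "80% average utilization in [X] months" projection can be approximated by fitting a straight line to the utilization trend. This is a minimal sketch using ordinary least squares on the four sample points from the WAN table; the function names are hypothetical, and a linear fit is only one reasonable choice for a trend this short.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_to_reach(xs, ys, target_pct):
    """Months after the last sample until the fitted line reaches target_pct.

    Returns None when the trend is flat or falling.
    """
    slope, intercept = linear_fit(xs, ys)
    if slope <= 0:
        return None
    return max(0.0, (target_pct - intercept) / slope - xs[-1])


# Average utilization from the trend table above; x is months since January.
months = [0, 3, 6, 9]
avg_util = [45, 52, 60, 68]
slope, _ = linear_fit(months, avg_util)
print(f"trend: {slope:.2f} points/month")
print(f"80% average in about {months_to_reach(months, avg_util, 80):.1f} months")
```

Running the same fit on the 95th-percentile column gives an earlier date, which is usually the one that matters for link upgrades.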

## Cloud Capacity Planning

```
CLOUD CAPACITY AND COST FORECASTING
=====================================

Current Cloud Spend:

  Provider    Monthly     Annual    YoY Growth    % of Budget    Trend
  ─────────   ──────────  ────────  ────────────  ─────────────  ─────
  AWS         $45,000     $540,000  +25%          53%            ↑↑
  Azure       $25,000     $300,000  +15%          29%            ↑
  GCP         $10,000     $120,000  +40%          12%            ↑↑↑
  Other       $5,000      $60,000   +10%          6%             →
  ────────────────────────────────────────────────────────────────────────
  Total       $85,000     $1,020,000 +22%         100%           ↑

Cost by service category:

  Category            Monthly     % of Total    Growth Rate    Optimization
  ──────────────────  ──────────  ────────────  ────────────   ──────────────
  Compute (EC2/VMs)   $30,000     35%           +20%           Right-size, RI
  Database (RDS/etc)  $15,000     18%           +25%           Read replicas
  Storage (S3/Blob)   $10,000     12%           +30%           Lifecycle policies
  Network (data xfer) $8,000      9%            +35%           VPC endpoints
  Container (EKS/etc) $7,000      8%            +40%           Auto-scaling
  Managed Services    $8,000      9%            +15%           Evaluate build vs buy
  Other               $7,000      8%            +10%           Review quarterly

Capacity projection (12 months):

  Quarter    Compute    Storage    Network    Total      Headroom    Action
  ────────   ─────────  ─────────  ─────────  ─────────  ──────────  ────────
  Q1         $30K       $10K       $8K        $48K       26%         —
  Q2         $33K       $11K       $9K        $53K       18%         —
  Q3         $36K       $12K       $10K       $58K       11%         ⚠️
  Q4         $40K       $14K       $12K       $66K       -2%         🔴

  Budget cap: $65K/month (current), covering the three categories above
  Gap: Q4 projected at $66K, $1K over the budget cap
  Action: implement cost optimization initiatives by Q2
  Potential savings: $10K–$15K/month with RI + right-sizing + lifecycle

Cloud capacity optimization:

  Reserved Instances / Savings Plans:
    Current coverage:    [X]% of eligible spend
    Target coverage:     > 75% for stable workloads
    Savings potential:   $[X]/month (15–30% discount)
    Implementation:      commit to 1-year or 3-year terms

  Auto-scaling:
    Current:             [X] instances running 24/7
    With auto-scaling:   [Y] instances during off-hours
    Savings:             [X-Y] instances × $[Z]/month

  Right-sizing:
    Over-provisioned:    [X] instances (CPU < 20% for 30 days)
    Right-size candidates: [X] instances → smaller types
    Savings:             $[Y]/month

  Spot instances:
    Eligible workloads:  [X]% (batch, CI/CD, testing, stateless)
    Savings:             60–90% vs. on-demand
    Risk:                instance interruption (handle gracefully)
```
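
The quarterly projection can be recomputed from its components so that totals and headroom stay consistent as the inputs change. This sketch assumes headroom is measured against the $65K monthly budget cap and that the cap applies to compute, storage, and network combined; the function names are hypothetical.

```python
BUDGET_CAP_K = 65  # monthly cap in $K, from the projection above

# Projected monthly spend in $K per quarter: (compute, storage, network).
projection = {
    "Q1": (30, 10, 8),
    "Q2": (33, 11, 9),
    "Q3": (36, 12, 10),
    "Q4": (40, 14, 12),
}


def headroom_pct(total_k, cap_k=BUDGET_CAP_K):
    """Remaining budget as a percentage of the cap; negative when over budget."""
    return 100.0 * (cap_k - total_k) / cap_k


def budget_report(projection, cap_k=BUDGET_CAP_K):
    """Return (quarter, total $K, headroom %) rows in projection order."""
    return [
        (quarter, sum(parts), round(headroom_pct(sum(parts), cap_k), 1))
        for quarter, parts in projection.items()
    ]


def first_breach(projection, cap_k=BUDGET_CAP_K):
    """First quarter whose total exceeds the cap, or None if none does."""
    for quarter, parts in projection.items():
        if sum(parts) > cap_k:
            return quarter
    return None


for quarter, total_k, headroom in budget_report(projection):
    print(f"{quarter}: ${total_k}K, headroom {headroom}%")
print("First breach:", first_breach(projection))
```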

## Integration Points

- **Datadog / New Relic**: Real-time infrastructure monitoring; utilization trends; forecasting; anomaly detection
- **CloudHealth / Cloudability**: Cloud cost management; rightsizing recommendations; reserved instance optimization; anomaly detection
- **ServiceNow Capacity Management**: Integrated with CMDB; trend analysis; threshold management; forecasting
- **VMware vRealize Operations** (now VMware Aria Operations): vSphere capacity planning; predictive analytics; right-sizing; what-if scenarios
- **Azure Advisor / AWS Compute Optimizer**: Cloud-native right-sizing recommendations; utilization analysis
- **SolarWinds Storage Resource Monitor**: Storage capacity tracking; growth forecasting; threshold alerting
- **NetBox / Device42**: Infrastructure inventory; capacity tracking; documentation
- **Custom dashboards** (Grafana, Power BI): Capacity visualization; trend analysis; executive reporting

## Edge Cases

- **Seasonal capacity spikes** (e-commerce holidays, quarterly reporting, tax season): Plan for peak capacity (not average); implement auto-scaling for cloud; maintain standby capacity for on-prem; pre-provision 2–4 weeks before peak; monitor and right-size during peak; decommission/release after peak
  - E-commerce: Black Friday/Cyber Monday = 3–10x normal traffic; pre-scale 2 weeks before; maintain for 1 week after
  - Financial: month-end close = 5–10x batch processing; schedule batch jobs to distribute load
  - Tax season: 3–5x normal for financial systems; additional capacity February–April

- **Rapid growth / scaling startup** (10x user growth in 12 months): Cloud-first architecture for elasticity; auto-scaling from day one; capacity planning monthly (not quarterly); budget with 50–100% growth buffer; implement FinOps practices early; hire DevOps/SRE for infrastructure automation
  - Architecture: microservices, containers, auto-scaling, serverless where possible
  - Monitoring: real-time capacity dashboards; automated scaling policies; cost anomaly detection
  - Budget: 3x projected spend as budget ceiling; monthly budget review

- **Capacity constraints with budget freeze** (business won't approve new spending): Optimize existing resources aggressively: consolidate servers, implement deduplication, right-size all resources, eliminate waste, negotiate cloud discounts; prioritize critical workloads; defer non-essential projects; implement strict chargeback/showback
  - Quick wins: terminate idle resources (save 10–20% cloud spend), consolidate underutilized servers (save 20–30%), implement storage lifecycle (save 30–50% storage)
  - Communication: show business impact of capacity shortage (revenue at risk, SLA breach)

- **Multi-cloud capacity** (workloads across AWS, Azure, GCP): Unified capacity dashboard across providers; consistent monitoring tool; cross-cloud load balancing; avoid vendor lock-in for capacity flexibility; negotiate enterprise agreements with each provider; consolidate billing and reporting
  - Tooling: CloudHealth, Cloudability, or a custom dashboard (Grafana, Power BI)
  - Strategy: primary cloud for core workloads; secondary for disaster recovery and cost optimization
  - Cost: multi-cloud adds 10–20% management overhead; ensure business justification

- **Edge/IoT capacity** (thousands of edge devices): Lightweight monitoring agents; edge aggregation (reduce cloud data transfer); predictive maintenance (replace before failure); over-the-air update management; device lifecycle automation; battery life consideration for monitoring frequency
  - Architecture: edge gateway per 100 devices; gateway handles local processing and aggregation
  - Monitoring: heartbeat every 5–60 minutes; event-driven reporting for anomalies
  - Capacity: plan for 10–20% device replacement rate per year; maintain spare inventory

- **Disaster recovery capacity** (standby capacity for failover): Pilot light (minimal capacity, scale on failover) vs. warm standby (partial capacity) vs. hot standby (full capacity); cost tradeoff: hot standby = 50–100% of production cost; pilot light = 10–20%
  - RTO/RPO alignment: shorter RTO requires more standby capacity
  - Testing: quarterly failover test (validate capacity and procedures)
  - Cloud advantage: cloud DR significantly cheaper than on-prem DR (pay for what you use)
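
For the edge/IoT case above, the sizing rules (one gateway per 100 devices, 10–20% annual replacement, heartbeats every 5–60 minutes) translate into a quick estimate. This is a rough sketch, not a sizing tool; the function name and the returned keys are hypothetical, and spare inventory is assumed to track the replacement range.

```python
import math

DEVICES_PER_GATEWAY = 100        # one edge gateway per 100 devices
REPLACEMENT_PCT = (10, 20)       # annual device replacement rate, percent
HEARTBEAT_MINUTES = (5, 60)      # shortest and longest heartbeat interval


def edge_plan(device_count):
    """Estimate gateways, yearly replacements, and daily heartbeat volume."""
    gateways = math.ceil(device_count / DEVICES_PER_GATEWAY)
    # Integer ceiling division avoids floating-point rounding surprises.
    replacements = tuple((device_count * pct + 99) // 100 for pct in REPLACEMENT_PCT)
    fastest, slowest = HEARTBEAT_MINUTES
    heartbeats_per_day = (
        device_count * (24 * 60 // slowest),  # lower bound: 60-minute interval
        device_count * (24 * 60 // fastest),  # upper bound: 5-minute interval
    )
    return {
        "gateways": gateways,
        "replacements_per_year": replacements,
        "heartbeats_per_day": heartbeats_per_day,
    }


print(edge_plan(2500))
```

The heartbeat range is a useful input for sizing the ingestion pipeline and for estimating cloud data transfer before edge aggregation is applied.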
