IT AI Skill

Capacity Planning Infrastructure

Plan and manage IT infrastructure capacity including compute, storage, network, and cloud resources. Use when forecasting infrastructure needs, right-sizing resources, planning data center expansion, optimizing cloud capacity, conducting capacity reviews, or preventing resource exhaustion. Triggers on phrases like "capacity planning", "infrastructure capacity", "resource forecasting", "right-sizing", "capacity management", "growth planning", "infrastructure scaling", "resource optimization", "capacity threshold", "infrastructure roadmap".

IT Infrastructure Capacity Planning

Forecast, plan, and optimize IT infrastructure capacity to support current and future business needs.

Workflow

Establish capacity baselines: current utilization for compute, storage, network, and applications.
Collect historical data: 12-month utilization trends, growth rates, seasonal patterns.
Forecast demand: business-driven (user growth, transaction volume) and technology-driven (new initiatives).
Analyze capacity gaps: compare projected demand against current and planned capacity.
Develop capacity plan: procurement timeline, budget, implementation schedule, risk mitigation.
Implement right-sizing: optimize current resources before expanding (eliminate waste first).
Monitor capacity thresholds: automated alerts at 70%, 80%, 90% utilization.
Conduct quarterly capacity reviews: update forecasts, validate assumptions, adjust plans.
Report capacity posture: executive dashboard, procurement recommendations, risk assessment.
Execute capacity improvements: procurement, deployment, configuration, validation.
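
The forecasting and gap-analysis steps above reduce to a small calculation. The sketch below is one possible way to express it, not a prescribed method: the `project_gaps` function, the constant quarterly growth rate, the fixed capacity, and the sample figures are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class QuarterGap:
    quarter: int
    demand: float
    capacity: float

    @property
    def headroom_pct(self) -> float:
        # Headroom as a share of capacity; a negative value is a shortfall.
        return (self.capacity - self.demand) / self.capacity * 100


def project_gaps(current_demand, capacity, quarterly_growth, quarters=4):
    """Compound demand forward and compare it with a fixed capacity.

    Assumes a constant growth rate per quarter and no capacity
    additions; both are simplifications for illustration.
    """
    gaps = []
    demand = current_demand
    for q in range(1, quarters + 1):
        gaps.append(QuarterGap(q, demand, capacity))
        demand *= 1 + quarterly_growth
    return gaps


if __name__ == "__main__":
    # Hypothetical inputs: 500 units in use, 600 available, 10% growth per quarter.
    for gap in project_gaps(500, 600, 0.10):
        flag = "shortfall" if gap.headroom_pct < 0 else "ok"
        print(f"Q{gap.quarter}: demand={gap.demand:.0f} "
              f"headroom={gap.headroom_pct:.1f}% {flag}")
```

The first quarter with negative headroom marks the latest point by which new capacity must be in service, which in turn sets the procurement lead time.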

Compute Capacity Planning

COMPUTE CAPACITY FRAMEWORK
============================

Current State Assessment:

  On-Premises Servers:
    Total physical servers:       [X]
    Average utilization:           CPU [Y]%, Memory [Z]%
    Server density:               [X] VMs per physical host (average)
    Host capacity headroom:       [X] additional VMs per host
    Servers at > 80% utilization:  [X] (need attention)
    Idle/underutilized servers:    [X] (candidates for consolidation)
    End-of-life within 12 months:  [X] (need replacement budget)

  Virtual Machines:
    Total VMs:                     [X] production, [Y] non-production
    Average VM specs:              [X] vCPU, [Y] GB RAM
    Over-provisioned VMs:          [X] (CPU < 10% for 30 days)
    Under-provisioned VMs:         [Y] (CPU > 80% for 30 days)
    Snapshot count:                [X] (clean up old snapshots)
    Unused VMs:                    [X] (powered off > 30 days)

  Cloud Compute:
    Total instances:               [X] (AWS EC2 + Azure VMs + GCP GCE)
    Average utilization:           CPU [Y]%, Memory [Z]%
    Reserved instances:            [X]% of total (target: > 70% for stable workloads)
    Spot instances:                [X]% of total (for fault-tolerant workloads)
    Idle instances:                [X] (no network traffic for 7+ days)
    Zombie instances:              [X] (running but no attached EBS/network)

Compute Growth Projection:

  Demand drivers:
    User growth:                   [X]% per quarter → [Y] additional VMs/instances
    Application growth:            [X] new services planned → [Y] additional compute
    Batch/ETL growth:              [X]% data growth → [Y] additional compute
    Peak factor:                   [X]x average (handle 95th percentile, not average)

  12-month projection:

    Quarter    Current VMs    Projected    Growth    Headroom    Action Required
    ────────   ───────────    ─────────   ────────  ──────────  ───────────────────
    Q1 (now)   500            500         —         15%         None
    Q2         500            550         +10%      10%         Monitor
    Q3         550            610         +11%      4%          ⚠️ Plan expansion
    Q4         610            670         +10%      -2%         🔴 Execute expansion

    Action: Procure 3 new hosts (Q2 delivery) or increase cloud budget by 20%

Right-Sizing Recommendations:

  Over-provisioned (downsize):
    VM-001: 8 vCPU, 32 GB RAM → 4 vCPU, 16 GB RAM (CPU avg: 8%, Mem avg: 15%)
    VM-002: 4 vCPU, 16 GB RAM → 2 vCPU, 8 GB RAM (CPU avg: 5%, Mem avg: 20%)
    VM-003: 4 vCPU, 8 GB RAM → 2 vCPU, 4 GB RAM (CPU avg: 12%, Mem avg: 30%)
    Estimated savings: 8 vCPU, 28 GB RAM → redeploy to other workloads

  Under-provisioned (upsize):
    VM-010: 2 vCPU, 4 GB RAM → 4 vCPU, 8 GB RAM (CPU avg: 85%, Mem avg: 90%)
    VM-011: 2 vCPU, 4 GB RAM → 4 vCPU, 8 GB RAM (CPU avg: 78%, Mem avg: 85%)
    Action: immediate resize to prevent performance degradation

  Consolidation opportunities:
    10 small VMs (1 vCPU each, < 10% CPU) → 1 medium VM (4 vCPU) with containers
    Savings: 9 VMs retired; reduced management overhead
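
The right-sizing rules in the framework above can be applied mechanically once utilization data is exported. The following is a minimal sketch under stated assumptions: the `VM` record, `classify`, and `reclaimed` are hypothetical names, and only the 30-day CPU thresholds (below 10%, above 80%) come from the assessment template. A real review would also weigh memory, I/O, and peak behavior.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    vcpu: int
    ram_gb: int
    cpu_avg_pct: float  # 30-day average CPU utilization


def classify(vm, low=10.0, high=80.0):
    """Label a VM using the 30-day CPU thresholds from the assessment."""
    if vm.cpu_avg_pct < low:
        return "over-provisioned"
    if vm.cpu_avg_pct > high:
        return "under-provisioned"
    return "ok"


def reclaimed(resizes):
    """Sum the vCPU and RAM (GB) freed by (before, after) size pairs."""
    vcpu = sum(before[0] - after[0] for before, after in resizes)
    ram = sum(before[1] - after[1] for before, after in resizes)
    return vcpu, ram


if __name__ == "__main__":
    print(classify(VM("VM-001", 8, 32, 8.0)))   # over-provisioned
    print(classify(VM("VM-010", 2, 4, 85.0)))   # under-provisioned
    # (vCPU, RAM GB) before and after, for the three downsizing examples.
    downsizes = [((8, 32), (4, 16)), ((4, 16), (2, 8)), ((4, 8), (2, 4))]
    print(reclaimed(downsizes))
```

Summing the freed resources this way keeps the savings line in step with the individual recommendations when the list changes.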

Storage Capacity Planning

STORAGE CAPACITY FRAMEWORK
============================

Current Storage Inventory:

  On-Premises Storage:
    SAN arrays:                    [X] arrays, [Y] TB total raw, [Z] TB usable
    NAS/file servers:              [X] TB total
    Tape library:                  [X] TB (for backup archive)
    Average utilization:           [X]%
    Growth rate:                   [Y] TB per month

  Cloud Storage:
    Block storage (EBS/Managed):   [X] TB, $[Y]/month
    Object storage (S3/Blob):      [X] TB, $[Y]/month
    File storage (EFS/DFS):        [X] TB, $[Y]/month
    Database storage:              [X] TB
    Backup storage:                [X] TB
    Archive storage (Glacier):     [X] TB
    Total cloud storage:           [X] TB, $[Y]/month

  Storage by tier:
    Hot (frequent access):         [X] TB — $[Y]/TB/month
    Warm (occasional access):      [X] TB — $[Y]/TB/month
    Cool (rare access):            [X] TB — $[Y]/TB/month
    Archive (annual access):       [X] TB — $[Y]/TB/month

Storage Growth Analysis:

  Historical growth (last 12 months):
    Month    Total TB    Growth TB    Growth %    Cost/Month    Trend
    ──────   ──────────  ──────────   ─────────   ────────────  ─────
    Jan      500         +8           +1.6%       $5,000        ↑
    Feb      508         +10          +2.0%       $5,080        ↑
    Mar      518         +12          +2.3%       $5,180        ↑
    Apr      530         +15          +2.8%       $5,300        ↑↑
    ...
    Dec      650         +20          +3.2%       $6,500        ↑↑

  Growth rate: accelerating (monthly growth rose from 1.6% in Jan to 3.2% in Dec)
  Projected 12-month: 650 TB → 950 TB (+46% growth, about 3.2% per month compounded)
  Alert: 80% threshold reached in [X] months
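
The "threshold reached in [X] months" figure can be derived from the current footprint and a compound monthly growth rate. This is a sketch only: the function names are hypothetical, the 1,000 TB capacity in the example is an assumed value that does not appear in the template, and a constant growth rate will understate demand if growth keeps accelerating.

```python
import math


def projected_tb(used_tb, monthly_growth, months):
    """Storage footprint after compounding growth for a number of months."""
    return used_tb * (1 + monthly_growth) ** months


def months_until(used_tb, capacity_tb, monthly_growth, threshold=0.80):
    """Whole months until used/capacity reaches the threshold.

    Returns 0 if the threshold has already been reached.
    """
    target = capacity_tb * threshold
    if used_tb >= target:
        return 0
    if monthly_growth <= 0:
        raise ValueError("growth must be positive to reach the threshold")
    return math.ceil(math.log(target / used_tb) / math.log(1 + monthly_growth))


if __name__ == "__main__":
    # 650 TB used and 3.2% monthly growth come from the sample table;
    # the 1,000 TB usable capacity is an assumed figure.
    print(f"{projected_tb(650, 0.032, 12):.0f} TB after 12 months")
    print(months_until(650, 1000, 0.032), "months to the 80% threshold")
```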

Storage Optimization:

  Data lifecycle policies:
    0–90 days:   Hot storage (SSD/NVMe) — active data
    90–180 days: Warm storage (S3 Standard-IA) — reduced access
    180–365 days: Cool storage (S3 Glacier Flexible Retrieval) — archive
    365+ days:   Deep archive (S3 Glacier Deep Archive) — compliance

  Deduplication and compression:
    Backups: 3:1 to 10:1 dedup ratio typical
    VM images: 2:1 to 5:1 dedup ratio
    Logs: 5:1 to 10:1 compression ratio
    Estimated savings: 40–60% with proper lifecycle management

  Wasted storage:
    Unattached volumes:           [X] TB ($[Y]/month wasted)
    Old snapshots:                [X] TB ($[Y]/month wasted)
    Empty buckets/containers:     [X] buckets
    Duplicate files:              [X] TB (identify with dedup tools)
    Oversized disks:              [X] disks > 80% free space
    Total reclaimable:            [X] TB ($[Y]/month savings)

  Storage alerts:
    70% utilization:  WARNING — begin planning expansion
    80% utilization:  PLANNING — submit procurement request
    90% utilization:  CRITICAL — immediate action required
    95% utilization:  EMERGENCY — emergency procurement; may impact operations
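
The four alert levels above map directly onto a lookup. The sketch below uses the thresholds and labels from the list; the function name `alert_level` and the "OK" label for anything under 70% are assumptions added for the example.

```python
# Thresholds and labels from the storage alert list, most severe first.
ALERT_LEVELS = [
    (95.0, "EMERGENCY"),
    (90.0, "CRITICAL"),
    (80.0, "PLANNING"),
    (70.0, "WARNING"),
]


def alert_level(used_tb, capacity_tb):
    """Return the alert label for a volume, or "OK" below 70%."""
    if capacity_tb <= 0:
        raise ValueError("capacity must be positive")
    pct = used_tb * 100 / capacity_tb
    for threshold, label in ALERT_LEVELS:
        if pct >= threshold:
            return label
    return "OK"


if __name__ == "__main__":
    for used in (650, 700, 850, 920, 960):
        print(used, "TB of 1000 TB:", alert_level(used, 1000))
```

Checking from the most severe level down means each volume reports only its highest applicable alert.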

Network Capacity Planning

NETWORK CAPACITY FRAMEWORK
============================

WAN Capacity:

  Current WAN links:
    Primary ISP:         [X] Gbps, average [Y]%, peak [Z]%
    Secondary ISP:       [X] Gbps, average [Y]%, peak [Z]%
    Backup (4G/5G):      [X] Mbps (emergency only)
    Direct Connect:      [X] Gbps (cloud interconnect)

  Utilization trend:
    Month    Avg Util    Peak (95th)   Growth     Status
    ──────   ──────────  ───────────   ─────────  ────────
    Jan      45%         65%           baseline   🟢
    Apr      52%         72%           +2.3%/mo   🟡 Watch
    Jul      60%         80%           +2.7%/mo   🟡 Plan
    Oct      68%         88%           +2.7%/mo   🔴 Act

  Projected: 80% average utilization in [X] months
  Recommendation: upgrade to [Y] Gbps or add secondary [Z] Gbps link
  Budget: $[X]/month for upgraded link

LAN/Datacenter Network:

  Core switch capacity:
    Switching capacity:   [X] Tbps
    Port density:         [X] ports (1G/10G/25G/40G/100G)
    Utilization:          [X]% of switching capacity
    Available ports:      [X] (for new servers)

  Top-of-rack (ToR) switches:
    Switches:             [X] × 48-port 10G/25G switches
    Uplinks:              [X] × 40G/100G to core
    Uplink utilization:   [X]% (alert > 70%)
    Available ports:      [X] per switch average

  Network growth drivers:
    New servers:          [X] planned → [Y] additional switch ports needed
    Bandwidth growth:     [X]% per quarter (video, backups, replication)
    Cloud traffic:        [X] Gbps to cloud (growing [Y]% per quarter)

Wireless Capacity:

  Access points:          [X] APs covering [Y] sq ft
  Clients per AP:         Average [Z] (target: < 30 for performance)
  Channel utilization:    Average [X]% (target: < 70%)
  Growth:                 [X] new clients per quarter
  Upgrade plan:           Wi-Fi 6/6E/7 upgrade in [X] months
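
Two calculations recur in the network framework: the 95th-percentile peak and the time until average utilization reaches 80%. The sketch below shows one way to compute each. The nearest-rank percentile method and the linear (percentage points per month) growth model are assumptions chosen for simplicity, and the function names are hypothetical.

```python
def percentile_95(samples):
    """95th percentile by the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n); avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def months_to_target(current_pct, growth_pts_per_month, target_pct=80.0):
    """Months until utilization reaches the target under linear growth."""
    if current_pct >= target_pct:
        return 0.0
    if growth_pts_per_month <= 0:
        raise ValueError("utilization is not growing")
    return (target_pct - current_pct) / growth_pts_per_month


if __name__ == "__main__":
    # Slope taken from the sample table: 60% in Jul to 68% in Oct.
    slope = (68 - 60) / 3
    print(f"{months_to_target(68, slope):.1f} months to 80% average utilization")
```

Because link upgrades often have long lead times, the projected figure is best compared against the provider's delivery time rather than used on its own.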

Cloud Capacity Planning

CLOUD CAPACITY AND COST FORECASTING
=====================================

Current Cloud Spend:

  Provider    Monthly     Annual    YoY Growth    % of Budget    Trend
  ─────────   ──────────  ────────  ────────────  ─────────────  ─────
  AWS         $45,000     $540,000  +25%          53%            ↑↑
  Azure       $25,000     $300,000  +15%          29%            ↑
  GCP         $10,000     $120,000  +40%          12%            ↑↑↑
  Other       $5,000      $60,000   +10%          6%             →
  ────────────────────────────────────────────────────────────────────────
  Total       $85,000     $1,020,000 +22%         100%           ↑

Cost by service category:

  Category            Monthly     % of Total    Growth Rate    Optimization
  ──────────────────  ──────────  ────────────  ────────────   ──────────────
  Compute (EC2/VMs)   $30,000     35%           +20%           Right-size, RI
  Database (RDS/etc)  $15,000     18%           +25%           Read replicas
  Storage (S3/Blob)   $10,000     12%           +30%           Lifecycle policies
  Network (data xfer) $8,000      9%            +35%           VPC endpoints
  Container (EKS/etc) $7,000      8%            +40%           Auto-scaling
  Managed Services    $8,000      9%            +15%           Evaluate build vs buy
  Other               $7,000      8%            +10%           Review quarterly

Capacity projection (12 months):

  Quarter    Compute    Storage    Network    Total      Headroom    Action
  ────────   ─────────  ─────────  ─────────  ─────────  ──────────  ────────
  Q1        $30K       $10K       $8K        $48K       26%         —
  Q2        $33K       $11K       $9K        $53K       18%         —
  Q3        $36K       $12K       $10K       $58K       11%         ⚠️
  Q4        $40K       $14K       $12K       $66K       -2%         🔴

  Budget cap: $65K/month (current); headroom is measured against this cap
  Gap: Q4 projected at $66K, about $1K over the budget cap
  Action: implement cost optimization initiatives by Q2
  Potential savings: $10K–$15K/month with RI + right-sizing + lifecycle
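
The quarterly totals and headroom can be recomputed from the per-category figures rather than maintained by hand. The sketch below uses the compute, storage, and network values from the projection table and the $65K monthly cap. Defining headroom as (cap - total) / cap is an assumption, since the template does not state the formula.

```python
BUDGET_CAP_K = 65  # monthly cap in $K, from the projection above

# (compute, storage, network) in $K per month, from the projection table.
QUARTERS = {
    "Q1": (30, 10, 8),
    "Q2": (33, 11, 9),
    "Q3": (36, 12, 10),
    "Q4": (40, 14, 12),
}


def budget_headroom(costs, cap=BUDGET_CAP_K):
    """Return (total, headroom %) with headroom = (cap - total) / cap."""
    total = sum(costs)
    return total, (cap - total) / cap * 100


for quarter, costs in QUARTERS.items():
    total, headroom = budget_headroom(costs)
    status = "over cap" if headroom < 0 else "within cap"
    print(f"{quarter}: ${total}K, headroom {headroom:.0f}% ({status})")
```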

Cloud capacity optimization:

  Reserved Instances / Savings Plans:
    Current coverage:    [X]% of eligible spend
    Target coverage:     > 75% for stable workloads
    Savings potential:   $[X]/month (15–30% discount)
    Implementation:      commit to 1-year or 3-year terms

  Auto-scaling:
    Current:             [X] instances running 24/7
    With auto-scaling:   [Y] instances during off-hours
    Savings:             [X-Y] instances × $[Z]/month

  Right-sizing:
    Over-provisioned:    [X] instances (CPU < 20% for 30 days)
    Right-size candidates: [X] instances → smaller types
    Savings:             $[Y]/month

  Spot instances:
    Eligible workloads:  [X]% (batch, CI/CD, testing, stateless)
    Savings:             60–90% vs. on-demand
    Risk:                instance interruption (handle gracefully)
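
The savings lines above can be turned into rough estimates before committing to a change. This is a sketch under stated assumptions: both function names are hypothetical, and the auto-scaling estimate prorates savings by the share of the month spent scaled down, which is a refinement added here and not part of the template's formula. The 60–90% spot discount range is the one quoted above.

```python
def autoscaling_savings(peak_instances, offhours_instances,
                        monthly_cost_per_instance, offhours_share):
    """Monthly savings from scaling down outside busy hours.

    offhours_share is the fraction of the month spent at the reduced
    instance count; it should be measured, not assumed.
    """
    if not 0 <= offhours_share <= 1:
        raise ValueError("offhours_share must be between 0 and 1")
    removed = peak_instances - offhours_instances
    return removed * monthly_cost_per_instance * offhours_share


def spot_savings_range(on_demand_monthly, eligible_share):
    """Low and high monthly savings at a 60-90% spot discount."""
    eligible = on_demand_monthly * eligible_share
    return eligible * 0.60, eligible * 0.90


if __name__ == "__main__":
    # Hypothetical inputs: 20 instances at peak, 8 off-hours, $150 each,
    # scaled down for 60% of the month.
    print(autoscaling_savings(20, 8, 150, 0.6))
    # Hypothetical: $30,000 on-demand compute, 25% spot-eligible.
    print(spot_savings_range(30_000, 0.25))
```

Spot estimates are an upper bound in practice, because interruptions add retry and orchestration cost.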

Integration Points

Datadog / New Relic: Real-time infrastructure monitoring; utilization trends; forecasting; anomaly detection
CloudHealth / Cloudability: Cloud cost management; rightsizing recommendations; reserved instance optimization; anomaly detection
ServiceNow Capacity Management: Integrated with CMDB; trend analysis; threshold management; forecasting
VMware vRealize Operations (rebranded as VMware Aria Operations): vSphere capacity planning; predictive analytics; right-sizing; what-if scenarios
Azure Advisor / AWS Compute Optimizer: Cloud-native right-sizing recommendations; utilization analysis
SolarWinds Storage Resource Monitor: Storage capacity tracking; growth forecasting; threshold alerting
NetBox / Device42: Infrastructure inventory; capacity tracking; documentation
Custom dashboards (Grafana, Power BI): Capacity visualization; trend analysis; executive reporting

Edge Cases

Seasonal capacity spikes (e-commerce holidays, quarterly reporting, tax season): Plan for peak capacity (not average); implement auto-scaling for cloud; maintain standby capacity for on-prem; pre-provision 2–4 weeks before peak; decommission/release after peak; monitor and right-size during peak
E-commerce: Black Friday/Cyber Monday = 3–10x normal traffic; pre-scale 2 weeks before; maintain for 1 week after
Financial: month-end close = 5–10x batch processing; schedule batch jobs to distribute load
Tax season: 3–5x normal for financial systems; additional capacity February–April
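
The seasonal guidance above (scale up before the peak, hold for a while after, size for a multiple of normal load) can be captured in a small planning helper. This is an illustrative sketch: `peak_plan` is a hypothetical name, the default lead and hold periods follow the e-commerce example (2 weeks before, 1 week after), and the dates and baseline in the usage example are made up.

```python
import math
from datetime import date, timedelta


def peak_plan(peak_start, peak_end, baseline_instances, peak_multiplier,
              lead_weeks=2, hold_weeks=1):
    """Return the scale-up date, release date, and instance count for a peak."""
    return {
        "scale_up_on": peak_start - timedelta(weeks=lead_weeks),
        "release_on": peak_end + timedelta(weeks=hold_weeks),
        "instances": math.ceil(baseline_instances * peak_multiplier),
    }


if __name__ == "__main__":
    # Example only: a four-day peak, 40 baseline instances, 5x traffic.
    plan = peak_plan(date(2025, 11, 28), date(2025, 12, 1), 40, 5)
    for key, value in plan.items():
        print(f"{key}: {value}")
```

For on-premises standby capacity the same dates apply, but the lead time usually has to be extended to cover hardware delivery.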

Rapid growth / scaling startup (10x user growth in 12 months): Cloud-first architecture for elasticity; auto-scaling from day one; capacity planning monthly (not quarterly); budget with 50–100% growth buffer; implement FinOps practices early; hire DevOps/SRE for infrastructure automation
Architecture: microservices, containers, auto-scaling, serverless where possible
Monitoring: real-time capacity dashboards; automated scaling policies; cost anomaly detection
Budget: 3x projected spend as budget ceiling; monthly budget review

Capacity constraints with budget freeze (business won't approve new spending): Optimize existing resources aggressively: consolidate servers, implement deduplication, right-size all resources, eliminate waste, negotiate cloud discounts; prioritize critical workloads; defer non-essential projects; implement strict chargeback/showback
Quick wins: terminate idle resources (save 10–20% cloud spend), consolidate underutilized servers (save 20–30%), implement storage lifecycle (save 30–50% storage)
Communication: show business impact of capacity shortage (revenue at risk, SLA breach)

Multi-cloud capacity (workloads across AWS, Azure, GCP): Unified capacity dashboard across providers; consistent monitoring tool; cross-cloud load balancing; avoid vendor lock-in for capacity flexibility; negotiate enterprise agreements with each provider; consolidate billing and reporting
Tooling: CloudHealth, Cloudability, or custom Terraform-based dashboard
Strategy: primary cloud for core workloads; secondary for disaster recovery and cost optimization
Cost: multi-cloud adds 10–20% management overhead; ensure business justification

Edge/IoT capacity (thousands of edge devices): Lightweight monitoring agents; edge aggregation (reduce cloud data transfer); predictive maintenance (replace before failure); over-the-air update management; device lifecycle automation; battery life consideration for monitoring frequency
Architecture: edge gateway per 100 devices; gateway handles local processing and aggregation
Monitoring: heartbeat every 5–60 minutes; event-driven reporting for anomalies
Capacity: plan for 10–20% device replacement rate per year; maintain spare inventory
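
The edge sizing rules above (one gateway per 100 devices, 10–20% of devices replaced per year) translate into simple arithmetic. The sketch below is a rough planning aid, not a sizing standard: `edge_fleet_plan` is a hypothetical name and the fleet size in the example is invented.

```python
def ceil_div(numerator, denominator):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-numerator // denominator)


def edge_fleet_plan(devices, devices_per_gateway=100,
                    replacement_pct=(10, 20)):
    """Gateways needed and the yearly replacement range for a device fleet."""
    low_pct, high_pct = replacement_pct
    return {
        "gateways": ceil_div(devices, devices_per_gateway),
        "replacements_per_year": (ceil_div(devices * low_pct, 100),
                                  ceil_div(devices * high_pct, 100)),
    }


if __name__ == "__main__":
    # Hypothetical fleet of 2,550 devices.
    print(edge_fleet_plan(2550))
```

The upper end of the replacement range is the safer basis for spare inventory when devices are hard to reach or slow to ship.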

Disaster recovery capacity (standby capacity for failover): Pilot light (minimal capacity, scale on failover) vs. warm standby (partial capacity) vs. hot standby (full capacity); cost tradeoff: hot standby = 50–100% of production cost; pilot light = 10–20%
RTO/RPO alignment: shorter RTO requires more standby capacity
Testing: quarterly failover test (validate capacity and procedures)
Cloud advantage: cloud DR significantly cheaper than on-prem DR (pay for what you use)

Disclaimer: All rights reserved by Circulos AI. These skills are specifically designed for Claude Code, Claude Cowork, Codex, and OpenClaw. When using or referencing any skill, please provide proper attribution to Circulos AI.