IT AI Skill

Network Performance Monitoring

Monitor network health, bandwidth utilization, latency, and performance across all network infrastructure. Use when diagnosing network issues, tracking bandwidth usage, monitoring network device health, detecting network anomalies, optimizing network performance, planning network capacity, or investigating connectivity problems. Triggers on phrases like "network monitoring", "bandwidth monitoring", "network performance", "latency monitoring", "packet loss", "network diagnostics", "network capacity", "SNMP monitoring", "NetFlow analysis", "network health".

Comprehensive monitoring of network infrastructure performance, capacity, and reliability.

Workflow

Discover and inventory all network devices (routers, switches, firewalls, load balancers, WAPs).
Deploy monitoring agents/SNMP collectors on all network infrastructure.
Configure baseline metrics collection: bandwidth, latency, packet loss, error rates, device health.
Implement NetFlow/sFlow/IPFIX collection for application-level traffic analysis.
Build network topology maps (auto-discovery + manual validation).
Create performance dashboards: real-time, historical trends, SLA compliance.
Configure alerts for threshold breaches: bandwidth > 80%, latency > baseline × 2, packet loss > 1%.
Implement anomaly detection: DDoS patterns, unusual traffic volumes, routing anomalies.
Conduct weekly network performance reviews and monthly capacity planning.
Generate quarterly network health reports for stakeholders.

Network Device Monitoring

NETWORK DEVICE MONITORING MATRIX
=================================

Routers (Edge/Core):
  Metrics to collect:
    CPU utilization:        Alert > 75%, Critical > 90%
    Memory utilization:     Alert > 80%, Critical > 95%
    Interface utilization:  Alert > 70%, Critical > 85% (per interface)
    Route table size:       Alert if growing > 10% weekly (potential route leak)
    Uptime:                 Track; alert on reboot
    BGP sessions:           Alert on session flap/down (peer count vs expected)
    NTP sync status:        Alert if unsynchronized (critical for logging/analysis)

  SNMP OIDs commonly monitored:
    1.3.6.1.2.1.1.3.0         — sysUpTime
    1.3.6.1.2.1.2.2.1.10      — ifInOctets (bytes in)
    1.3.6.1.2.1.2.2.1.16      — ifOutOctets (bytes out)
    1.3.6.1.2.1.2.2.1.13      — ifInDiscards
    1.3.6.1.2.1.2.2.1.19      — ifOutDiscards
    1.3.6.1.2.1.2.2.1.14      — ifInErrors
    1.3.6.1.2.1.2.2.1.20      — ifOutErrors
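
The octet counters above are cumulative, so utilization has to be derived from the difference between two polls. The following is a minimal sketch, assuming the poller already holds two counter readings and the interface speed in bits per second; the function name and parameters are illustrative and not part of any SNMP library. It tolerates a single wrap of a 32-bit counter, which is one reason the 64-bit ifHCInOctets/ifHCOutOctets counters are usually preferred on fast links.

```python
def utilization_pct(prev_octets: int, curr_octets: int, interval_s: float,
                    if_speed_bps: int, counter_bits: int = 32) -> float:
    """Percent utilization of one direction of an interface between two polls."""
    if interval_s <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and interface speed must be positive")
    delta = curr_octets - prev_octets
    if delta < 0:
        # The counter wrapped; this assumes at most one wrap per interval.
        delta += 2 ** counter_bits
    bits = delta * 8  # octets to bits
    return 100.0 * bits / (interval_s * if_speed_bps)
```

For example, 1,875,000,000 octets in 300 seconds on a 100 Mbps interface corresponds to 50% utilization.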

Switches (Access/Distribution/Aggregation):
  Metrics to collect:
    Port utilization:        Alert > 70% per port
    Port errors:             Alert on any CRC errors (indicates physical issue)
    Port flaps:              Alert on > 3 state changes per hour
    Spanning Tree:           Alert on topology change (TCN)
    VLAN status:             Alert on unexpected VLAN changes
    ARP table size:          Alert if growing abnormally (ARP scan or DoS)
    PoE budget:              Alert > 80% (if PoE switches for phones/APs)
    Power supply status:     Alert on PSU failure (redundancy lost)

  Port monitoring best practices:
    - Monitor every access port (not just uplinks)
    - Alert on access ports with sustained > 50% utilization (potential loop or misconfig)
    - Track port speed/duplex mismatches (half-duplex on Gigabit = performance issue)
    - Monitor storm control events (broadcast, multicast, unicast storms)

Firewalls:
  Metrics to collect:
    Session count:           Alert > 80% of max sessions (capacity approaching)
    Throughput:              Alert > 70% of rated throughput (includes inspection)
    CPU/Memory:              Alert > 75%
    Rule hit counts:         Track top 10 rules; identify unused rules quarterly
    Threat detection events: Alert on any high-severity threats
    VPN tunnel status:       Alert on tunnel down (site-to-site, remote access)
    SSL inspection load:     Alert if SSL inspection causing > 200ms latency
    License expiration:      Alert 90, 60, 30 days before expiry

  Firewall performance note:
    - Throughput varies significantly by feature set enabled
    - Example (illustrative figures; check the vendor datasheet): Palo Alto PA-3260 rated at 5 Gbps throughput
      * Without features: 5 Gbps
      * With IDS + URL filtering: 2.5 Gbps
      * With full inspection: 1 Gbps
    - Monitor actual throughput vs. rated throughput with current feature set

Load Balancers (F5, HAProxy, Nginx, ALB/NLB):
  Metrics to collect:
    Active connections:      Alert > 70% of max
    Connection rate:         Track new connections per second
    Response time:           Alert > baseline × 2
    Backend health:          Alert on any backend server marked down
    SSL termination:         Track handshake rate, certificate expiry
    Pool member utilization: Alert on uneven distribution (> 30% variance)
    Throughput:              Alert > 80% of capacity

  Load balancing health checks:
    - HTTP/HTTPS health check every 10 seconds
    - TCP health check every 5 seconds
    - Failure threshold: 3 consecutive failures → mark down
    - Recovery threshold: 2 consecutive successes → mark up
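
The failure and recovery thresholds above amount to a small state machine. A minimal sketch follows, assuming one object per backend that is fed each health-check result in order; the class and field names are hypothetical and not taken from any load balancer's API.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    fail_threshold: int = 3      # consecutive failures before marking down
    recover_threshold: int = 2   # consecutive successes before marking up
    is_up: bool = True
    _streak: int = 0             # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Feed one health-check result and return the resulting up/down state."""
        if check_passed == self.is_up:
            self._streak = 0     # result agrees with the current state
            return self.is_up
        self._streak += 1
        needed = self.fail_threshold if self.is_up else self.recover_threshold
        if self._streak >= needed:
            self.is_up = not self.is_up
            self._streak = 0
        return self.is_up
```

Requiring consecutive results in both directions keeps a single dropped probe from removing a healthy backend, and a single lucky probe from restoring a failing one.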

Bandwidth and Traffic Analysis

BANDWIDTH MONITORING FRAMEWORK
================================

WAN Link Monitoring:

  Typical enterprise WAN configuration:
    Primary ISP:             1 Gbps dedicated (cost: $1,000–$5,000/month)
    Secondary ISP:           500 Mbps dedicated (cost: $500–$2,500/month)
    Backup (4G/5G):          100 Mbps (cost: $100–$300/month)
    Internet breakout:       Direct or via cloud (AWS Direct Connect, Azure ExpressRoute)

  Monitoring per WAN link:
    - Inbound bandwidth (Mbps): 5-minute average, peak, 95th percentile
    - Outbound bandwidth (Mbps): 5-minute average, peak, 95th percentile
    - Utilization percentage: vs. provisioned capacity
    - Packet loss: < 0.1% acceptable, > 1% requires investigation
    - Latency: < 20ms domestic, < 100ms international (typical)
    - Jitter: < 10ms for VoIP/video, < 30ms general traffic

  Utilization thresholds:
    🟢 Normal:  < 60% average (headroom for spikes)
    🟡 Warning:  60–75% average (monitor for growth)
    🔴 Critical: 75–85% average (plan upgrade)
    ⚫ Emergency: > 85% (congestion likely, upgrade immediately)

  95th percentile billing:
    - Many ISPs bill on 95th percentile usage
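    - The highest 5% of usage samples in the billing period are discarded, so short bursts are not billed
    - Monitor to optimize billing
    - Example: 1 Gbps link, 95th percentile = 400 Mbps
      * Billable: 400 Mbps (not 1 Gbps)
      * Bursts above 400 Mbps do not raise the bill while they total under 5% of the period (about 36 hours in a 30-day month)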
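
A minimal sketch of how a 95th percentile figure can be computed from polled samples, assuming equally spaced samples (for example 5-minute averages) for one direction of one link. Providers differ in details such as sample interval and whether inbound and outbound are combined, so treat this as an approximation of the billing method rather than any ISP's exact formula; the function name is illustrative.

```python
def billable_p95(samples_mbps: list[float]) -> float:
    """Return the highest sample left after discarding the top 5% of samples."""
    if not samples_mbps:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples_mbps)
    n = len(ordered)
    # ceil(0.95 * n) computed with integers to avoid floating-point rounding
    keep = (95 * n + 99) // 100
    return ordered[keep - 1]
```

With 100 samples, up to 5 of them can be bursts without changing the result; a sixth burst sample moves the 95th percentile up to the burst level.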

LAN/WLAN Monitoring:

  Switched LAN:
    - Uplink utilization: monitor all distribution-to-core uplinks
    - Access switch port utilization: identify bandwidth-heavy segments
    - Inter-VLAN routing: monitor routing throughput on L3 switches
    - Broadcast traffic: alert if > 5% of total traffic (indicates problem)

  Wireless LAN (Wi-Fi):
    - AP utilization: CPU, memory, associated client count per AP
    - Client count per AP: target < 30 clients/AP (performance degrades above)
    - Signal strength (RSSI): target > -67 dBm for good connectivity
    - Channel utilization: < 70% per channel (avoid co-channel interference)
    - Client throughput: average per client (typical: 50–200 Mbps on Wi-Fi 6)
    - Roaming success rate: > 95% (client moves between APs seamlessly)

NetFlow/sFlow/IPFIX Analysis:

  What NetFlow shows:
    - Top talkers (source/destination IPs generating most traffic)
    - Top protocols (TCP, UDP, ICMP, application protocols)
    - Top port pairs (which services are most used)
    - Geographic distribution of traffic
    - Application identification (with deep packet inspection)

  NetFlow deployment:
    - Enable on all core/distribution router and switch interfaces
    - Export to collector (SolarWinds NTA, Plixer, Kentik, Elastic)
    - Sampling rate: 1:100 on 10Gbps links, 1:10 on 1Gbps links, 1:1 on 100Mbps
    - Retention: 30 days detailed, 1 year aggregated (monthly summaries)
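
With sampled flow export, byte counts in the records describe only the sampled packets and have to be scaled back up. The sketch below assumes simple 1:N packet sampling and that the collector has not already applied the multiplier (some collectors do). The function names are illustrative, and the handling of link speeds between the three tiers listed above is an assumption.

```python
def sampling_rate_for(link_mbps: float) -> int:
    """Pick N for 1:N sampling from the link speed, following the tiers above."""
    if link_mbps >= 10_000:
        return 100
    if link_mbps >= 1_000:
        return 10
    return 1  # unsampled on 100 Mbps-class links


def estimate_bytes(sampled_bytes: int, sampling_rate: int) -> int:
    """Scale bytes seen in sampled flow records to an estimate of actual volume."""
    if sampling_rate < 1:
        raise ValueError("sampling_rate must be >= 1")
    return sampled_bytes * sampling_rate


def estimate_mbps(sampled_bytes: int, sampling_rate: int, window_s: float) -> float:
    """Estimated average rate in Mbps over a collection window."""
    return estimate_bytes(sampled_bytes, sampling_rate) * 8 / window_s / 1_000_000
```

The estimate is statistical: it is reasonable for large aggregates such as top talkers, and less reliable for small or short-lived flows.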

  Traffic analysis use cases:
    1. Capacity planning: identify growth trends, forecast when to upgrade links
    2. Security analysis: detect DDoS, data exfiltration, lateral movement
    3. Application performance: identify bandwidth-heavy applications
    4. Cost optimization: right-size WAN links based on actual usage
    5. Compliance: monitor sensitive data transfers

Network Latency and Diagnostics

LATENCY MONITORING AND DIAGNOSTICS
====================================

Latency measurement methods:

  1. Ping (ICMP Echo):
     - Measure round-trip time (RTT) to key destinations
     - Frequency: every 30 seconds for critical paths, every 5 minutes for others
     - Alert: RTT > 2× baseline, or > 3 consecutive timeouts
     - Targets:
       * Intra-datacenter: < 1ms
       * Intra-region (same city): < 5ms
       * Inter-region (same country): < 20ms
       * Intercontinental: < 100ms
       * Internet (average): < 50ms
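
The ping alert rule above (RTT more than twice the baseline, or more than 3 consecutive timeouts) can be sketched as a small function. This assumes the probe history is available as a list with None marking a timeout; the function name and return values are illustrative.

```python
from typing import Optional, Sequence


def ping_alert(rtts_ms: Sequence[Optional[float]], baseline_ms: float) -> Optional[str]:
    """Evaluate the most recent probes; return an alert reason or None.

    rtts_ms holds probe results oldest-first, with None for a timeout.
    """
    trailing_timeouts = 0
    for rtt in reversed(rtts_ms):
        if rtt is not None:
            break
        trailing_timeouts += 1
    if trailing_timeouts > 3:
        return "timeouts"
    latest = rtts_ms[-1] if rtts_ms else None
    if latest is not None and latest > 2 * baseline_ms:
        return "latency"
    return None
```

In practice the latency condition is often applied to a short rolling average rather than a single probe to reduce noise; that refinement is left out here.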

  2. Traceroute:
     - Map path to destination (hop-by-hop)
     - Run on-demand during incident investigation
     - Identify routing anomalies, asymmetric routing, black holes
     - Automated: daily to top 20 destinations (track route changes)

  3. Synthetic HTTP/HTTPS Testing:
     - Simulate user requests to critical web applications
     - Measure DNS resolution time, TCP handshake, SSL handshake, TTFB
     - Frequency: every 60 seconds for critical services
     - Tools: ThousandEyes, Catchpoint

  4. Application-Level Latency:
     - API endpoint response time monitoring
     - Database query latency from application servers
     - Service-to-service communication latency (microservices)
     - End-user experience metrics (RUM — Real User Monitoring)

Latency baseline by environment:

  Datacenter internal:
    Same rack:              0.01–0.1ms
    Same datacenter:        0.1–1ms
    Same region (cloud):    0.5–2ms
    Cross-region (cloud):   10–80ms within a continent (AWS us-east-1 to us-west-2 is typically 60–70ms)

  Internet:
    Same city:              1–10ms
    Same country:           10–40ms
    Cross-continent:        50–200ms
    Global (worst case):    100–300ms

  Cloud to cloud:
    AWS to Azure (same region):  10–30ms
    AWS to GCP (same region):    5–20ms
    AWS Direct Connect:          1–5ms (dedicated connection)

Diagnostic procedures for common network issues:

  Issue: Intermittent connectivity
    1. Continuous ping test (mtr to destination showing per-hop stats)
    2. Check interface error counters (CRC errors, collisions)
    3. Check for duplex mismatch (e.g. on Cisco IOS: show interfaces | include duplex)
    4. Check for physical layer issues (cable, SFP module)
    5. Check for STP blocking ports
    6. Check for ACL/route changes coinciding with issue onset

  Issue: High latency
    1. Ping test from multiple points (isolate: local network, WAN, destination)
    2. Check bandwidth utilization (is link saturated?)
    3. Check routing path (traceroute, BGP path analysis)
    4. Check for microbursts (port buffer overruns)
    5. Check for MTU issues (path MTU discovery, fragmentation)
    6. Check QoS policies (is traffic being deprioritized?)

  Issue: Packet loss
    1. Identify where loss occurs (per-hop ping/mtr)
    2. Check interface discard/error counters
    3. Check for congestion (buffer overflows, queue drops)
    4. Check for MTU mismatches (oversized packets with the DF bit set are dropped instead of fragmented)
    5. Check for security device drops (firewall, IPS)
    6. Check for hardware issues (faulty SFP, cable)

Network Topology and Mapping

NETWORK TOPOLOGY MANAGEMENT
==============================

Topology documentation requirements:

  Physical topology:
    - Datacenter rack diagrams (device placement, cable connections)
    - Floor plans with device locations
    - Cable labels and patch panel mappings
    - Power connections (PDU, circuit breakers)
    - Tool: Visio, Lucidchart, Draw.io

  Logical topology:
    - Network diagram (layers: access, distribution, core)
    - VLAN assignments and IP addressing scheme
    - Routing protocols (OSPF, BGP, EIGRP) with area/autonomous system design
    - Firewall zones and trust relationships
    - DNS infrastructure (primary, secondary, conditional forwarding)
    - DHCP scope assignments
    - Tool: SolarWinds Network Topology Mapper, PRTG Network Monitor

  Cloud network topology:
    - VPC/VNet architecture (subnets, route tables, NAT gateways)
    - Transit gateway / virtual network gateway connections
    - Direct Connect / ExpressRoute circuits
    - Cloud firewall rules and security groups
    - Load balancer architecture
    - Tool: AWS Network Manager, Azure Network Watcher, cloud provider consoles

Auto-discovery protocols:
  CDP (Cisco Discovery Protocol)     — Cisco devices only
  LLDP (Link Layer Discovery Protocol) — Vendor-neutral, recommended
  SNMP (v2c/v3)                     — Device info, interface status, routing table
  ARP table analysis                — IP-to-MAC mapping (pair with MAC address tables for Layer 2 connectivity)

Topology change management:
  - Daily topology scans (detect new devices, link changes)
  - Alert on unauthorized device connections
  - Change approval workflow for network modifications
  - Before/after documentation for every change
  - Version control for network diagrams

Network Capacity Planning

NETWORK CAPACITY PLANNING FRAMEWORK
=====================================

Capacity planning methodology:

  Step 1: Current baseline (monthly assessment)
    - Average bandwidth utilization per link: [X]%
    - Peak bandwidth utilization per link: [Y]% (95th percentile)
    - Growth rate: [Z]% per quarter
    - Current headroom: [100 - Y]%

  Step 2: Traffic growth projection
    - Historical growth: average [X]% per quarter over past 2 years
    - Business-driven growth: [Y]% from new initiatives (M&A, new services)
    - Technology-driven growth: [Z]% from 4K video, IoT, etc.
    - Combined projected growth: [X + Y + Z]% per quarter

  Step 3: Capacity threshold analysis
    - Current capacity: [X] Mbps/Gbps
    - 80% utilization threshold: [X × 0.8] (trigger for planning upgrade)
    - Time to 80% threshold: [N] quarters at current growth rate
    - Recommended upgrade timeline: [N+1] quarters (plan before needed)

  Step 4: Upgrade options analysis
    Option A: Increase existing link capacity
      - Current: 1 Gbps → Upgrade: 10 Gbps
      - Cost: $3,000–$15,000/month (depends on ISP, location)
      - Lead time: 2–8 weeks
      - Minimal disruption (non-disruptive upgrade possible)

    Option B: Add redundant link
      - Add second 1 Gbps link from different ISP
      - Cost: $1,000–$5,000/month
      - Provides redundancy; adds 100% capacity only when both links carry traffic
      - Load balance (active/active, doubles capacity) or active/standby (redundancy only)

    Option C: SD-WAN optimization
      - Combine multiple lower-cost links (MPLS + broadband + 4G/5G)
      - Application-aware routing (critical apps over best path)
      - Cost savings: 30–50% vs. pure MPLS
      - Complexity increase: moderate (requires SD-WAN vendor: VMware, Fortinet, Viptela)

  Example calculation:

    Current: 1 Gbps link, average 45%, peak 75% (95th percentile)
    Growth: 15% per quarter

    Quarter 1: 45% → 51.75% (Q1 peak: 86.25%) — WARNING, peak already above the 80% planning threshold
    Quarter 2: 51.75% → 59.5% (Q2 peak: 99.2%) — CRITICAL, will congest
    Quarter 3: 59.5% → 68.4% (Q3 peak: 114.1%) — OVER CAPACITY

    Recommendation: Upgrade to 10 Gbps within 1 quarter
    Cost-benefit: $5,000/month additional cost vs. $500K/hour revenue impact of outage
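
A sketch of the compounding arithmetic behind a projection like this one, assuming the same growth rate applies to both average and peak utilization every quarter. The function names are illustrative.

```python
def project(avg_pct: float, peak_pct: float, growth: float,
            quarters: int) -> list[tuple[int, float, float]]:
    """Return (quarter, average %, peak %) rows with compound quarterly growth."""
    rows = []
    for q in range(1, quarters + 1):
        factor = (1 + growth) ** q
        rows.append((q, round(avg_pct * factor, 2), round(peak_pct * factor, 2)))
    return rows


def quarters_until(start_pct: float, growth: float, threshold_pct: float = 80.0) -> int:
    """Whole quarters of compound growth until utilization reaches the threshold."""
    if growth <= 0:
        raise ValueError("growth must be positive to ever reach the threshold")
    quarters, value = 0, start_pct
    while value < threshold_pct:
        value *= 1 + growth
        quarters += 1
    return quarters


if __name__ == "__main__":
    for q, avg, peak in project(45.0, 75.0, 0.15, 3):
        print(f"Quarter {q}: average {avg}%, peak {peak}%")
```

Projecting the peak separately from the average matters because the peak crosses the planning threshold several quarters before the average does.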

  Monitoring indicators for capacity planning:

    🟢 Healthy:
      - Average utilization < 50%
      - Peak (95th) utilization < 70%
      - Growth rate < 10% per quarter

    🟡 Plan Ahead:
      - Average utilization 50–65%
      - Peak utilization 70–80%
      - Growth rate 10–20% per quarter

    🔴 Act Now:
      - Average utilization > 65%
      - Peak utilization > 80%
      - Growth rate > 20% per quarter
      - Packet loss > 0.1% on any link
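
The three indicator bands above can be applied mechanically per link. A minimal sketch, assuming the worst individual metric decides the overall status and that values exactly on a boundary (65%, 80%, 20%) fall into the lower band; both choices are assumptions, as are the function name and return strings.

```python
def capacity_status(avg_pct: float, peak_pct: float, growth_pct: float,
                    loss_pct: float = 0.0) -> str:
    """Classify a link as 'healthy', 'plan_ahead' or 'act_now'."""
    if avg_pct > 65 or peak_pct > 80 or growth_pct > 20 or loss_pct > 0.1:
        return "act_now"
    if avg_pct >= 50 or peak_pct >= 70 or growth_pct >= 10:
        return "plan_ahead"
    return "healthy"
```

For instance, a link averaging 45% with a 75% peak and 15% quarterly growth is classified as "plan_ahead" because of its peak and growth rate, even though its average is still in the healthy band.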

Network Performance Dashboards

NETWORK PERFORMANCE DASHBOARD LAYOUT
======================================

Dashboard 1: Executive Overview (for IT Director/VP)
  - Network uptime: [99.9X]% (last 30 days)
  - Total bandwidth: [X] Gbps provisioned, [Y] Gbps average utilized
  - Active alerts: [number] (🟢 < 3, 🟡 3–10, 🔴 > 10)
  - Top issues this week: [list]
  - SLA compliance: [X]% (target: 99.9%)
  - Monthly trend: bandwidth usage, incidents, SLA

Dashboard 2: Real-Time Network Health (for Network Team)
  - WAN links: real-time bandwidth per link (bar chart, color-coded by utilization)
  - Core router/switch health: CPU, memory, uptime (table)
  - Active BGP sessions: [X/Y] up
  - Firewall session count: [X] of [max]
  - Active threats: [number] blocked today
  - Top 10 talkers: bandwidth by source IP

Dashboard 3: Latency and Connectivity (for Operations)
  - Latency to critical destinations (time series, 24-hour view)
  - Packet loss by link (heatmap)
  - DNS resolution time (average, p95, p99)
  - VPN tunnel status (up/down, throughput)
  - Wireless AP health (connected clients, signal quality)

Dashboard 4: Capacity and Trends (for Planning)
  - Bandwidth utilization trends (30/60/90-day views)
  - Growth rate by link (quarter-over-quarter)
  - Projected capacity exhaustion dates
  - 95th percentile utilization by link (billing optimization)
  - Peak vs. average utilization ratio (right-sizing indicator)

Integration Points

SolarWinds Network Performance Monitor (NPM): Comprehensive network monitoring; auto-discovery, topology mapping, alerting, reporting; SNMP-based
PRTG Network Monitor: All-in-one monitoring; packet sniffing, flow monitoring, wireless monitoring; flexible sensor-based pricing
SolarWinds Network Traffic Analyzer (NTA): NetFlow/sFlow analysis; top talkers, application visibility, bandwidth hogs
Kentik: Cloud-based network performance monitoring; global visibility; real-time traffic analysis; DDoS detection
ThousandEyes: Internet and SaaS performance monitoring; agentless; global vantage points; user-experience focused
Wireshark: Packet analysis (on-demand, not continuous); deep protocol analysis; troubleshooting complex issues
Cisco DNA Center / Meraki Dashboard: Cisco-specific network management; AI-driven insights; automated remediation
Palo Alto Networks PAN-OS / FortiGate GUI: Firewall-specific monitoring; threat intelligence; traffic analysis
Datadog Network Performance Monitoring: Unified network monitoring integrated with application and infrastructure metrics
Smokeping / MRTG: Open-source latency and bandwidth monitoring; simple, reliable

Edge Cases

Microbursts (sub-second traffic spikes that average monitoring misses): Standard 5-minute polling misses sub-second bursts; use port-level buffer and queue-drop statistics that capture microburst overflows; enable high-frequency streaming telemetry where the platform supports it for sub-second granularity; consider hardware solutions: buffer-optimized switches, ETS (Enhanced Transmission Selection) priority queuing
Impact: intermittent packet loss, TCP retransmissions, application timeouts
Detection: check interface overrun/discard counters (not SNMP averages)
Mitigation: increase switch buffer sizes, enable flow-based rate limiting, QoS prioritization

Asymmetric routing (traffic takes different paths in each direction): Causes stateful firewall issues, load balancer problems, monitoring confusion; detect by comparing traceroute in both directions; fix by ensuring consistent routing (equal-cost multi-path hashing, route policy)
Detection: traceroute from A→B differs from B→A; firewall drops return traffic
Fix: implement ECMP with consistent hash; ensure firewall policy allows return traffic from any path; verify BGP route preference

Cloud network performance variability (shared infrastructure in AWS/Azure/GCP): Cloud networking is shared tenancy, so performance varies by instance type, region load, and time of day; use enhanced network interfaces (AWS ENA, Azure Accelerated Networking); avoid cold-start latency with pre-warmed connections; test performance baselines per region and instance type
AWS: Nitro-based instances support ENA enhanced networking; bandwidth depends on instance size, and smaller sizes advertise "up to" burst rates with a lower sustained baseline
Azure: Enable Accelerated Networking (SR-IOV) on supported VM sizes for lower latency and more consistent throughput
GCP: Egress bandwidth scales with machine type and vCPU count; verify the per-VM limit for the chosen machine type

IoT/edge device network monitoring (thousands of small devices): Standard SNMP monitoring doesn't scale to 10,000+ IoT devices; use lightweight protocols (CoAP, MQTT); implement edge aggregation gateways; monitor at gateway level with periodic device heartbeat; use time-series databases optimized for high cardinality
Gateway approach: 1 gateway per 100–500 IoT devices; gateway reports aggregate metrics
Heartbeat: devices send heartbeat every 5–60 minutes (battery life tradeoff)
Anomaly detection: ML models trained on normal IoT device behavior
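
Heartbeat monitoring at the gateway reduces to checking when each device was last heard from. A minimal sketch, assuming the gateway keeps a mapping of device ID to last-heartbeat timestamp in seconds; the tolerance of two missed heartbeats is an assumption, and all names are illustrative.

```python
def stale_devices(last_seen: dict[str, float], now: float, interval_s: float,
                  missed_allowed: int = 2) -> list[str]:
    """Return IDs of devices that have missed more than `missed_allowed` heartbeats."""
    cutoff = now - interval_s * (missed_allowed + 1)
    return sorted(dev for dev, ts in last_seen.items() if ts < cutoff)
```

Allowing a couple of missed heartbeats before flagging a device avoids alerting on a single lost message from a battery-powered sensor on a lossy link.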

Multi-tenant network monitoring (hosting/SaaS environments): Isolate monitoring data per tenant (privacy requirement); implement per-tenant bandwidth quotas and monitoring; alert on tenant-level anomalies without exposing cross-tenant data; use network segmentation (VLANs, VRFs) for isolation
VRF (Virtual Routing and Forwarding): separate routing tables per tenant
Per-tenant NetFlow collection and analysis
Bandwidth policing: rate limit per tenant to prevent noisy neighbor impact
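
One common way to implement per-tenant policing is a token bucket. The sketch below illustrates that technique in software and is not a description of any vendor's policer; the class name, units, and the explicit `now` argument (used to keep the example deterministic) are assumptions.

```python
class TokenBucket:
    """Allow traffic at rate_bps on average, with bursts up to burst_bits."""

    def __init__(self, rate_bps: float, burst_bits: float, start: float = 0.0):
        self.rate_bps = rate_bps
        self.capacity = burst_bits
        self.tokens = burst_bits  # start with a full bucket
        self.last = start

    def allow(self, packet_bits: float, now: float) -> bool:
        """Return True if the packet conforms; otherwise it would be dropped."""
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_bps)
        self.last = now
        if packet_bits <= self.tokens:
            self.tokens -= packet_bits
            return True
        return False
```

Keeping one bucket per tenant means a tenant that exceeds its rate only exhausts its own tokens, which is what limits noisy-neighbor impact.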

Network monitoring during maintenance windows (expected changes cause false alerts): Suppress alerts during approved maintenance windows; document expected changes and metric impact; verify post-maintenance baseline; auto-resolve alerts opened during maintenance if metrics return to normal
Maintenance window registration: 24-hour advance notice in monitoring system
Alert suppression: disable specific alerts for specific devices during window
Post-maintenance verification: automated check that all metrics returned to baseline
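
Alert suppression during a registered window can be sketched as a lookup before an alert is raised. This assumes windows are registered per device with a start and end time; the class and function names are hypothetical and not tied to any monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MaintenanceWindow:
    device: str
    start: datetime
    end: datetime


def is_suppressed(device: str, alert_time: datetime,
                  windows: list[MaintenanceWindow]) -> bool:
    """True if the alert falls inside a registered window for that device."""
    return any(
        w.device == device and w.start <= alert_time < w.end
        for w in windows
    )
```

Scoping suppression to a specific device and time range, rather than muting alerting globally, keeps unrelated failures visible while the maintenance is in progress.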

Disclaimer: All rights reserved by Circulos AI. These skills are specifically designed for Claude Code, Claude Cowork, Codex, and OpenClaw. When using or referencing any skill, please provide proper attribution to Circulos AI.