IT AI Skill
Network Performance Monitoring
Monitor network health, bandwidth utilization, latency, and performance across all network infrastructure. Use when diagnosing network issues, tracking bandwidth usage, monitoring network device health, detecting network anomalies, optimizing network perfor...
Comprehensive monitoring of network infrastructure performance, capacity, and reliability.
Workflow
- Discover and inventory all network devices (routers, switches, firewalls, load balancers, WAPs).
- Deploy monitoring agents/SNMP collectors on all network infrastructure.
- Configure baseline metrics collection: bandwidth, latency, packet loss, error rates, device health.
- Implement NetFlow/sFlow/IPFIX collection for application-level traffic analysis.
- Build network topology maps (auto-discovery + manual validation).
- Create performance dashboards: real-time, historical trends, SLA compliance.
- Configure alerts for threshold breaches: bandwidth > 80%, latency > baseline × 2, packet loss > 1%.
- Implement anomaly detection: DDoS patterns, unusual traffic volumes, routing anomalies.
- Conduct weekly network performance reviews and monthly capacity planning.
- Generate quarterly network health reports for stakeholders.
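The alert thresholds listed in the workflow (bandwidth > 80%, latency > baseline × 2, packet loss > 1%) could be evaluated along these lines. This is a minimal sketch: `LinkSample` and `evaluate_alerts` are hypothetical names chosen for illustration, not part of any monitoring product.

```python
from dataclasses import dataclass


@dataclass
class LinkSample:
    """One polling result for a link (hypothetical structure for illustration)."""
    utilization_pct: float      # bandwidth utilization, 0-100
    latency_ms: float           # measured round-trip latency
    baseline_latency_ms: float  # long-term baseline for this path
    packet_loss_pct: float      # packet loss, 0-100


def evaluate_alerts(sample: LinkSample) -> list[str]:
    """Return the names of the workflow thresholds this sample breaches."""
    alerts = []
    if sample.utilization_pct > 80:
        alerts.append("bandwidth")
    if sample.latency_ms > 2 * sample.baseline_latency_ms:
        alerts.append("latency")
    if sample.packet_loss_pct > 1:
        alerts.append("packet_loss")
    return alerts
```

The comparisons are strict, so a sample sitting exactly on a threshold does not alert; a real deployment would probably also require the breach to persist across several polls.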
Network Device Monitoring
NETWORK DEVICE MONITORING MATRIX
=================================
Routers (Edge/Core):
Metrics to collect:
CPU utilization: Alert > 75%, Critical > 90%
Memory utilization: Alert > 80%, Critical > 95%
Interface utilization: Alert > 70%, Critical > 85% (per interface)
Route table size: Alert if growing > 10% weekly (potential route leak)
Uptime: Track; alert on reboot
BGP sessions: Alert on session flap/down (peer count vs expected)
NTP sync status: Alert if unsynchronized (critical for logging/analysis)
SNMP OIDs commonly monitored:
1.3.6.1.2.1.1.3.0 — sysUpTime
1.3.6.1.2.1.2.2.1.10 — ifInOctets (bytes in)
1.3.6.1.2.1.2.2.1.16 — ifOutOctets (bytes out)
1.3.6.1.2.1.2.2.1.13 — ifInDiscards
1.3.6.1.2.1.2.2.1.19 — ifOutDiscards
1.3.6.1.2.1.2.2.1.14 — ifInErrors
1.3.6.1.2.1.2.2.1.20 — ifOutErrors
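Interface utilization is derived from the difference between two polls of an octet counter such as ifInOctets. The sketch below assumes the 32-bit counters listed above and at most one counter wrap between polls; the function name and parameters are illustrative.

```python
COUNTER32_MAX = 2**32  # ifInOctets/ifOutOctets are 32-bit counters that wrap


def utilization_pct(prev_octets, curr_octets, interval_s, if_speed_bps):
    """Percent utilization between two polls of an octet counter."""
    delta = curr_octets - prev_octets
    if delta < 0:
        # Assumption: the counter wrapped exactly once between polls
        delta += COUNTER32_MAX
    bits_per_second = delta * 8 / interval_s
    return 100.0 * bits_per_second / if_speed_bps
```

On fast links a 32-bit counter can wrap more than once within a polling interval, which this sketch cannot detect; the 64-bit ifHCInOctets/ifHCOutOctets counters from the IF-MIB ifXTable avoid that problem.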
Switches (Access/Distribution/Aggregation):
Metrics to collect:
Port utilization: Alert > 70% per port
Port errors: Alert on any CRC errors (indicates physical issue)
Port flaps: Alert on > 3 state changes per hour
Spanning Tree: Alert on topology change (TCN)
VLAN status: Alert on unexpected VLAN changes
ARP table size: Alert if growing abnormally (ARP scan or DoS)
PoE budget: Alert > 80% (if PoE switches for phones/APs)
Power supply status: Alert on PSU failure (redundancy lost)
Port monitoring best practices:
- Monitor every access port (not just uplinks)
- Alert on access ports with sustained utilization > 50% (potential loop or misconfig)
- Track port speed/duplex mismatches (half-duplex on Gigabit = performance issue)
- Monitor storm control events (broadcast, multicast, unicast storms)
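The port flap threshold in the switch metrics (more than 3 state changes per hour) can be checked with a sliding window over link state change events. A minimal sketch, assuming events arrive as `(port, timestamp_seconds)` pairs from traps or syslog; the function name and event shape are hypothetical.

```python
def flapping_ports(events, now, window_s=3600, max_changes=3):
    """Return ports with more than max_changes state changes in the window.

    events: iterable of (port, timestamp_s) link state changes.
    """
    counts = {}
    for port, ts in events:
        if now - window_s < ts <= now:  # keep only events inside the window
            counts[port] = counts.get(port, 0) + 1
    return sorted(p for p, c in counts.items() if c > max_changes)
```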
Firewalls:
Metrics to collect:
Session count: Alert > 80% of max sessions (capacity approaching)
Throughput: Alert > 70% of rated throughput (includes inspection)
CPU/Memory: Alert > 75%
Rule hit counts: Track top 10 rules; identify unused rules quarterly
Threat detection events: Alert on any high-severity threats
VPN tunnel status: Alert on tunnel down (site-to-site, remote access)
SSL inspection load: Alert if SSL inspection causing > 200ms latency
License expiration: Alert 90, 60, 30 days before expiry
Firewall performance note:
- Throughput varies significantly by feature set enabled
- Example (illustrative figures; confirm against the vendor datasheet): Palo Alto PA-3260 rated at 5 Gbps throughput
* Without features: 5 Gbps
* With IDS + URL filtering: 2.5 Gbps
* With full inspection: 1 Gbps
- Monitor actual throughput vs. rated throughput with current feature set
Load Balancers (F5, HAProxy, Nginx, ALB/NLB):
Metrics to collect:
Active connections: Alert > 70% of max
Connection rate: Track new connections per second
Response time: Alert > baseline × 2
Backend health: Alert on any backend server marked down
SSL termination: Track handshake rate, certificate expiry
Pool member utilization: Alert on uneven distribution (> 30% variance)
Throughput: Alert > 80% of capacity
Load balancing health checks:
- HTTP/HTTPS health check every 10 seconds
- TCP health check every 5 seconds
- Failure threshold: 3 consecutive failures → mark down
- Recovery threshold: 2 consecutive successes → mark up
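The failure and recovery thresholds above amount to a small state machine with hysteresis. The following is a sketch of that logic only, not how any particular load balancer implements it; the class name is hypothetical.

```python
class BackendHealth:
    """Tracks up/down state using consecutive-result thresholds."""

    def __init__(self, fail_threshold=3, recover_threshold=2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.is_up = True
        self._streak = 0  # consecutive results contradicting the current state

    def record(self, check_passed: bool) -> bool:
        """Record one health check result and return the resulting state."""
        if check_passed == self.is_up:
            self._streak = 0  # result agrees with the current state
            return self.is_up
        self._streak += 1
        needed = self.fail_threshold if self.is_up else self.recover_threshold
        if self._streak >= needed:
            self.is_up = not self.is_up
            self._streak = 0
        return self.is_up
```

Requiring consecutive results in both directions keeps a single dropped probe from removing a healthy backend and a single lucky probe from restoring a failing one.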
Bandwidth and Traffic Analysis
BANDWIDTH MONITORING FRAMEWORK
================================
WAN Link Monitoring:
Typical enterprise WAN configuration:
Primary ISP: 1 Gbps dedicated (cost: $1,000–$5,000/month)
Secondary ISP: 500 Mbps dedicated (cost: $500–$2,500/month)
Backup (4G/5G): 100 Mbps (cost: $100–$300/month)
Internet breakout: Direct or via cloud (AWS Direct Connect, Azure ExpressRoute)
Monitoring per WAN link:
- Inbound bandwidth (Mbps): 5-minute average, peak, 95th percentile
- Outbound bandwidth (Mbps): 5-minute average, peak, 95th percentile
- Utilization percentage: vs. provisioned capacity
- Packet loss: < 0.1% acceptable, > 1% requires investigation
- Latency: < 20ms domestic, < 100ms international (typical)
- Jitter: < 10ms for VoIP/video, < 30ms general traffic
Utilization thresholds:
🟢 Normal: < 60% average (headroom for spikes)
🟡 Warning: 60–75% average (monitor for growth)
🔴 Critical: 75–85% average (plan upgrade)
⚫ Emergency: > 85% (congestion likely, upgrade immediately)
95th percentile billing:
- Many ISPs bill on 95th percentile usage
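- The top 5% of samples (typically 5-minute averages, about 36 hours in a 30-day month) are discarded before billing
- Monitor to optimize billing
- Example: 1 Gbps link, 95th percentile = 400 Mbps
* Billable: 400 Mbps (not 1 Gbps)
* Short bursts above 400 Mbps (for example to 600 Mbps) do not raise the bill as long as they stay within the discarded top 5% of samples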
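A 95th percentile figure can be computed from the polled samples with the nearest-rank method: sort the samples, discard the top 5%, and take the highest remaining value. This is a sketch under that assumption; ISPs differ in sample interval and in whether inbound and outbound are combined, so the function is illustrative rather than a billing formula.

```python
def percentile_95(samples_mbps):
    """Nearest-rank 95th percentile of bandwidth samples."""
    if not samples_mbps:
        raise ValueError("no samples")
    ordered = sorted(samples_mbps)
    # ceil(0.95 * n) using integer arithmetic to avoid float rounding
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

For example, with 20 samples where 19 are 100 Mbps and one is a 900 Mbps burst, the burst falls in the discarded top 5% and the result is 100.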
LAN/WLAN Monitoring:
Switched LAN:
- Uplink utilization: monitor all distribution-to-core uplinks
- Access switch port utilization: identify bandwidth-heavy segments
- Inter-VLAN routing: monitor routing throughput on L3 switches
- Broadcast traffic: alert if > 5% of total traffic (indicates problem)
Wireless LAN (Wi-Fi):
- AP utilization: CPU, memory, associated client count per AP
- Client count per AP: target < 30 clients/AP (performance degrades above)
- Signal strength (RSSI): target > -67 dBm for good connectivity
- Channel utilization: < 70% per channel (avoid co-channel interference)
- Client throughput: average per client (typical: 50–200 Mbps on Wi-Fi 6)
- Roaming success rate: > 95% (client moves between APs seamlessly)
NetFlow/sFlow/IPFIX Analysis:
What NetFlow shows:
- Top talkers (source/destination IPs generating most traffic)
- Top protocols (TCP, UDP, ICMP, application protocols)
- Top port pairs (which services are most used)
- Geographic distribution of traffic
- Application identification (with deep packet inspection)
NetFlow deployment:
- Enable on all core/distribution router and switch interfaces
- Export to collector (SolarWinds NTA, Plixer, Kentik, Elastic)
- Sampling rate: 1:100 on 10Gbps links, 1:10 on 1Gbps links, 1:1 on 100Mbps
- Retention: 30 days detailed, 1 year aggregated (monthly summaries)
Traffic analysis use cases:
1. Capacity planning: identify growth trends, forecast when to upgrade links
2. Security analysis: detect DDoS, data exfiltration, lateral movement
3. Application performance: identify bandwidth-heavy applications
4. Cost optimization: right-size WAN links based on actual usage
5. Compliance: monitor sensitive data transfers
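When flows are exported with packet sampling, byte counts need to be scaled by the sampling rate before ranking top talkers. A minimal sketch, assuming flow records have already been parsed into `(src_ip, sampled_bytes)` tuples; that record shape and the function name are hypothetical.

```python
from collections import Counter


def top_talkers(flows, sampling_rate, n=10):
    """Estimate bytes per source IP from sampled flow records.

    flows: iterable of (src_ip, sampled_bytes) tuples.
    sampling_rate: N for 1:N packet sampling; sampled bytes are scaled by N.
    """
    totals = Counter()
    for src_ip, sampled_bytes in flows:
        totals[src_ip] += sampled_bytes * sampling_rate
    return totals.most_common(n)
```

Scaling gives an estimate, not an exact count: at 1:100 sampling, short or low-volume flows may be missed entirely.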
Network Latency and Diagnostics
LATENCY MONITORING AND DIAGNOSTICS
====================================
Latency measurement methods:
1. Ping (ICMP Echo):
- Measure round-trip time (RTT) to key destinations
- Frequency: every 30 seconds for critical paths, every 5 minutes for others
- Alert: RTT > 2× baseline, or > 3 consecutive timeouts
- Targets:
* Intra-datacenter: < 1ms
* Intra-region (same city): < 5ms
* Inter-region (same country): < 20ms
* Intercontinental: < 100ms
* Internet (average): < 50ms
2. Traceroute:
- Map path to destination (hop-by-hop)
- Run on-demand during incident investigation
- Identify routing anomalies, asymmetric routing, black holes
- Automated: daily to top 20 destinations (track route changes)
3. Synthetic HTTP/HTTPS Testing:
- Simulate user requests to critical web applications
- Measure DNS resolution time, TCP handshake, SSL handshake, TTFB
- Frequency: every 60 seconds for critical services
- Tools: ThousandEyes, Catchpoint
4. Application-Level Latency:
- API endpoint response time monitoring
- Database query latency from application servers
- Service-to-service communication latency (microservices)
- End-user experience metrics (RUM — Real User Monitoring)
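The ping alert rule from method 1 (RTT > 2× baseline, or more than 3 consecutive timeouts) could be evaluated as sketched below. The input format, a list of recent RTTs in milliseconds with `None` for a timeout, is an assumption made for illustration.

```python
def ping_alert(rtts_ms, baseline_ms, max_timeouts=3):
    """Evaluate recent ping results, oldest first; None marks a timeout.

    Returns "timeout", "high_latency", or None.
    """
    consecutive_timeouts = 0
    for rtt in reversed(rtts_ms):  # count timeouts at the tail of the series
        if rtt is not None:
            break
        consecutive_timeouts += 1
    if consecutive_timeouts > max_timeouts:
        return "timeout"
    answered = [r for r in rtts_ms if r is not None]
    # Compare the most recent answered probe against twice the baseline
    if answered and answered[-1] > 2 * baseline_ms:
        return "high_latency"
    return None
```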
Latency baseline by environment:
Datacenter internal:
Same rack: 0.01–0.1ms
Same datacenter: 0.1–1ms
Same region (cloud): 0.5–2ms
Cross-region (cloud): 10–70ms (AWS us-east-1 to us-west-2 is typically around 60–70ms)
Internet:
Same city: 1–10ms
Same country: 10–40ms
Cross-continent: 50–200ms
Global (worst case): 100–300ms
Cloud to cloud:
AWS to Azure (same region): 10–30ms
AWS to GCP (same region): 5–20ms
AWS Direct Connect: 1–5ms (dedicated connection)
Diagnostic procedures for common network issues:
Issue: Intermittent connectivity
1. Continuous ping test (mtr to destination showing per-hop stats)
2. Check interface error counters (CRC errors, collisions)
3. Check for duplex mismatch (show interface | include Duplex)
4. Check for physical layer issues (cable, SFP module)
5. Check for STP blocking ports
6. Check for ACL/route changes coinciding with issue onset
Issue: High latency
1. Ping test from multiple points (isolate: local network, WAN, destination)
2. Check bandwidth utilization (is link saturated?)
3. Check routing path (traceroute, BGP path analysis)
4. Check for microbursts (port buffer overruns)
5. Check for MTU issues (path MTU discovery, fragmentation)
6. Check QoS policies (is traffic being deprioritized?)
Issue: Packet loss
1. Identify where loss occurs (per-hop ping/mtr)
2. Check interface discard/error counters
3. Check for congestion (buffer overflows, queue drops)
4. Check for MTU mismatches (oversized packets with the DF bit set are dropped instead of fragmented)
5. Check for security device drops (firewall, IPS)
6. Check for hardware issues (faulty SFP, cable)
Network Topology and Mapping
NETWORK TOPOLOGY MANAGEMENT
==============================
Topology documentation requirements:
Physical topology:
- Datacenter rack diagrams (device placement, cable connections)
- Floor plans with device locations
- Cable labels and patch panel mappings
- Power connections (PDU, circuit breakers)
- Tool: Visio, Lucidchart, Draw.io
Logical topology:
- Network diagram (layers: access, distribution, core)
- VLAN assignments and IP addressing scheme
- Routing protocols (OSPF, BGP, EIGRP) with area/autonomous system design
- Firewall zones and trust relationships
- DNS infrastructure (primary, secondary, conditional forwarding)
- DHCP scope assignments
- Tool: SolarWinds Network Topology Mapper, PRTG Network Monitor
Cloud network topology:
- VPC/VNet architecture (subnets, route tables, NAT gateways)
- Transit gateway / virtual network gateway connections
- Direct Connect / ExpressRoute circuits
- Cloud firewall rules and security groups
- Load balancer architecture
- Tool: AWS Network Manager, Azure Network Watcher, cloud provider consoles
Auto-discovery protocols:
CDP (Cisco Discovery Protocol) — Cisco devices only
LLDP (Link Layer Discovery Protocol) — Vendor-neutral, recommended
SNMP (v2c/v3) — Device info, interface status, routing table
ARP and MAC address table analysis — IP-to-MAC and MAC-to-port mapping for Layer 2 connectivity
Topology change management:
- Daily topology scans (detect new devices, link changes)
- Alert on unauthorized device connections
- Change approval workflow for network modifications
- Before/after documentation for every change
- Version control for network diagrams
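The daily topology scan reduces to comparing two sets of discovered links. The sketch below assumes each scan has already been normalized into `(device_a, port_a, device_b, port_b)` tuples, for instance from LLDP neighbor tables; the tuple shape and function name are hypothetical.

```python
def diff_topology(previous_links, current_links):
    """Compare two scans; each link is a (device_a, port_a, device_b, port_b) tuple."""
    prev, curr = set(previous_links), set(current_links)

    def devices(links):
        # Collect every device name that appears at either end of a link
        return {d for a, _, b, _ in links for d in (a, b)}

    return {
        "added_links": sorted(curr - prev),
        "removed_links": sorted(prev - curr),
        "new_devices": sorted(devices(curr) - devices(prev)),
    }
```

Entries under `new_devices` are candidates for the unauthorized-device alert; matching them against an approved inventory is left out of this sketch.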
Network Capacity Planning
NETWORK CAPACITY PLANNING FRAMEWORK
=====================================
Capacity planning methodology:
Step 1: Current baseline (monthly assessment)
- Average bandwidth utilization per link: [X]%
- Peak bandwidth utilization per link: [Y]% (95th percentile)
- Growth rate: [Z]% per quarter
- Current headroom: [100 - Y]%
Step 2: Traffic growth projection
- Historical growth: average [X]% per quarter over past 2 years
- Business-driven growth: [Y]% from new initiatives (M&A, new services)
- Technology-driven growth: [Z]% from 4K video, IoT, etc.
- Combined projected growth: [X + Y + Z]% per quarter
Step 3: Capacity threshold analysis
- Current capacity: [X] Mbps/Gbps
- 80% utilization threshold: [X × 0.8] (trigger for planning upgrade)
- Time to 80% threshold: [N] quarters at current growth rate
- Recommended upgrade timeline: [N+1] quarters (plan before needed)
Step 4: Upgrade options analysis
Option A: Increase existing link capacity
- Current: 1 Gbps → Upgrade: 10 Gbps
- Cost: $3,000–$15,000/month (depends on ISP, location)
- Lead time: 2–8 weeks
- Minimal disruption (non-disruptive upgrade possible)
Option B: Add redundant link
- Add second 1 Gbps link from different ISP
- Cost: $1,000–$5,000/month
- Provides redundancy + 100% capacity increase
- Load balance or active/standby
Option C: SD-WAN optimization
- Combine multiple lower-cost links (MPLS + broadband + 4G/5G)
- Application-aware routing (critical apps over best path)
- Cost savings: 30–50% vs. pure MPLS
- Complexity increase: moderate (requires SD-WAN vendor: VMware, Fortinet, Cisco Viptela)
Example calculation:
Current: 1 Gbps link, average 45%, peak 75% (95th percentile)
Growth: 15% per quarter
Quarter 1: 45% → 51.75% (Q1 peak: 86.25%) — WARNING, peak exceeds 80% threshold
Quarter 2: 51.75% → 59.5% (Q2 peak: 99.2%) — CRITICAL, will congest
Quarter 3: 59.5% → 68.4% (Q3 peak: 114.1%) — OVER CAPACITY
Recommendation: Upgrade to 10 Gbps within 1 quarter
Cost-benefit: $5,000/month additional cost vs. $500K/hour revenue impact of outage
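The projection in this example is compound growth: utilization after q quarters is current × (1 + growth)^q, and the time to a threshold follows from solving that for q. A minimal sketch with hypothetical function names, assuming the growth rate stays constant:

```python
import math


def quarters_until(current_pct, growth_per_quarter, threshold_pct=80.0):
    """Quarters of compound growth until utilization reaches the threshold."""
    if growth_per_quarter <= 0:
        raise ValueError("growth rate must be positive")
    if current_pct >= threshold_pct:
        return 0
    return math.ceil(
        math.log(threshold_pct / current_pct) / math.log(1 + growth_per_quarter)
    )


def project(current_pct, growth_per_quarter, quarters):
    """Utilization after each quarter, rounded to two decimals."""
    return [
        round(current_pct * (1 + growth_per_quarter) ** q, 2)
        for q in range(1, quarters + 1)
    ]
```

With the example inputs, `quarters_until(75, 0.15)` returns 1 for the peak and `quarters_until(45, 0.15)` returns 5 for the average, which is why the peak figure drives the upgrade recommendation.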
Monitoring indicators for capacity planning:
🟢 Healthy:
- Average utilization < 50%
- Peak (95th) utilization < 70%
- Growth rate < 10% per quarter
🟡 Plan Ahead:
- Average utilization 50–65%
- Peak utilization 70–80%
- Growth rate 10–20% per quarter
🔴 Act Now:
- Average utilization > 65%
- Peak utilization > 80%
- Growth rate > 20% per quarter
- Packet loss > 0.1% on any link
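The three indicator bands can be turned into a single status per link. The sketch below assumes the worst matching band wins, which the text does not state explicitly; the function name and return values are hypothetical.

```python
def capacity_status(avg_pct, peak_pct, growth_pct, loss_pct=0.0):
    """Map link metrics to an indicator band; the worst matching band wins."""
    if avg_pct > 65 or peak_pct > 80 or growth_pct > 20 or loss_pct > 0.1:
        return "act_now"
    if avg_pct >= 50 or peak_pct >= 70 or growth_pct >= 10:
        return "plan_ahead"
    return "healthy"
```

For the link in the example calculation (average 45%, peak 75%, growth 15% per quarter) this returns `"plan_ahead"`.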
Network Performance Dashboards
NETWORK PERFORMANCE DASHBOARD LAYOUT
======================================
Dashboard 1: Executive Overview (for IT Director/VP)
- Network uptime: [99.9X]% (last 30 days)
- Total bandwidth: [X] Gbps provisioned, [Y] Gbps average utilized
- Active alerts: [number] (🟢 < 3, 🟡 3–10, 🔴 > 10)
- Top issues this week: [list]
- SLA compliance: [X]% (target: 99.9%)
- Monthly trend: bandwidth usage, incidents, SLA
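The uptime and SLA figures on this dashboard relate to downtime by simple arithmetic. As a sketch, with hypothetical function names and a 30-day period assumed by default:

```python
def allowed_downtime_minutes(sla_pct, period_days=30):
    """Downtime budget for an availability target over a period."""
    return period_days * 24 * 60 * (1 - sla_pct / 100)


def uptime_pct(downtime_minutes, period_days=30):
    """Measured availability given total downtime in the period."""
    total = period_days * 24 * 60
    return 100 * (total - downtime_minutes) / total
```

A 99.9% target over 30 days leaves a budget of about 43.2 minutes of downtime.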
Dashboard 2: Real-Time Network Health (for Network Team)
- WAN links: real-time bandwidth per link (bar chart, color-coded by utilization)
- Core router/switch health: CPU, memory, uptime (table)
- Active BGP sessions: [X/Y] up
- Firewall session count: [X] of [max]
- Active threats: [number] blocked today
- Top 10 talkers: bandwidth by source IP
Dashboard 3: Latency and Connectivity (for Operations)
- Latency to critical destinations (time series, 24-hour view)
- Packet loss by link (heatmap)
- DNS resolution time (average, p95, p99)
- VPN tunnel status (up/down, throughput)
- Wireless AP health (connected clients, signal quality)
Dashboard 4: Capacity and Trends (for Planning)
- Bandwidth utilization trends (30/60/90-day views)
- Growth rate by link (quarter-over-quarter)
- Projected capacity exhaustion dates
- 95th percentile utilization by link (billing optimization)
- Peak vs. average utilization ratio (right-sizing indicator)
Integration Points
- SolarWinds Network Performance Monitor (NPM): Comprehensive network monitoring; auto-discovery, topology mapping, alerting, reporting; SNMP-based
- PRTG Network Monitor: All-in-one monitoring; packet sniffing, flow monitoring, wireless monitoring; flexible sensor-based pricing
- SolarWinds Network Traffic Analyzer (NTA): NetFlow/sFlow analysis; top talkers, application visibility, bandwidth hogs
- Kentik: Cloud-based network performance monitoring; global visibility; real-time traffic analysis; DDoS detection
- ThousandEyes: Internet and SaaS performance monitoring; cloud, enterprise, and endpoint agents; global vantage points; user-experience focused
- Wireshark: Packet analysis (on-demand, not continuous); deep protocol analysis; troubleshooting complex issues
- Cisco DNA Center / Meraki Dashboard: Cisco-specific network management; AI-driven insights; automated remediation
- Palo Alto Networks PAN-OS / FortiGate GUI: Firewall-specific monitoring; threat intelligence; traffic analysis
- Datadog Network Performance Monitoring: Unified network monitoring integrated with application and infrastructure metrics
- Smokeping / MRTG: Open-source latency and bandwidth monitoring; simple, reliable
Edge Cases
- Microbursts (sub-second traffic spikes that average monitoring misses): Standard 5-minute polling misses sub-second bursts; use port-level buffer and queue-drop counters that record microburst overflows; enable streaming telemetry or vendor microburst detection for sub-second granularity; consider hardware solutions: deep-buffer switches, ETS (Enhanced Transmission Selection) priority queuing
- Impact: intermittent packet loss, TCP retransmissions, application timeouts
- Detection: check interface overrun/discard counters (not SNMP averages)
- Mitigation: increase switch buffer sizes, enable flow-based rate limiting, QoS prioritization
- Asymmetric routing (traffic takes different paths in each direction): Causes stateful firewall issues, load balancer problems, monitoring confusion; detect by comparing traceroute in both directions; fix by ensuring consistent routing (equal-cost multi-path hashing, route policy)
- Detection: traceroute from A→B differs from B→A; firewall drops return traffic
- Fix: implement ECMP with consistent hash; ensure firewall policy allows return traffic from any path; verify BGP route preference
- Cloud network performance variability (shared infrastructure in AWS/Azure/GCP): Cloud networking is shared tenancy, so performance varies by instance type, region load, and time of day; use enhanced networking (AWS ENA, Azure Accelerated Networking); avoid cold-start latency with pre-warmed connections; test performance baselines per region and instance type
- AWS: Nitro-based instances support enhanced networking via ENA; smaller instance sizes advertise "up to" bandwidth that is burstable rather than sustained
- Azure: Enable Accelerated Networking (SR-IOV) on supported VM sizes for consistent performance
- GCP: Per-VM egress bandwidth depends on machine type and vCPU count; choose machine types sized for the required bandwidth
- IoT/edge device network monitoring (thousands of small devices): Standard SNMP monitoring doesn't scale to 10,000+ IoT devices; use lightweight protocols (CoAP, MQTT); implement edge aggregation gateways; monitor at gateway level with periodic device heartbeat; use time-series databases optimized for high cardinality
- Gateway approach: 1 gateway per 100–500 IoT devices; gateway reports aggregate metrics
- Heartbeat: devices send heartbeat every 5–60 minutes (battery life tradeoff)
- Anomaly detection: ML models trained on normal IoT device behavior
- Multi-tenant network monitoring (hosting/SaaS environments): Isolate monitoring data per tenant (privacy requirement); implement per-tenant bandwidth quotas and monitoring; alert on tenant-level anomalies without exposing cross-tenant data; use network segmentation (VLANs, VRFs) for isolation
- VRF (Virtual Routing and Forwarding): separate routing tables per tenant
- Per-tenant NetFlow collection and analysis
- Bandwidth policing: rate limit per tenant to prevent noisy neighbor impact
- Network monitoring during maintenance windows (expected changes cause false alerts): Suppress alerts during approved maintenance windows; document expected changes and metric impact; verify post-maintenance baseline; auto-resolve alerts opened during maintenance if metrics return to normal
- Maintenance window registration: 24-hour advance notice in monitoring system
- Alert suppression: disable specific alerts for specific devices during window
- Post-maintenance verification: automated check that all metrics returned to baseline
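Alert suppression during a registered maintenance window can be as simple as a device and time-range match. A minimal sketch, assuming windows are stored as `(device, start, end)` tuples; that structure and the function name are hypothetical and not tied to any monitoring system's API.

```python
from datetime import datetime


def is_suppressed(alert_device, alert_time, windows):
    """True if the alert falls inside an approved window for that device.

    windows: list of (device, start, end) tuples with datetime bounds.
    """
    return any(
        device == alert_device and start <= alert_time <= end
        for device, start, end in windows
    )
```

Matching on the device keeps the suppression scoped: an alert from a different device during the same window still fires.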