---
name: server-room-datacenter-ops
description: Manage physical server rooms and data center operations including environmental monitoring, power management, rack management, physical security, equipment lifecycle, and capacity planning. Use when managing data center infrastructure, monitoring environmental conditions, planning rack capacity, conducting facility maintenance, managing power redundancy, handling hardware failures, or planning data center expansion. Triggers on phrases like "data center operations", "server room", "rack management", "data center capacity", "UPS", "cooling", "PDU", "facility management", "hot aisle cold aisle", "power distribution".
---

# Server Room & Data Center Operations

Manage physical data center infrastructure for reliability, efficiency, and security.

## Workflow

1. Establish data center inventory: all racks, servers, network devices, power, and cooling equipment.
2. Deploy environmental monitoring: temperature, humidity, water detection, smoke/fire sensors.
3. Configure power monitoring: UPS status, PDU utilization, circuit breaker status, generator fuel.
4. Implement physical security: access control, video surveillance, visitor management, audit trails.
5. Create rack diagrams and capacity maps: U-space utilization, power draw, weight distribution.
6. Establish preventive maintenance schedule: server firmware updates, filter changes, UPS battery testing.
7. Monitor data center KPIs: PUE (Power Usage Effectiveness), power utilization, cooling efficiency.
8. Conduct quarterly capacity planning: power headroom, cooling headroom, space availability.
9. Maintain disaster recovery procedures: fire response, power failure, environmental events.
10. Document all physical changes: equipment moves, adds, changes, and disconnects (MACD).

## Data Center Infrastructure

```
DATA CENTER TIERS AND SPECIFICATIONS
======================================

Uptime Institute Tier Classification:

  Tier I (Basic Capacity):
    - Single path for distribution and cooling
    - No redundant components
    - Availability: 99.671% (28.8 hours downtime/year)
    - Typical cost: $300–$500 per square foot
    - Suitable for: small businesses, non-critical workloads

  Tier II (Redundant Components):
    - Redundant cooling and power components
    - Single path for distribution
    - Availability: 99.741% (22.7 hours downtime/year)
    - Typical cost: $400–$600 per square foot
    - Suitable for: medium businesses, some redundancy needed

  Tier III (Concurrent Maintainability):
    - Multiple paths for distribution and cooling
    - Redundant components
    - Can be maintained without affecting operations
    - Availability: 99.982% (1.6 hours downtime/year)
    - Typical cost: $500–$800 per square foot
    - Suitable for: enterprises, critical business operations

  Tier IV (Fault Tolerant):
    - Multiple independent distribution paths
    - Fully redundant systems (2N or 2N+1)
    - Can survive any single event without impact
    - Availability: 99.995% (0.4 hours downtime/year)
    - Typical cost: $700–$1,200 per square foot
    - Suitable for: financial institutions, healthcare, government

Typical enterprise data center specifications:

  Facility:
    - Total area: 1,000–50,000 sq ft
    - Raised floor height: 12–24 inches (underfloor cooling)
    - Floor loading: 250–500 lbs/sq ft (support heavy equipment)
    - Ceiling height: 14–20 feet (hot aisle containment)

  Power:
    - Utility feeds: dual feeds from separate substations (Tier III+)
    - UPS: N+1 or 2N redundancy; 15-minute to 2-hour battery runtime
    - Generators: diesel, N+1 redundancy; 48-hour fuel supply on-site
    - Automatic transfer switch: < 10 seconds transfer time
    - PDU: A and B feeds per rack (redundant power)

  Cooling:
    - CRAC/CRAH units: N+1 redundancy
    - Hot aisle / cold aisle containment
    - Supply temperature: 64–81°F (18–27°C) ASHRAE recommended
    - Return temperature: 73–87°F (23–31°C)
    - Humidity: 20–80% RH (dew point 55°F / 13°C max)
    - In-row cooling for high-density racks (> 10kW per rack)

  Security:
    - Perimeter: fencing, bollards, vehicle access control
    - Entry: badge readers (2-factor: badge + PIN/biometric)
    - Mantrap / access control vestibule
    - Video surveillance: 24/7, 90+ day retention
    - Environmental: water leak detection, smoke detection, fire suppression
```
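
The availability percentages in the tier list translate directly into annual downtime. The following is a minimal sketch of that arithmetic, assuming a 365-day year; the function name `downtime_hours_per_year` is illustrative and not part of any standard.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years are ignored in this sketch


def downtime_hours_per_year(availability_pct: float) -> float:
    """Return expected annual downtime in hours for an availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (100.0 - availability_pct) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    # Availability figures taken from the tier list above.
    for tier, pct in [("I", 99.671), ("II", 99.741), ("III", 99.982), ("IV", 99.995)]:
        print(f"Tier {tier}: {pct}% -> {downtime_hours_per_year(pct):.1f} h/year")
```

The same function can be used to sanity-check a colocation SLA by passing the contracted uptime percentage.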

## Environmental Monitoring

```
ENVIRONMENTAL MONITORING FRAMEWORK
====================================

Temperature monitoring:

  Sensors deployed:
    - Per rack: inlet (cold) and outlet (hot) temperature sensors
    - Per row: overhead thermal sensors (hot aisle, cold aisle)
    - Floor level: underfloor temperature (supply air)
    - Return air: CRAC/CRAH unit return temperature
    - External: outside air temperature (for economizer reference)

  Temperature thresholds (server inlet ranges follow ASHRAE TC 9.9 thermal guidelines):

    Location              Acceptable       Recommended       Alert        Critical
    ──────────────────    ─────────────    ─────────────     ──────────   ──────────
    Server inlet          41–95°F         64–81°F           > 81°F       > 95°F
    (supply air)          (5–35°C)        (18–27°C)         (27°C)       (35°C)

    Hot aisle             41–130°F        95–122°F          > 122°F      > 130°F
    (return air)          (5–54°C)        (35–50°C)         (50°C)       (54°C)

    Underfloor            41–104°F        64–86°F           > 86°F       > 104°F
    (plenum)              (5–40°C)        (18–30°C)         (30°C)       (40°C)

  Alert actions:
    Warning (> 81°F): notify facilities team; check CRAC units; investigate hot spots
    Critical (> 95°F): page on-call; potential equipment shutdown risk; activate backup cooling
    Emergency (> 113°F): automatic equipment shutdown (if configured); evacuate if needed

Humidity monitoring:

  Thresholds:
    - Recommended: 20–80% RH
    - Absolute maximum dew point: 55°F (13°C)
    - Alert low: < 20% RH (static electricity risk — ESD can damage components)
    - Alert high: > 80% RH (condensation risk — water damage to electronics)

  Actions:
    Low humidity: activate humidifiers; check for AC issues
    High humidity: activate dehumidifiers; check for water leaks; verify CRAC operation

Water leak detection:

  Sensors:
    - Underfloor detection cables (perimeter and under CRAC units)
    - Drip pans under CRAC/CRAH units with float switches
    - Floor drains with level sensors
    - Pipe leak detection along chilled water lines

  Alert: immediate page to facilities team on ANY water detection
  Response: identify source, activate containment, move critical equipment if needed

Airflow monitoring:

  Metrics:
    - Underfloor static pressure: 0.2–0.5 in. w.c. (too low = poor cooling; too high = door issues)
    - Airflow velocity through rack inlets: minimum 100 FPM (feet per minute)
    - Containment integrity: door seals, blank panels in empty U-space
    - Hot/cold aisle temperature differential: at least 20°F

  Issues:
    - Missing blank panels: hot air recirculates into cold aisle
    - Blocked airflow: cables obstructing front/rear of server
    - Hot spots: high-density racks without adequate cooling
    - Bypass airflow (short-circuiting): supply air returns to the cooling unit without passing through servers
```
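
The alert thresholds above map naturally onto a small classification routine. The sketch below assumes readings arrive in °F and % RH; the function names and the returned level strings are hypothetical, and a real deployment would take these readings from whatever sensor platform is in use.

```python
def classify_inlet_temp_f(temp_f: float) -> str:
    """Map a server inlet temperature (°F) to the alert levels listed above."""
    if temp_f > 113:
        return "emergency"  # automatic shutdown range
    if temp_f > 95:
        return "critical"   # page on-call, activate backup cooling
    if temp_f > 81:
        return "warning"    # notify facilities, check CRAC units
    return "ok"


def classify_humidity(rh_pct: float, dew_point_f: float) -> str:
    """Map relative humidity and dew point to the humidity alerts listed above."""
    if rh_pct < 20:
        return "alert_low"   # static electricity (ESD) risk
    if rh_pct > 80 or dew_point_f > 55:
        return "alert_high"  # condensation risk
    return "ok"


if __name__ == "__main__":
    print(classify_inlet_temp_f(88.0))
    print(classify_humidity(45.0, 50.0))
```

Thresholds use strict comparisons, so a reading of exactly 81°F is still treated as within range, matching the "> 81°F" wording in the table.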

## Power Management

```
POWER MANAGEMENT FRAMEWORK
============================

Power chain (dual redundant):

  Utility A → Transformer A → UPS A → PDU A → Server PSU A
  Utility B → Transformer B → UPS B → PDU B → Server PSU B

  Each component provides a layer of power protection:
    Utility: dual feeds from separate substations
    Transformer: steps down voltage (13.8kV → 480V)
    UPS: battery backup (15 minutes–2 hours runtime)
    PDU: power distribution to individual outlets in rack
    PSU: server power supply unit (redundant PSUs in enterprise servers)

Power monitoring per rack:

  Measured:
    Feed A (PDU A):       [X] A, [Y] W, [Z] VA
    Feed B (PDU B):       [X] A, [Y] W, [Z] VA
    Total rack power:      [X] kW
    Power factor:          [0.9–1.0] (target > 0.95)
    Circuit utilization:   [X]% of rated capacity

  Typical rack power specifications:
    Standard rack:         20–30A per PDU @ 120V = 2.4–3.6 kW per feed
    High-density rack:     40–50A per PDU @ 208V = 8.3–10.4 kW per feed
    GPU/AI rack:           15–25 kW per rack (requires special cooling and power)

  Alert thresholds:
    - Single PDU > 80% capacity: WARNING (plan load balancing)
    - Single PDU > 90% capacity: CRITICAL (risk of breaker trip)
    - A/B feed imbalance > 20%: WARNING (uneven load distribution)
    - Power factor < 0.9: WARNING (inefficient, may incur utility penalties)

UPS monitoring:

  Key metrics:
    Battery charge:        Target > 90%; Alert < 80%; Critical < 50%
    Battery runtime:       Current estimate at current load
    UPS load %:            Target < 80%; Alert > 85%; Critical > 90%
    Input voltage:         Must be within acceptable range (100–130V or 200–264V)
    Output voltage:        Must be stable (120V ± 3% or 230V ± 3%)
    Bypass mode:           Alert if on bypass (UPS not protecting, utility directly feeding)
    Temperature:           UPS internal temperature (overheating reduces battery life)

  UPS maintenance:
    - Battery replacement: every 3–5 years (typical lifespan)
    - Battery testing: quarterly impedance/conductance test
    - Full load test: annual (discharge to 20% and recharge)
    - Firmware updates: as recommended by manufacturer
    - Capacitor replacement: every 7–10 years

Generator monitoring:

  Key metrics:
    Fuel level:            Target > 80%; Alert < 50%; Critical < 25%
    Oil pressure:          Must be within operating range
    Coolant temperature:   Must be within operating range
    Battery charge:        Generator start battery (separate from UPS)
    Runtime hours:         Track for maintenance scheduling
    Last test run:         Weekly 15-minute test; monthly 1-hour test

  Maintenance schedule:
    Weekly:  15-minute exercise run (keep engine lubricated)
    Monthly: 1-hour load test (verify capacity under load)
    Quarterly: oil change, filter replacement, full inspection
    Annually: complete overhaul per manufacturer recommendations
    Fuel:     test fuel quality quarterly (diesel degrades over time)
```
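
The per-rack PDU alert thresholds can be expressed as a short check. This is a sketch under stated assumptions: `FeedReading` and its fields are hypothetical names, power is computed as amps × volts (power factor near 1.0), and A/B imbalance is defined here as the difference in current relative to the heavier feed, which is one of several reasonable definitions.

```python
from dataclasses import dataclass


@dataclass
class FeedReading:
    """One PDU feed measurement; field names are illustrative."""
    amps: float          # measured current draw
    volts: float         # nominal circuit voltage
    breaker_amps: float  # rated capacity of the PDU or breaker

    @property
    def kw(self) -> float:
        # Amps x volts, treated as kW by assuming a power factor near 1.0.
        return self.amps * self.volts / 1000.0

    @property
    def utilization_pct(self) -> float:
        return self.amps * 100.0 / self.breaker_amps


def feed_alert(feed: FeedReading) -> str:
    """Apply the 80% warning and 90% critical thresholds from the list above."""
    if feed.utilization_pct > 90:
        return "critical"
    if feed.utilization_pct > 80:
        return "warning"
    return "ok"


def imbalance_pct(feed_a: FeedReading, feed_b: FeedReading) -> float:
    """Difference between feeds as a percentage of the heavier feed (assumed definition)."""
    heavier = max(feed_a.amps, feed_b.amps)
    if heavier == 0:
        return 0.0
    return abs(feed_a.amps - feed_b.amps) * 100.0 / heavier


if __name__ == "__main__":
    a = FeedReading(amps=25, volts=208, breaker_amps=30)
    b = FeedReading(amps=18, volts=208, breaker_amps=30)
    print(feed_alert(a), feed_alert(b), f"{imbalance_pct(a, b):.0f}% imbalance")
```

An imbalance result above 20 would correspond to the A/B imbalance warning in the threshold list.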

## Rack Management and Capacity

```
RACK MANAGEMENT FRAMEWORK
===========================

Standard rack specifications:

  42U rack (most common):
    - Total U-space: 42 rack units (1U = 1.75 inches = 44.45mm)
    - Available U-space: ~36–38 (after accounting for PDUs, switches, cable managers)
    - Width: 19 inches (standard) or 23 inches
    - Depth: 600mm, 800mm, 1000mm, 1200mm (deeper for GPU servers, storage arrays)
    - Weight capacity: 800–1,500 lbs (static), 500–800 lbs (dynamic)

Equipment placement best practices:

  Top of rack (U 42–35):
    - Patch panels (network, fiber)
    - Cable managers
    - PDUs (top-mount)

  Upper middle (U 34–25):
    - Network switches (distribution layer)
    - Firewalls/security appliances
    - Load balancers

  Lower middle (U 24–10):
    - Servers (mounted low to keep the rack's center of gravity low)
    - Place heavier items at bottom for stability

  Bottom (U 9–1):
    - Storage arrays (heavy)
    - Keep perforated floor tiles in front of the rack clear (raised-floor cooling)
    - Fit blank panels in unused bottom U-space so cold air does not bypass equipment

Rack capacity tracking:

  Per rack metrics:
    U-space utilized:      [X]/42 U ([Y]%)
    Power utilized (A):    [X]W / [Y]W ([Z]%)
    Power utilized (B):    [X]W / [Y]W ([Z]%)
    Weight:                [X] lbs / [Y] lbs capacity
    Heat output:           [X] kW (for cooling capacity planning)

  Data center capacity summary:

    Total racks:           [X]
    Racks with space:      [Y] (available U-space > 6U)
    Racks with power:      [Z] (available power > 1 kW)
    Racks fully utilized:  [W] (remaining space and power both < 10%)
    Total power capacity:  [X] kW
    Total power utilized:  [Y] kW ([Z]%)
    Total cooling capacity: [X] kW
    Total cooling utilized: [Y] kW ([Z]%)

Rack diagram (example):

    U42      [PDU-A] [PDU-B]
    U41      [Patch Panel]
    U40      [Cable Manager]
    U39      [24-Port Switch]
    U38      [24-Port Switch]
    U37      [Blank Panel]
    U36      [Firewall]
    U35      [Blank Panel]
    U34–U33  [2U Server]
    U32–U31  [2U Server]
    U30–U29  [2U Server]
    U28–U25  [4U Storage Array]
    U24–U23  [2U Server]
    U22–U21  [2U Server]
    U20–U19  [2U Server]
    U18–U17  [2U Server]
    U16–U15  [2U Server]
    U14–U13  [2U Server]
    U12–U11  [2U Server]
    U10–U9   [2U Server]
    U8–U7    [2U Server]
    U6–U5    [Cable Manager]
    U4–U3    [Blank Panel]
    U2–U1    [Blank Panel]
    (floor)
```
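
The capacity summary fields above can be derived from per-rack records. The following sketch assumes a simplified `Rack` record with a single power figure per rack (the A and B feeds are not modeled separately); all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    """Simplified rack record; one power figure stands in for both feeds."""
    name: str
    total_u: int
    used_u: int
    power_capacity_w: float
    power_used_w: float

    @property
    def free_u(self) -> int:
        return self.total_u - self.used_u

    @property
    def free_power_w(self) -> float:
        return self.power_capacity_w - self.power_used_w


def summarize(racks: list[Rack]) -> dict[str, int]:
    """Count racks per the summary rules: > 6U free, > 1 kW free, < 10% of both left."""
    summary = {"total": len(racks), "with_space": 0, "with_power": 0, "fully_utilized": 0}
    for rack in racks:
        if rack.free_u > 6:
            summary["with_space"] += 1
        if rack.free_power_w > 1000:
            summary["with_power"] += 1
        space_left = rack.free_u / rack.total_u
        power_left = rack.free_power_w / rack.power_capacity_w
        if space_left < 0.10 and power_left < 0.10:
            summary["fully_utilized"] += 1
    return summary


if __name__ == "__main__":
    racks = [
        Rack("A01", 42, 20, 5000, 2000),
        Rack("A02", 42, 40, 5000, 4800),
    ]
    print(summarize(racks))
```

A rack can have free U-space yet no usable power, which is why the two counts are tracked separately.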

## Physical Security

```
PHYSICAL SECURITY FRAMEWORK
=============================

Access control:

  Badge system:
    - Type: HID iCLASS, MIFARE DESFire EV2, or biometric (fingerprint/iris)
    - Credential levels:
      Level 1: General IT staff — server room access during business hours
      Level 2: Senior engineers — server room access 24/7
      Level 3: Facility managers — full facility access including mechanical
      Level 4: Executives — visitor escorted access only
    - Two-factor authentication at main entry: badge + PIN or badge + biometric
    - Mantrap (vestibule): only one person at a time; anti-tailgating detection

  Access logging:
    - All entry/exit events logged with: person, time, door, credential used
    - Retention: minimum 1 year (3–7 years for compliance)
    - Audit: review access logs monthly for anomalies
    - Alert: after-hours access, failed badge attempts (> 3), tailgating detection

Video surveillance:

  Camera coverage:
    - All entry/exit points
    - All rack rows (overhead cameras for full coverage)
    - Perimeter fencing
    - Loading dock / delivery area
    - Elevators serving data center floor
  Retention: 90 days minimum (180 days for compliance)
  Resolution: minimum 1080p; license plate readable at perimeter
  Storage: on-site NVR + cloud backup

Visitor management:

  Process:
    1. Pre-approval required (24 hours minimum advance notice)
    2. Visitor signs NDA and security agreement
    3. Temporary badge issued at front desk
    4. Escort required at all times within data center
    5. Badge collected and access revoked at departure
    6. Visit logged with: visitor name, company, purpose, escort, time in/out

  Visitor statistics tracking:
    - Total visitors per month: [X]
    - Average visit duration: [Y] hours
    - Violations (unescorted access): [Z] (target: 0)

Environmental security:

  Fire detection and suppression:
    - VESDA (Very Early Smoke Detection Apparatus): aspirating smoke detection that can detect smoke 30–60 min before conventional detectors
    - FM-200 or Novec 1230 gas suppression (no water damage)
    - Pre-action sprinkler system (water only after confirmation)
    - Fire alarm integration with building management system

  Water protection:
    - No water pipes above data center floor (design requirement)
    - CRAC/CRAH units with containment pans
    - Floor drains with backup pumps
    - Flood barriers at entry points
```
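
The access-log alert rules (after-hours access, more than three failed badge attempts) could be checked with a routine like the one below. This is only a sketch: the `AccessEvent` fields, the 08:00 to 18:00 business-hours window, and the alert message format are all assumptions, and weekends are not treated specially.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime

BUSINESS_START_HOUR = 8   # assumed start of business hours
BUSINESS_END_HOUR = 18    # assumed end of business hours
MAX_FAILED_ATTEMPTS = 3   # alert when a person exceeds this count


@dataclass
class AccessEvent:
    """One badge event; field names are illustrative."""
    person: str
    time: datetime
    door: str
    granted: bool


def find_alerts(events: list[AccessEvent]) -> list[str]:
    """Return alert messages for after-hours entries and repeated failed attempts."""
    alerts: list[str] = []
    failed: Counter[str] = Counter()
    for event in events:
        if not event.granted:
            failed[event.person] += 1
        elif not (BUSINESS_START_HOUR <= event.time.hour < BUSINESS_END_HOUR):
            alerts.append(f"after-hours access: {event.person} at {event.door}")
    for person, count in failed.items():
        if count > MAX_FAILED_ATTEMPTS:
            alerts.append(f"repeated failed badge attempts: {person} ({count})")
    return alerts


if __name__ == "__main__":
    sample = [AccessEvent("bob", datetime(2024, 1, 8, 23, 0), "main", True)]
    print(find_alerts(sample))
```

In practice, staff with 24/7 credentials (Level 2 and above in the list above) would be excluded from the after-hours rule; that filter is omitted here for brevity.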

## Integration Points

- **DCIM platforms** (Nlyte, Sunbird): Data center infrastructure management; rack diagrams, capacity planning, power/cooling tracking, MACD workflow
- **Environmental monitoring** (SensorPush, Monitron, Raritan, Panduit): Temperature, humidity, water, air pressure sensors; SNMP integration; alerting
- **UPS management** (APC Network Management Card, Eaton XML, Vertiv GSM): UPS monitoring via SNMP; battery health; runtime estimation
- **PDU management** (Raritan PX, Panduit, Legrand, APC rack PDUs): Outlet-level power monitoring; remote reboot; circuit analysis
- **Access control** (Lenel, Genetec, HID): Badge management; video integration; access audit reporting
- **Video surveillance** (Axis, Bosch, Hanwha): Camera management; analytics (tailgating detection, loitering); NVR storage
- **Building management** (Siemens, Honeywell, Schneider): HVAC control; lighting; fire alarm; energy management
- **IT asset management** (Snipe-IT, ServiceNow): Equipment inventory; lifecycle tracking; warranty management; location tracking

## Edge Cases

- **High-density compute** (GPU/AI workloads at 25–50 kW per rack): Standard data centers are designed for 5–10 kW per rack; GPU servers require 3–5× more power and cooling; solutions: in-row cooling (APC InRow, Liebert), rear-door heat exchangers, liquid cooling (immersion, direct-to-chip); power: 400V PDUs, dedicated circuits
  - Power requirement: 25kW/rack needs 2× 50A 208V 3-phase circuits
  - Cooling requirement: liquid cooling or in-row CRAC units
  - Floor loading: GPU servers heavier; verify floor capacity (may need reinforcement)
  - Typical deployment: dedicated GPU rack rows with enhanced infrastructure
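
The circuit sizing for a 25 kW rack can be checked with the standard three-phase power formula, P = √3 × V × I. The sketch below assumes an 80% continuous-load derating on the breaker (common North American practice) and a power factor of 1.0; both are assumptions, and local electrical code should govern real designs.

```python
import math


def three_phase_circuit_kw(
    volts_line_to_line: float,
    breaker_amps: float,
    continuous_derate: float = 0.8,  # assumed continuous-load derating
    power_factor: float = 1.0,       # assumed; real loads are slightly lower
) -> float:
    """Usable power of one three-phase circuit, in kW."""
    watts = math.sqrt(3) * volts_line_to_line * breaker_amps
    return watts * continuous_derate * power_factor / 1000.0


def circuits_needed(rack_kw: float, volts_line_to_line: float, breaker_amps: float) -> int:
    """Minimum number of circuits to carry the rack load (no redundancy)."""
    per_circuit = three_phase_circuit_kw(volts_line_to_line, breaker_amps)
    return math.ceil(rack_kw / per_circuit)


if __name__ == "__main__":
    print(f"{three_phase_circuit_kw(208, 50):.1f} kW per 50A 208V circuit")
    print(circuits_needed(25, 208, 50), "circuits for a 25 kW rack")
```

This count covers the load only. If the rack is fed from redundant A and B sides, each side would need to carry the full load on its own, which increases the number of circuits.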

- **Colocation data center management** (shared facility): Less control over facility operations; rely on provider SLAs; understand shared vs. dedicated components; verify provider's N+1/2N redundancy claims; monitor cross-connect availability; plan for provider maintenance windows
  - Provider selection: Tier III/IV preferred; verify Uptime Institute certification
  - SLA requirements: 99.99%+ uptime; power and cooling credits for outages
  - Cross-connects: fiber connections to other tenants/cloud on-ramps; $50–$500/month per cross-connect
  - Monitoring: request provider's real-time power/cooling dashboards; deploy own environmental sensors

- **Data center migration** (moving to new facility): Plan 6–12 months ahead; inventory all equipment; plan transport (specialized data center movers: RST, Vertiv, DLS); staging area preparation; cut-over during maintenance window; parallel operation during transition; rollback plan
  - Timeline: 6–12 months planning; 2–4 weeks execution; 1–2 weeks validation
  - Cost: $50,000–$500,000+ depending on scale
  - Risk: minimize downtime; test failover before move; document everything
  - Cutover: typically weekend; 48–72 hour maintenance window

- **Power grid instability** (regions with unreliable utility power): Invest in larger UPS capacity (2–4 hours vs. 15 minutes); more frequent generator exercise; power quality monitoring (harmonics, transients); voltage regulators/conditioners; consider solar+battery for partial offset; negotiate power purchase agreements
  - UPS sizing: 2–4 hours for grid-unstable regions (vs. 15 minutes standard)
  - Generator fuel: 72-hour supply minimum (vs. 48 hours standard)
  - Power conditioning: isolate sensitive equipment from grid disturbances
  - Monitoring: real-time power quality analysis; alert on voltage sags/swells
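
Whether on-site fuel meets the 72-hour (or 48-hour) target depends on tank size and generator burn rate. A minimal sketch of that check follows; the parameter names are hypothetical, and the burn rate must come from the generator's data sheet at the expected load.

```python
def fuel_autonomy_hours(
    tank_capacity_l: float,
    fuel_level_pct: float,
    burn_rate_l_per_h: float,
) -> float:
    """Estimated generator runtime in hours from the current fuel level."""
    if burn_rate_l_per_h <= 0:
        raise ValueError("burn rate must be positive")
    return tank_capacity_l * fuel_level_pct / 100.0 / burn_rate_l_per_h


def meets_fuel_target(autonomy_hours: float, target_hours: float = 72.0) -> bool:
    """True if the estimated runtime covers the target (72 h for unstable grids)."""
    return autonomy_hours >= target_hours


if __name__ == "__main__":
    hours = fuel_autonomy_hours(tank_capacity_l=10_000, fuel_level_pct=80, burn_rate_l_per_h=100)
    print(f"{hours:.0f} h of fuel; meets 72 h target: {meets_fuel_target(hours)}")
```

Passing `target_hours=48` applies the standard target instead.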

- **Sustainability and energy efficiency** (green data center initiatives): Track and optimize PUE (target: < 1.5 for modern, < 1.2 for best-in-class); implement hot aisle/cold aisle containment; use free cooling (economizers) where climate allows; heat recovery for building heating; renewable energy sourcing; carbon footprint reporting
  - PUE = Total Facility Power / IT Equipment Power
  - Typical PUE: 1.5–2.0 (older), 1.2–1.5 (modern), 1.05–1.2 (best-in-class)
  - Optimization: airflow management (blank panels, containment), variable speed fans, warm water cooling
  - ROI: energy savings typically pay back in 1–3 years
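
The PUE formula and the typical ranges above can be combined into a quick check. This is a sketch only: the band labels follow the ranges listed in this section, and assigning the shared boundary values (1.2 and 1.5) to the better band is an assumption.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """PUE = Total Facility Power / IT Equipment Power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    if total_facility_kw < it_equipment_kw:
        raise ValueError("total facility power cannot be below IT power")
    return total_facility_kw / it_equipment_kw


def classify_pue(value: float) -> str:
    """Place a PUE value in the typical bands listed above."""
    if value <= 1.2:
        return "best-in-class"
    if value <= 1.5:
        return "modern"
    if value <= 2.0:
        return "older"
    return "above typical range"


if __name__ == "__main__":
    value = pue(total_facility_kw=1400, it_equipment_kw=1000)
    print(f"PUE {value:.2f}: {classify_pue(value)}")
```

Both inputs should be averaged over the same period, since instantaneous readings vary with outside temperature and IT load.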

- **Legacy equipment in data center** (old servers/storage still in use): Legacy equipment may use more power, generate more heat, lack modern monitoring; plan phased replacement; identify critical vs. non-critical legacy systems; document dependencies; allocate replacement budget annually; consider virtualization to consolidate legacy workloads
  - Power impact: older servers may use 3–5× more power for equivalent compute
  - Space impact: legacy servers often take more U-space per unit of compute than modern dense servers
  - Risk: end-of-support equipment; no security patches; parts unavailable
  - Strategy: virtualize where possible; replace on 3–5 year cycle; track end-of-life dates
