---
name: dns-management-monitoring
description: Manage DNS infrastructure including zone management, health monitoring, geographic routing, failover, DNSSEC security, and DNS analytics. Use when configuring DNS zones, monitoring DNS health, implementing geo-routing, setting up DNS failover, securing DNS with DNSSEC, analyzing DNS query patterns, optimizing TTL settings, or troubleshooting DNS resolution issues. Triggers on phrases like "DNS management", "DNS monitoring", "DNS failover", "DNSSEC", "geo-routing", "DNS health", "DNS zone", "DNS analytics", "name resolution", "TTL optimization".
---

# DNS Management & Monitoring

Manage DNS infrastructure for reliable name resolution, intelligent routing, and security.

## Workflow

1. Audit DNS infrastructure: zone files, record types, nameservers, DNSSEC status, TTL values.
2. Implement DNS health monitoring: query response time, resolution failures, propagation delays.
3. Configure geographic routing: route users to nearest data center or cloud region for lowest latency.
4. Set up DNS failover: health checks, automatic failover, DNS-based load balancing.
5. Deploy DNSSEC: generate keys, sign zones, configure DS records at parent zone, validate signatures.
6. Optimize DNS performance: TTL tuning, Anycast DNS, CDN integration, DNS-over-HTTPS/TLS.
7. Monitor DNS security: detect DNS tunneling, cache poisoning attempts, DDoS against DNS.
8. Conduct DNS analytics: query volume trends, top queried domains, geographic distribution.
9. Maintain DNS documentation: zone inventory, delegation records, contact information (RFC 3912).
10. Test DNS resilience quarterly: simulate nameserver failure, verify failover, measure RTO.

## DNS Zone Management

```
DNS ZONE INVENTORY AND MANAGEMENT
====================================

Zone file structure:

  $TTL 300                    ; Default TTL (seconds)
  $ORIGIN example.com.        ; Default domain
  @       IN      SOA       ns1.example.com. admin.example.com. (
                                2024011501  ; Serial (YYYYMMDDNN format)
                                7200        ; Refresh (seconds — secondary checks for updates)
                                3600        ; Retry (seconds — how often to retry on failure)
                                1209600     ; Expire (seconds — when secondary gives up)
                                86400       ; Minimum TTL (seconds — negative response caching)
                            )

  ; Nameservers
  @       IN      NS        ns1.example.com.
  @       IN      NS        ns2.example.com.
  @       IN      NS        ns3.example.com.

  ; A Records (IPv4)
  @       IN      A         203.0.113.10    ; apex domain
  www     IN      A         203.0.113.10
  api     IN      A         203.0.113.20
  mail    IN      A         203.0.113.30

  ; AAAA Records (IPv6)
  @       IN      AAAA      2001:db8::10
  www     IN      AAAA      2001:db8::10

  ; CNAME Records (aliases)
  blog    IN      CNAME     www.example.com.
  shop    IN      CNAME     shop.platform.com.

  ; MX Records (mail exchange — priority order)
  @       IN      MX        10 mail.example.com.
  @       IN      MX        20 mail2.example.com.

  ; TXT Records (SPF, DKIM, DMARC, verification)
  @       IN      TXT       "v=spf1 include:_spf.google.com ~all"
  _dmarc  IN      TXT       "v=DMARC1; p=reject; rua=mailto:dmarc@example.com"
  default-domain-key._domainkey IN TXT "v=DKIM1; k=rsa; p=MIGfMA0GCS..."

  ; SRV Records (service location)
  _sip._tcp IN SRV 10 60 5060 sip.example.com.

DNS record best practices:

  NS records:
    - Minimum 2 nameservers (different providers for redundancy)
    - Recommended: 3-4 nameservers across different networks
    - Use Anycast DNS providers (Cloudflare, AWS Route 53, Google Cloud DNS)
    - NS record TTL: typically 86400 (1 day); the TTL on the parent-side delegation is set by the registry

  A/AAAA records:
    - Always include both A and AAAA for IPv6 support
    - Use weighted records for load balancing
    - TTL: 300 (5 min) for frequently changing; 3600 (1 hr) for stable

  CNAME records:
    - Never use CNAME at apex (use ALIAS or ANAME instead)
    - Avoid CNAME chaining (> 3 hops — adds query latency)
    - CNAME cannot coexist with other record types at same name

  MX records:
    - Always have a backup MX with a higher preference number (higher number = tried later)
    - MX target must have A record (not CNAME)
    - Priority values: 10 (primary), 20 (backup), 30 (overflow)

  TXT records:
    - SPF: use -all (hard fail) after testing with ~all (soft fail)
    - DKIM: rotate keys annually (maintain 2 selectors for seamless rotation)
    - DMARC: start with p=none, move to quarantine, then reject
    - Limit SPF lookups to 10 (use include: sparingly)

TTL optimization:

  Record Type          Recommended TTL    Rationale
  ──────────────────   ───────────────    ───────────────────────────────────
  Apex A/AAAA          300 (5 min)        Allow quick failover; use ALIAS with provider
  www A/AAAA           60 (1 min)         Fast failover for customer-facing endpoints
  api A                30 (30 sec)        Rapid traffic shift during incidents
  MX                   3600 (1 hr)        Mail routing changes infrequently
  NS                   86400 (1 day)      Changed only for nameserver migration
  TXT (SPF/DKIM)       3600 (1 hr)        Changes rare; cache for performance
  CNAME (CDN)          300 (5 min)        CDN handles caching independently
  Internal records     60 (1 min)         Frequent changes in internal DNS

  TTL tradeoff: lower TTL = faster failover but more DNS query volume
  Rule of thumb: keep TTL at or below the maximum acceptable failover time minus the health check detection time
```
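
The `YYYYMMDDNN` serial convention in the SOA record is easy to get wrong by hand. The following is a minimal sketch of a serial bump, assuming a hypothetical helper named `next_serial` that is not part of any DNS tool; it only illustrates the arithmetic.

```python
from datetime import date


def next_serial(current: int, today: date) -> int:
    """Return the next SOA serial in YYYYMMDDNN form.

    If the current serial already carries today's date, bump the two-digit
    revision counter; otherwise start a fresh serial at revision 01.
    """
    base = int(today.strftime("%Y%m%d")) * 100
    if current >= base + 99:
        # Revision counter is exhausted, or the serial is ahead of the calendar.
        raise ValueError("no serial left for today in YYYYMMDDNN form")
    if current >= base:
        return current + 1
    return base + 1


print(next_serial(2024011501, date(2024, 1, 15)))  # 2024011502
print(next_serial(2024011501, date(2024, 1, 16)))  # 2024011601
```

Secondaries only pull a zone when the serial increases, so a pipeline that edits zone files would typically call something like this before every publish.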

## DNS Health Monitoring

```
DNS HEALTH MONITORING FRAMEWORK
=================================

Monitoring metrics:

  Query response time:
    - Internal resolver to root and TLD servers: < 100ms
    - Authoritative response time: < 50ms
    - CDN-integrated DNS (Cloudflare): < 10ms
    - Alert: response time > 200ms for 5 consecutive checks
    - Check frequency: every 30 seconds

  Resolution success rate:
    - Target: > 99.9% successful resolutions
    - Monitor: NXDOMAIN, SERVFAIL, REFUSED responses
    - Alert: SERVFAIL rate > 1% (indicates nameserver issues)
    - Track by: record type, domain, geographic location

  DNS propagation:
    - After zone change: monitor propagation to global resolvers
    - Use external DNS check tools (dnschecker.org, dig from multiple locations)
    - TTL-dependent: resolvers may serve the old answer until the TTL they cached before the change expires
    - Alert: propagation incomplete after TTL + 10 minutes

  Nameserver health:
    - Check each nameserver independently (every 60 seconds)
    - Verify all NS records respond consistently
    - Alert: any nameserver not responding for > 60 seconds
    - Alert: inconsistent responses between nameservers

  DNSSEC validation:
    - Monitor DNSSEC signing status (signed/unsigned zones)
    - Check key expiration (KSK, ZSK)
    - Alert: DNSSEC validation failures increasing
    - Alert: key expiring within 30 days

Monitoring tools and methods:

  External monitoring (customer perspective):
    - dig/nslookup from multiple geographic locations
    - Synthetic DNS queries every 30-60 seconds
    - Tools: DNSChecker, UptimeRobot DNS check, Datadog synthetic monitors
    - Monitor: resolution correctness, response time, TTL returned

  Internal monitoring (operator perspective):
    - BIND named statistics (if using BIND)
    - Cloud provider DNS metrics (Route 53 Query Logs, Cloudflare Analytics)
    - Prometheus DNS exporter for open-source resolvers
    - Track: query rate, cache hit ratio, upstream failures

  DNS query logging (for forensics):
    - Enable query logging on authoritative and recursive resolvers
    - Log: client IP, query name, query type, response code, response time
    - Volume: 10,000-1,000,000+ queries per hour per resolver
    - Retention: 30 days hot, 1 year archive
    - Use: troubleshooting, security investigation, capacity planning

  Alert configuration:

    Critical (page immediately):
      - All nameservers unresponsive for > 60 seconds
      - SERVFAIL rate > 5%
      - DNSSEC break (zone becomes invalid)
      - Domain expiration approaching (< 30 days)

    Warning (notify within 15 minutes):
      - Single nameserver unresponsive
      - Response time > 500ms
      - NXDOMAIN rate spike > 2x baseline
      - Zone transfer failure to secondary

    Info (daily summary):
      - Query volume trends
      - Top queried domains
      - TTL distribution analysis
      - New record change summary
```
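
The alert tiers above can be expressed as a small classifier. This is a sketch only: the `DnsSnapshot` fields and the `classify` function are hypothetical names, and the thresholds are the ones listed in the alert configuration (SERVFAIL above 5%, response time above 500 ms, NXDOMAIN above 2x baseline).

```python
from dataclasses import dataclass


@dataclass
class DnsSnapshot:
    """One monitoring interval's worth of aggregated DNS measurements."""
    total_queries: int
    servfail: int
    nxdomain: int
    baseline_nxdomain: int  # NXDOMAIN count in a comparable normal interval
    response_ms: float      # representative response time for the interval
    nameservers_total: int
    nameservers_down: int


def classify(s: DnsSnapshot) -> str:
    """Map a snapshot to 'critical', 'warning', or 'ok'."""
    servfail_rate = s.servfail / s.total_queries if s.total_queries else 0.0

    # Critical: every nameserver is down, or the SERVFAIL rate exceeds 5%.
    if s.nameservers_total and s.nameservers_down >= s.nameservers_total:
        return "critical"
    if servfail_rate > 0.05:
        return "critical"

    # Warning: one nameserver down, slow responses, or an NXDOMAIN spike.
    if s.nameservers_down > 0:
        return "warning"
    if s.response_ms > 500:
        return "warning"
    if s.baseline_nxdomain and s.nxdomain > 2 * s.baseline_nxdomain:
        return "warning"
    return "ok"


print(classify(DnsSnapshot(1000, 60, 10, 10, 20.0, 3, 0)))  # critical
```

Checking the critical conditions first keeps a page from being downgraded to a warning when several conditions hold at once.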

## Geographic Routing and Failover

```
GEOGRAPHIC ROUTING AND FAILOVER
=================================

Geographic routing strategies:

  Latency-based routing:
    - Route user to endpoint with lowest latency
    - DNS provider measures latency from resolver to endpoints
    - Automatic selection of fastest endpoint
    - Example: AWS Route 53 latency-based routing
    - Benefit: best user experience without manual geo-mapping
    - Limitation: resolver location may not match user location

  Geolocation-based routing:
    - Route based on DNS resolver's geographic location
    - Define regions: North America, Europe, Asia Pacific, etc.
    - Map regions to specific endpoints
    - Example: Cloudflare DNS geolocation rules
    - Benefit: predictable routing; compliance with data residency
    - Limitation: resolver ≠ user location (VPN, proxy)

  Weighted routing:
    - Distribute traffic by percentage weights
    - Example: 80% primary data center, 20% secondary
    - Use for: gradual migration, A/B testing, load distribution
    - Can combine with geographic routing (weighted within region)

  Multi-value routing:
    - Return multiple A records in response
    - Client can fall back to another returned address if one fails (client-side failover); answer order is not guaranteed
    - Example: return 3 API server IPs per query
    - Health checks remove unhealthy endpoints
    - Simpler than failover routing; no single point of failure

DNS-based failover:

  Active-passive failover:
    - Primary endpoint serves all traffic
    - Health checks monitor primary (HTTP, TCP, DNS checks)
    - On failure: DNS switches to secondary endpoint
    - Failover time: (health check interval × failure threshold) + TTL (typically 30-120 seconds)
    - Example: AWS Route 53 failover routing policy

  Active-active failover:
    - Multiple endpoints serve traffic simultaneously
    - Health checks on all endpoints
    - Unhealthy endpoints automatically removed from rotation
    - No single failover event; continuous health evaluation
    - Example: multi-region load balancing with health checks

  Health check configuration:

    Check type:            Port    Interval    Failure Threshold   Recovery Threshold
    ────────────────────  ──────  ──────────  ──────────────────  ────────────────────
    HTTP/HTTPS             443     10 sec      3 consecutive       3 consecutive
    TCP                    443     10 sec      3 consecutive       2 consecutive
    ICMP (ping)            —       30 sec      3 consecutive       2 consecutive
    DNS                    53      30 sec      3 consecutive       2 consecutive
    String match (HTTP)    443     10 sec      3 consecutive       3 consecutive

    HTTP health check details:
      - Method: GET or HEAD
      - Path: /health or /status (lightweight endpoint)
      - Expected status: 200 OK
      - Expected body string: "OK" (optional, more specific)
      - Timeout: 5 seconds
      - Protocol: HTTPS with SNI

  Failover testing (quarterly):
    - Simulate primary failure (stop service or block health check)
    - Measure failover time (health check detection + DNS propagation)
    - Verify traffic routing to secondary
    - Test failback (restore primary, verify traffic returns)
    - Document results; update runbook if needed

  Geographic failover example:

    Regions and endpoints:
      US East:    api-use1.example.com → 203.0.113.10 (primary)
      US West:    api-usw1.example.com → 203.0.113.20 (failover)
      EU West:    api-euw1.example.com → 198.51.100.10 (primary)
      EU Central: api-euc1.example.com → 198.51.100.20 (failover)
      APAC:       api-ap1.example.com  → 198.51.100.30 (primary)

    Routing rules:
      North America resolver → US East (primary), US West (failover)
      Europe resolver       → EU West (primary), EU Central (failover)
      Asia resolver         → APAC (primary), US West (failover)
      All others            → US East (primary), EU West (failover)

    Failover scenario: US East goes down
      - Health check detects failure in 30 seconds
      - DNS switches North America traffic to US West
      - Existing connections: may fail; clients retry with fresh DNS
      - New connections: routed to US West automatically
      - Total customer-visible impact: 30-120 seconds
```
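
To make the threshold and routing logic concrete, here is a minimal sketch under stated assumptions. `HealthTracker`, `pick_endpoint`, and `worst_case_failover_seconds` are hypothetical names; the region keys are shorthand for the endpoints in the example; and detection time is treated as interval times failure threshold, which matches the 30-second detection in the scenario (10 seconds × 3 checks).

```python
from typing import Optional


class HealthTracker:
    """Tracks one endpoint's health using consecutive-result thresholds."""

    def __init__(self, failure_threshold: int = 3, recovery_threshold: int = 2):
        self.failure_threshold = failure_threshold
        self.recovery_threshold = recovery_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Feed one health check result; return the endpoint's current state."""
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with the current state
            return self.healthy
        self._streak += 1
        needed = self.failure_threshold if self.healthy else self.recovery_threshold
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


# Routing rules from the example: resolver region -> (primary, failover).
ROUTES = {
    "north-america": ("us-east", "us-west"),
    "europe": ("eu-west", "eu-central"),
    "asia": ("apac", "us-west"),
}
DEFAULT_ROUTE = ("us-east", "eu-west")


def pick_endpoint(region: str, healthy: dict) -> Optional[str]:
    """Return the first healthy endpoint for a resolver region, else None."""
    for endpoint in ROUTES.get(region, DEFAULT_ROUTE):
        if healthy.get(endpoint, False):
            return endpoint
    return None


def worst_case_failover_seconds(interval: int, failure_threshold: int, ttl: int) -> int:
    """Detection time plus the time cached answers may still point at the old endpoint."""
    return interval * failure_threshold + ttl


print(pick_endpoint("north-america", {"us-east": False, "us-west": True}))  # us-west
print(worst_case_failover_seconds(10, 3, 60))  # 90
```

Requiring consecutive results in both directions avoids flapping: one dropped probe does not trigger failover, and one lucky probe does not trigger failback.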

## DNSSEC Implementation

```
DNSSEC IMPLEMENTATION GUIDE
==============================

DNSSEC overview:
  - Adds cryptographic signatures to DNS records
  - Prevents DNS spoofing, cache poisoning, man-in-the-middle attacks
  - Does NOT encrypt DNS (use DNS-over-HTTPS/TLS for that)
  - Adds overhead: larger responses (~50% increase), signing operations

  Chain of trust:
    Root Zone → TLD (.com, .org) → Your Domain → Subdomains
    Each level signs the level below it

Implementation steps:

  Step 1: Generate DNSSEC keys (using BIND's dnssec-keygen, OpenDNSSEC, or provider tools)

    # Generate Key Signing Key (KSK) — long-lived, signs DNSKEY record
    dnssec-keygen -a RSASHA256 -b 2048 -f KSK -n ZONE example.com

    # Generate Zone Signing Key (ZSK) — rotated more frequently, signs all records
    dnssec-keygen -a RSASHA256 -b 1024 -n ZONE example.com

    Key types:
      KSK (Key Signing Key):
        - Size: 2048 bits (RSA) or Ed25519
        - Lifetime: 1-2 years
        - Signs: DNSKEY record only
        - Used by: parent zone for DS record

      ZSK (Zone Signing Key):
        - Size: 1024-2048 bits (RSA) or Ed25519
        - Lifetime: 30-90 days
        - Signs: all zone records
        - Rotated: automatically by DNSSEC software

  Step 2: Sign the zone

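    # Sign zone file with generated keys (replace <ksk-tag> and <zsk-tag> with the
    # key tags from the file names that dnssec-keygen printed)
    dnssec-signzone -o example.com -N INCREMENT -k Kexample.com.+008+<ksk-tag> zone/file.db Kexample.com.+008+<zsk-tag>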

    Output: zone file with RRSIG, DNSKEY, and NSEC/NSEC3 records added
    Verification: check that all record sets have RRSIG records

  Step 3: Publish DNSKEY records

    - DNSKEY records published in your zone
    - Contains public KSK and ZSK
    - Flag: 256 = ZSK, 257 = KSK

  Step 4: Create DS record at parent zone

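    - Generate the DS record from the KSK (-2 selects SHA-256, matching Digest Type 2 below;
      replace <ksk-tag> with the KSK's key tag)
    dnssec-dsfromkey -2 Kexample.com.+008+<ksk-tag>.key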

    - Submit DS record to domain registrar:
      Key Tag: [number]
      Algorithm: 8 (RSASHA256) or 15 (Ed25519)
      Digest Type: 2 (SHA-256)
      Digest: [hash value]

    - DS record bridges trust between your zone and TLD
    - Once the DS record is published, validating resolvers can verify your zone (the chain of trust is active)

  Step 5: Verify DNSSEC

    # Check DNSSEC chain
    dig +dnssec +trace example.com

    # Verify zone signing
    dig DNSKEY example.com
    dig +dnssec SOA example.com

    # Online validators
    - dnssec-debugger.verisignlabs.com
    - dnsviz.net

Key rotation schedule:

  ZSK rotation: every 30-90 days (automated)
    1. Generate new ZSK
    2. Sign zone with both old and new ZSK (dual-signing period: 7 days)
    3. Remove old ZSK after dual-signing period
    4. Resign zone with new ZSK only
    5. Publish updated zone

  KSK rotation: every 1-2 years (manual or automated)
    1. Generate new KSK
    2. Publish new DNSKEY (both old and new KSK)
    3. Submit new DS record to registrar (keep old DS active)
    4. Dual-signing period: 30 days minimum
    5. Remove old KSK and old DS record
    6. Resign zone with new KSK only

  Emergency key rollover (key compromise suspected):
    1. Immediately generate new KSK and ZSK
    2. Resign zone with new keys
    3. Update DS record at registrar
    4. Publish new zone
    5. Investigate and mitigate attack
```
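
The Key Tag and Digest fields submitted to the registrar in Step 4 are derived from the DNSKEY record. The sketch below shows that derivation as described in RFC 4034 (key tag in Appendix B, DS digest in section 5.1.4). The function names are hypothetical, and the two-byte "public key" in the usage line is a toy value, not a real key. In practice `dnssec-dsfromkey` or the DNS provider produces these values.

```python
import hashlib
import struct


def dnskey_rdata(flags: int, protocol: int, algorithm: int, public_key: bytes) -> bytes:
    """Build DNSKEY RDATA: 2-byte flags, 1-byte protocol, 1-byte algorithm, key."""
    return struct.pack("!HBB", flags, protocol, algorithm) + public_key


def key_tag(rdata: bytes) -> int:
    """Key tag per RFC 4034 Appendix B (applies to all algorithms except 1)."""
    acc = 0
    for i, byte in enumerate(rdata):
        # Even offsets are the high byte of a 16-bit word, odd offsets the low byte.
        acc += byte if i & 1 else byte << 8
    acc += (acc >> 16) & 0xFFFF
    return acc & 0xFFFF


def name_to_wire(name: str) -> bytes:
    """Encode a domain name in lowercase wire format (length-prefixed labels)."""
    labels = [label for label in name.lower().rstrip(".").split(".") if label]
    return b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"


def ds_digest_sha256(owner: str, rdata: bytes) -> str:
    """DS digest type 2: SHA-256 over owner name (wire format) + DNSKEY RDATA."""
    return hashlib.sha256(name_to_wire(owner) + rdata).hexdigest().upper()


# Toy KSK: flags 257, protocol 3, algorithm 8 (RSASHA256), placeholder key bytes.
rdata = dnskey_rdata(257, 3, 8, b"\x00\x01")
print(key_tag(rdata))  # 1034
print(ds_digest_sha256("example.com.", rdata))
```

Because the digest covers the owner name as well as the key, the same key published in two different zones yields two different DS records.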

## DNS Security

```
DNS SECURITY FRAMEWORK
========================

Threat types and mitigation:

  1. DNS Cache Poisoning:
     - Attack: inject false DNS records into resolver cache
     - Impact: users directed to malicious sites
     - Mitigation: DNSSEC (cryptographic validation); randomize source ports
     - Detection: DNSSEC validation failures; unexpected IP responses

  2. DDoS Against DNS Infrastructure:
     - Attack: flood DNS servers with queries (amplification attacks)
     - Impact: DNS servers overwhelmed; resolution failures
     - Volume: 10-100+ Gbps attack traffic
     - Mitigation: Anycast DNS (absorbs attack across multiple locations);
       rate limiting; upstream DDoS protection (Cloudflare, AWS Shield)
     - Detection: query volume spike > 10x baseline; response time degradation

  3. DNS Tunneling:
     - Attack: encode data in DNS queries/responses (exfiltration, C2)
     - Impact: data exfiltration bypassing firewall; command-and-control
     - Detection: unusually long subdomain labels (> 50 chars); high query volume
       to single domain; encoded data patterns (base64, hex in labels)
     - Mitigation: DNS query inspection; block known tunneling domains;
       monitor query entropy (high entropy = suspicious)
     - Alert: queries with label length > 60 characters; > 100 queries/min to unusual domain

  4. DNS Amplification Attacks:
     - Attack: use open DNS resolvers to amplify traffic to victim
     - Impact: victim overwhelmed; your resolver may be abused
     - Mitigation: disable recursive queries for external clients (authoritative only);
       implement response rate limiting (RRL); block spoofed source IPs
     - Detection: large response-to-query ratio; traffic to/from port 53 spikes

  5. Domain Hijacking:
     - Attack: unauthorized transfer or modification of domain DNS
     - Impact: complete loss of domain control; all services disrupted
     - Mitigation: registrar account lockdown; transfer lock; 2FA on registrar;
       separate email address for registrar account
     - Prevention: domain expiration monitoring (alert 90, 60, 30 days before);
       registrar account security audit annually

  6. DNS Rebinding:
     - Attack: exploit browser same-origin policy via changing DNS responses
     - Impact: access to internal services from browser
     - Mitigation: enable rebinding protection on resolvers (drop answers from
       external zones that contain private IPs); validate the Host header and
       require authentication on internal services
     - Detection: external DNS responses returning private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)

DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT):

  Purpose: encrypt DNS queries between client and resolver
  Prevents: ISP monitoring, DNS spoofing on unsecured networks, query manipulation

  DNS-over-TLS (DoT):
    - Port: 853 (separate from standard DNS port 53)
    - Protocol: TLS wrapper around DNS
    - Client support: Android 9+ (Private DNS), iOS 14+ and macOS 11+ (via configuration profiles)
    - Resolvers: Cloudflare (1.1.1.1), Google (8.8.8.8), Quad9 (9.9.9.9)

  DNS-over-HTTPS (DoH):
    - Port: 443 (same as HTTPS — harder to block)
    - Protocol: DNS queries encoded in HTTP/HTTPS
    - Client support: Firefox (default), Chrome 83+, Edge 84+
    - Resolvers: same as DoT; any resolver that exposes a DoH endpoint

  DNS-over-HTTPS for corporate:
    - Deploy internal DoH resolver (Unbound, Pi-hole with TLS)
    - Block external DoH to prevent policy bypass
    - Monitor: all DNS queries visible to corporate resolver
    - Configuration: push DoH settings via GPO/MDM to all devices
```
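
The tunneling heuristics above (long labels, high-entropy labels) can be prototyped in a few lines. This is a sketch, not a detection product: the 60-character limit comes from the alert rule above, while `ENTROPY_THRESHOLD` and `MIN_ENTROPY_LEN` are assumed values that would need tuning against real traffic, and the function names are hypothetical.

```python
import math
from collections import Counter

MAX_LABEL_LEN = 60       # alert threshold from the text above
ENTROPY_THRESHOLD = 4.0  # assumption: bits per character
MIN_ENTROPY_LEN = 20     # assumption: skip short labels, whose entropy is noisy


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def suspicious_labels(qname: str) -> list:
    """Return (label, reason) pairs for labels that look like encoded payloads."""
    findings = []
    for label in qname.rstrip(".").split("."):
        if len(label) > MAX_LABEL_LEN:
            findings.append((label, "length"))
        elif len(label) >= MIN_ENTROPY_LEN and shannon_entropy(label) > ENTROPY_THRESHOLD:
            findings.append((label, "entropy"))
    return findings


print(suspicious_labels("www.example.com"))  # []
print(suspicious_labels("abcdefghijklmnopqrstuvwxyz234567.t.example.com"))
```

Per-label checks like these produce false positives on CDN and telemetry hostnames, so they work better as one signal alongside per-domain query rate than as a blocking rule.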

## Integration Points

- **AWS Route 53**: Managed DNS; latency/geolocation/failover routing; health checks; DNSSEC; query logs; integration with AWS services
- **Cloudflare DNS**: Free managed DNS; Anycast global network; automatic DNSSEC; DoH/DoT; DDoS protection; analytics dashboard
- **Google Cloud DNS**: Managed DNS; Anycast; DNSSEC; health checks; logging; integration with GCP load balancers
- **Azure DNS**: Managed DNS; geo-DNS; traffic manager integration; DNSSEC; zone transfers
- **BIND (ISC)**: Open-source DNS server; most widely deployed; supports all DNS features; requires manual management
- **PowerDNS**: Open-source DNS server; API-driven; multiple backends (MySQL, PostgreSQL, LDAP); DNSSEC support
- **NS1 / Dyn / Oracle Cloud DNS**: Enterprise DNS; advanced traffic management; API-driven; global Anycast
- **DNS monitoring** (DNSChecker, UptimeRobot, Datadog): External DNS health checks; propagation monitoring; alerting

## Edge Cases

- **DNS change causing global outage** (misconfiguration, typo in zone file): Implement DNS change management: all zone changes via CI/CD pipeline; pre-change validation (DNSLint, zone file syntax check); staged rollout (update one nameserver, verify, then others); rollback within 60 seconds; maintain previous zone file version
  - Pre-change validation: check for duplicate records, syntax errors, missing SOA, TTL within policy
  - Staged rollout: update secondary NS first; verify health; update primary; monitor for 15 minutes
  - Emergency rollback: scripted one-command rollback to previous zone version
  - Alerting: real-time SERVFAIL monitoring; auto-page if resolution failure rate > 1%
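
The pre-change validation checks listed above can run as a pipeline step before any zone is published. The following is a minimal sketch, assuming records have already been parsed into `(name, ttl, rtype, rdata)` tuples; `validate_zone` is a hypothetical helper, and the TTL bounds are an assumed policy taken from the smallest and largest values in the TTL table.

```python
from collections import defaultdict

MIN_TTL, MAX_TTL = 30, 86400  # assumption: policy bounds from the TTL table


def validate_zone(records: list) -> list:
    """Return a list of human-readable problems; an empty list means the checks passed.

    Each record is a (name, ttl, rtype, rdata) tuple.
    """
    problems = []
    seen = set()
    types_by_name = defaultdict(set)

    for name, ttl, rtype, rdata in records:
        key = (name, rtype, rdata)
        if key in seen:
            problems.append(f"duplicate record: {name} {rtype} {rdata}")
        seen.add(key)
        types_by_name[name].add(rtype)
        if not MIN_TTL <= ttl <= MAX_TTL:
            problems.append(f"TTL {ttl} outside policy for {name} {rtype}")

    soa_count = sum(1 for _, _, rtype, _ in records if rtype == "SOA")
    if soa_count != 1:
        problems.append(f"expected exactly one SOA record, found {soa_count}")

    # A CNAME must be the only record type at its owner name.
    for name, types in types_by_name.items():
        if "CNAME" in types and len(types) > 1:
            problems.append(f"CNAME at {name} coexists with other record types")
    return problems


zone = [
    ("@", 3600, "SOA", "ns1.example.com. admin.example.com. 2024011501 7200 3600 1209600 86400"),
    ("@", 86400, "NS", "ns1.example.com."),
    ("www", 60, "A", "203.0.113.10"),
]
print(validate_zone(zone))  # []
```

A check like this complements, rather than replaces, a real zone file parser: it catches policy violations, while syntax errors are better left to the DNS server's own checker.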

- **Multi-provider DNS redundancy** (avoid single DNS provider dependency): Use DNS providers from at least 2 different companies; split NS records across providers (2 from Provider A, 2 from Provider B); keep zone data synchronized via AXFR/IXFR or a deployment job; test failover between providers quarterly
  - Example: Cloudflare (ns1-2.cloudflare.com) + AWS Route 53 (ns1-2.awsdns-xx.com)
  - Synchronization: automated job pushes zone changes to both providers
  - Risk: zone divergence if sync fails; monitor with checksum comparison
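
One way to implement the checksum comparison is to hash a canonical form of each provider's record set. This sketch assumes the records have already been fetched from each provider as `(name, rtype, rdata)` tuples; `zone_fingerprint` and `zones_in_sync` are hypothetical names.

```python
import hashlib


def zone_fingerprint(records: list) -> str:
    """Hash a zone's records independent of ordering, name case, and whitespace.

    Each record is a (name, rtype, rdata) tuple. TTLs are left out on purpose,
    since providers may rewrite them. NS and SOA records usually differ between
    providers and are best filtered out before calling this.
    """
    canonical = sorted(
        f"{name.lower().rstrip('.')}|{rtype.upper()}|{' '.join(rdata.split())}"
        for name, rtype, rdata in records
    )
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def zones_in_sync(provider_a: list, provider_b: list) -> bool:
    """True when both providers serve the same canonical record set."""
    return zone_fingerprint(provider_a) == zone_fingerprint(provider_b)


a = [("www.example.com.", "A", "203.0.113.10"), ("example.com.", "MX", "10  mail.example.com.")]
b = [("EXAMPLE.com", "mx", "10 mail.example.com."), ("www.example.com", "a", "203.0.113.10")]
print(zones_in_sync(a, b))  # True
```

The rdata is compared as-is apart from whitespace, so providers that normalize hostnames inside rdata differently (case, trailing dot) would show up as divergent and need an extra normalization step.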

- **Domain expiration risk** (domain not renewed, services go dark): Critical domains: set auto-renewal; monitor expiration date; alert at 180, 90, 60, 30 days before expiration; keep domain registration funded (12+ months); use domain registration lock; separate registrar contact from standard email
  - Cost of domain loss: hours/days of complete service outage; estimated revenue impact
  - Private registration: WHOIS privacy to prevent domain contact scraping
  - Defensive registration: register common typos and variations of the domain to prevent phishing
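
The expiration alert schedule reduces to a date comparison. Below is a small sketch using the 180, 90, 60, and 30 day thresholds named above; `expiry_alert` is a hypothetical helper, and the expiry date is assumed to come from the registrar's API or a WHOIS/RDAP lookup.

```python
from datetime import date
from typing import Optional

ALERT_DAYS = (180, 90, 60, 30)  # thresholds from the text above


def expiry_alert(expires: date, today: date) -> Optional[int]:
    """Return the tightest alert threshold the domain has crossed, or None.

    A domain 45 days from expiry has crossed the 180, 90, and 60 day marks,
    so the function reports 60.
    """
    days_left = (expires - today).days
    crossed = [t for t in ALERT_DAYS if days_left <= t]
    return min(crossed) if crossed else None


print(expiry_alert(date(2025, 3, 1), date(2025, 1, 15)))   # 60
print(expiry_alert(date(2026, 1, 15), date(2025, 1, 15)))  # None
```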

- **Large-scale DNS management** (1000+ zones, enterprise environment): DNS orchestration via API (Terraform, Ansible); zone template management; bulk operations; DNS audit logging; automated compliance checks (TTL policy, required records); team-based access control
  - Terraform for DNS: manage all zones as code; version control; PR review for changes
  - Audit trail: who changed what record, when, from what to what
  - Compliance: automated scan for zones missing DMARC, SPF, or DNSSEC
  - Automation: auto-create DNS records for new services via service mesh integration
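
The compliance scan for missing mail policies can be sketched as follows. It assumes TXT records have already been collected into a dict keyed by owner name, and it covers only SPF and DMARC; a DNSSEC check would need a separate DNSKEY/DS query. `missing_mail_policies` is a hypothetical helper name.

```python
def missing_mail_policies(zone: str, txt_records: dict) -> list:
    """Return the mail policies a zone lacks, based on its TXT records.

    txt_records maps a fully qualified owner name (no trailing dot) to the
    list of TXT strings published at that name.
    """
    missing = []
    apex_txt = txt_records.get(zone, [])
    if not any(value.lower().startswith("v=spf1") for value in apex_txt):
        missing.append("SPF")
    dmarc_txt = txt_records.get(f"_dmarc.{zone}", [])
    if not any(value.startswith("v=DMARC1") for value in dmarc_txt):
        missing.append("DMARC")
    return missing


records = {
    "example.com": ["v=spf1 include:_spf.google.com ~all"],
    "_dmarc.example.com": ["v=DMARC1; p=reject; rua=mailto:dmarc@example.com"],
}
print(missing_mail_policies("example.com", records))  # []
print(missing_mail_policies("example.org", {}))       # ['SPF', 'DMARC']
```

Run across an inventory of zones, the output gives a per-zone gap list that can feed the daily summary or a ticket queue.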

- **Internal DNS management** (split-horizon DNS, internal services): Separate internal and external DNS zones; internal resolvers for corporate network; split-horizon: different responses for internal vs external queries; internal DNS records for services not exposed externally (.internal, .corp zones)
  - Split-horizon: www.example.com returns public IP externally, internal load balancer IP internally
  - Tools: Active Directory DNS, BIND with view configuration, Infoblox
  - Security: internal DNS should not leak internal infrastructure details externally
  - Monitoring: separate health checks for internal and external DNS

- **DNS migration** (changing DNS providers): Plan 6-8 weeks ahead; copy all zones to the new provider and verify zone data matches; add new provider NS records alongside old at the registrar (dual-homing); monitor both providers for 2 weeks; remove old NS records; retire old provider
  - Week 1-2: set up new provider; copy all zones; verify record accuracy
  - Week 3-4: add new NS records at registrar (alongside existing)
  - Week 5-6: verify both providers respond identically; test failover
  - Week 7: remove old NS records from registrar
  - Week 8: monitor; retire old provider account; document lessons learned
