---
name: log-management-analysis
description: Centralize, collect, store, analyze, and derive operational intelligence from logs across the entire IT infrastructure. Use when implementing log management strategy, setting up centralized logging, investigating incidents through log analysis, creating log-based dashboards, configuring log retention and compliance, optimizing log storage costs, or building log-based alerting. Triggers on phrases like "log management", "centralized logging", "log analysis", "log aggregation", "log retention", "SIEM logs", "log search", "ELK stack", "log monitoring", "audit logs".
---

# Log Management & Analysis

Centralize and analyze logs across all systems for operations, security, and compliance.

## Workflow

1. Inventory all log sources across infrastructure (servers, applications, databases, network devices, cloud services, security tools).
2. Design log collection architecture: agents, protocols, formats, and pipeline reliability.
3. Deploy collection agents (Filebeat, Fluentd, Logstash, CloudWatch Agent) on all systems.
4. Configure log parsing, enrichment, and normalization into structured format (JSON).
5. Ingest logs into centralized platform (ELK, Splunk, Datadog, cloud-native).
6. Implement retention policies: hot (30 days), warm (90 days), cold (1 year), archive (7 years).
7. Build dashboards, alerts, and saved searches for operations and security teams.
8. Establish SLAs for the log pipeline itself: ingestion latency < 10 seconds, query response < 5 seconds.
9. Conduct regular log audit: coverage verification, parser accuracy, cost optimization.
10. Optimize storage: log sampling, compression, tiered storage, field pruning.
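
The log sampling mentioned in step 10 can be sketched as a small filter. This is a minimal illustration rather than a prescribed implementation: the function name `keep_log`, the use of a trace ID as the sampling key, and the CRC32 hash are all assumptions made for the example. WARN and above are always kept, and lower levels are sampled deterministically.

```python
import zlib

# Levels that are never sampled away (assumption: WARN and above are always kept).
ALWAYS_KEEP = {"WARN", "ERROR", "FATAL"}


def keep_log(level: str, trace_id: str, sample_one_in: int = 100) -> bool:
    """Decide whether a log event should be shipped.

    Sampling is keyed on the trace ID, so every event belonging to a
    sampled trace is kept together instead of being dropped at random.
    """
    if level.upper() in ALWAYS_KEEP:
        return True
    # crc32 is stable across processes, unlike the built-in hash() for strings.
    return zlib.crc32(trace_id.encode("utf-8")) % sample_one_in == 0


# Hypothetical events: (level, trace_id)
events = [("ERROR", "req-1"), ("INFO", "req-1"), ("DEBUG", "req-2")]
shipped = [event for event in events if keep_log(*event)]
print(shipped)
```

Hashing the trace ID instead of drawing a random number keeps the decision repeatable, so a sampled request can still be followed across services.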

## Log Sources and Collection

### Infrastructure Log Sources

```
LOG SOURCE INVENTORY
=====================

Operating System Logs:

  Linux (RHEL/CentOS/Ubuntu):
    /var/log/syslog or /var/log/messages    — General system messages
    /var/log/auth.log or /var/log/secure    — Authentication events (sshd, sudo)
    /var/log/kern.log                        — Kernel messages
    /var/log/dmesg                           — Boot-time kernel messages
    /var/log/cron                            — Scheduled task execution
    /var/log/daemon.log                      — Background service logs
    /var/log/boot.log                        — Boot process log
    /var/log/faillog                         — Failed login attempts
    /var/log/wtmp and /var/log/btmp          — Login/logout records (wtmp) and failed logins (btmp); binary, read with last/lastb

  Windows:
    Application Event Log                    — Application-generated events
    Security Event Log                       — Login attempts, access control (SIEM critical)
    System Event Log                         — OS-level events, driver issues
    Setup Event Log                          — Installation/uninstallation events
    Forwarded Events                         — Centralized from other Windows servers
    PowerShell transcript logs               — Script execution auditing
    IIS logs                                — Web server access/error logs (W3C format)

  Log volume estimate:
    - Typical Linux server: 50MB–500MB/day
    - Typical Windows server: 100MB–1GB/day
    - With verbose debugging: 5GB–20GB/day (avoid in production)
```
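
Most of the Linux files listed above use the traditional syslog line layout, so a parser for that layout covers a large share of OS log sources. The sketch below assumes the common `Mon DD HH:MM:SS host program[pid]: message` format. The regular expression, function name, and sample line are illustrative, and custom rsyslog templates may produce a different layout.

```python
import re
from typing import Optional

# Traditional syslog layout: "Mon DD HH:MM:SS host program[pid]: message".
# This pattern is an assumption for illustration; the pid part is optional.
SYSLOG_LINE = re.compile(
    r"^(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<program>[^\[:\s]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)$"
)


def parse_syslog_line(line: str) -> Optional[dict]:
    """Return the named fields of a syslog line, or None if it does not match."""
    match = SYSLOG_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


# Hypothetical line in the style of /var/log/auth.log.
sample = (
    "Jan 12 03:14:07 web-01 sshd[2211]: "
    "Failed password for invalid user admin from 203.0.113.9 port 52144 ssh2"
)
record = parse_syslog_line(sample)
print(record["host"], record["program"], record["pid"])
```

Lines that return `None` are candidates for a dead letter queue rather than silent drops.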

### Application Log Sources

```
APPLICATION LOG SOURCES
========================

Web Applications:
  Access logs (Nginx/Apache/IIS):
    Format: Combined Log Format, or Common Log Format (CLF), which omits referer and user-agent
    Fields: IP, timestamp, method, URL, status code, response size, referer, user-agent
    Volume: 10,000–1,000,000+ entries per hour depending on traffic
    Use cases: Traffic analysis, error rate tracking, geographic distribution

  Application framework logs:
    Java (Log4j/Logback): DEBUG, INFO, WARN, ERROR levels; structured JSON output
    .NET (NLog/Serilog): Structured logging with correlation IDs
    Python (logging, structlog): JSON-formatted with context fields
    Node.js (Winston, Pino): High-performance structured logging
    Volume: 100MB–10GB/day depending on log level and traffic

API Gateways:
  AWS API Gateway: CloudWatch Logs (request/response metrics + full logs)
  Kong/Envoy: Access logs with latency, status, upstream service
  Azure API Management: Diagnostics logs
  Volume: 50MB–5GB/day per gateway

Database Logs:
  PostgreSQL: Slow query log, checkpoint, connection, replication
  MySQL: General query log, slow query log, error log, binary log
  SQL Server: Error log, default trace, extended events, agent job history
  MongoDB: Diagnostic log, slow query profiler, replication log
  Volume: 100MB–20GB/day (slow query logs especially valuable)

Microservices/Container Logs:
  stdout/stderr from containers (collected by Docker/containerd logging driver)
  Sidecar containers (Fluentd, Fluent Bit) in Kubernetes
  EFK stack (Elasticsearch, Fluentd, Kibana) or ELK (Logstash) pattern
  Volume: 10MB–500MB/day per service instance
```
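
As an illustration of turning web access logs into structured records, the sketch below parses one Combined Log Format line into the fields listed above. The pattern and the function name `parse_access_line` are assumptions for this example; real Nginx or Apache configurations often customize the format, so the expression would need adjusting.

```python
import re
from datetime import datetime
from typing import Optional

# Combined Log Format: ip ident user [time] "request" status size "referer" "user-agent"
COMBINED_LOG = re.compile(
    r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"$'
)


def parse_access_line(line: str) -> Optional[dict]:
    """Parse one access log line into a dict, or return None if it does not match."""
    match = COMBINED_LOG.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A size of "-" means no response body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    # Normalize the timestamp to ISO 8601.
    parsed = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["time"] = parsed.isoformat()
    return fields


# Hypothetical sample line.
sample = (
    '203.0.113.9 - - [10/Oct/2024:13:55:36 +0000] '
    '"GET /api/orders?id=7 HTTP/1.1" 500 2326 '
    '"https://example.com/" "Mozilla/5.0"'
)
entry = parse_access_line(sample)
print(entry["time"], entry["method"], entry["url"], entry["status"])
```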

### Security and Compliance Log Sources

```
SECURITY LOG SOURCES
=====================

Firewall Logs:
  Cisco ASA/Firepower: Connection logs, threat prevention, malware detection
  Palo Alto Networks: Traffic, threat, URL filtering, user identification logs
  Fortinet: Traffic, event, security logs
  Volume: 100MB–10GB/day; critical for SOC operations

IDS/IPS:
  Snort/Suricata: Alert logs with signature matches
  Zeek: Connection, DNS, HTTP, SSL/TLS, X.509 certificate metadata
  Volume: 500MB–20GB/day on perimeter

Endpoint Security:
  CrowdStrike Falcon: Process execution, network connections, file modifications
  SentinelOne: EDR events, threat detection, behavioral analysis
  Defender for Endpoint: Attack surface reduction, device events
  Volume: 10MB–500MB/day per endpoint

Identity/Authentication:
  Active Directory: Logon/logoff, group membership changes, password resets
  Okta/Azure AD: Sign-in logs, token issuance, conditional access evaluations
  Volume: 1–50MB/day per authentication source

Cloud Audit Logs:
  AWS CloudTrail: API calls, management events, data events (S3 access)
  Azure Activity Log: Resource management operations
  GCP Audit Logs: Admin, data, system, policy activity
  Volume: 50MB–5GB/day (S3 data events can be massive — use selective logging)
```

## Log Collection Architecture

```
LOG COLLECTION ARCHITECTURE
============================

Collection Pattern 1: Agent-Based (Recommended for servers)

  Server → Log Agent (Filebeat/Fluent Bit) → Log Shipper (Logstash/Fluentd) → Broker (Kafka/Redis) → Index (Elasticsearch/S3)

  Pros: Reliable, handles network issues, buffer locally, can enrich at source
  Cons: Agent maintenance on each server, version management
  Best for: On-prem servers, EC2 instances, VMs, hybrid environments

Collection Pattern 2: Cloud-Native (Recommended for cloud environments)

  Cloud Service → CloudWatch Logs / Azure Monitor / Cloud Logging → Kinesis / Event Hub → Index

  Pros: No agents needed, integrated billing, automatic scaling
  Cons: Vendor lock-in, limited parsing capabilities, expensive at scale
  Best for: Fully cloud-native workloads

Collection Pattern 3: Sidecar Pattern (Recommended for containers/K8s)

  Pod [App + Log Sidecar] → DaemonSet Collector → Message Queue → Index

  Pros: Container-aware, survives pod restarts, no app changes needed
  Cons: Additional resource usage per pod, complexity
  Best for: Kubernetes environments

Collection Pattern 4: Syslog Forwarding (For network devices)

  Router/Switch/Firewall → Syslog Server (rsyslog) → Log Processor → Index

  Pros: Standard protocol, no agent installation needed
  Cons: Unencrypted by default (use syslog over TLS), UDP transport can drop messages, no buffering on most devices
  Best for: Network devices, legacy systems

Log pipeline reliability requirements:
  - Zero data loss: persistent queue between collector and shipper
  - Retry logic: exponential backoff for temporary failures
  - Dead letter queue: capture unparseable logs for manual review
  - Pipeline monitoring: track ingestion rate, lag, error rate
  - Alerting: pipeline down, lag > 5 minutes, error rate > 1%

Log format standardization:

  Recommended fields for all logs (normalized):
    timestamp (ISO 8601)
    level (DEBUG, INFO, WARN, ERROR, FATAL)
    message (free-text description)
    source (hostname, service name, IP)
    service (application/service identifier)
    request_id / trace_id (correlation across services)
    user_id / username (if applicable)
    environment (prod, staging, dev)
    version (application version)
    metadata (key-value pairs for context)
```
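
One way to emit the recommended fields from a Python service is a custom `logging.Formatter`. The sketch below is an assumption-laden example, not a required design: the class name `NormalizedJsonFormatter`, the level mapping, and the sample values are all hypothetical. Per-request fields such as `request_id` are read from attributes on the log record, which is where the `extra=` argument of the standard logging calls puts them.

```python
import json
import logging
import socket
from datetime import datetime, timezone

# Python names these levels WARNING and CRITICAL; map them to the normalized names.
LEVEL_NAMES = {"WARNING": "WARN", "CRITICAL": "FATAL"}


class NormalizedJsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with the normalized fields."""

    def __init__(self, service: str, environment: str, version: str):
        super().__init__()
        self.static_fields = {
            "source": socket.gethostname(),
            "service": service,
            "environment": environment,
            "version": version,
        }

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        document = {
            "timestamp": created.isoformat(),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname),
            "message": record.getMessage(),
            **self.static_fields,
            # Optional context; absent attributes become null / empty.
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
            "metadata": getattr(record, "metadata", {}),
        }
        return json.dumps(document)


# Build a record by hand to show the output without configuring handlers.
formatter = NormalizedJsonFormatter("payment-service", "prod", "1.4.2")
record = logging.LogRecord(
    "app", logging.WARNING, "example.py", 0,
    "card declined for order %s", ("A-17",), None,
)
record.request_id = "req-123"
line = formatter.format(record)
print(line)
```

In a real service the formatter would be attached to a handler with `handler.setFormatter(...)`, and callers would pass `extra={"request_id": ...}` when logging.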

## Log Storage and Retention

```
LOG STORAGE AND RETENTION POLICY
==================================

Tiered storage architecture:

  Tier 1 — Hot Storage (immediate access, indexed):
    Retention: 30 days
    Technology: Elasticsearch cluster, Splunk Hot DB, Datadog Logs
    Performance: < 1 second query response
    Cost: $3–$8 per GB/month
    Content: Current production logs, active investigation data

  Tier 2 — Warm Storage (near-line, compressed):
    Retention: 30–90 days
    Technology: Elasticsearch warm nodes, Splunk Warm/Cold DB, S3 + Athena
    Performance: 5–30 second query response
    Cost: $0.50–$2 per GB/month
    Content: Recent historical data, trend analysis

  Tier 3 — Cold Storage (long-term, archived):
    Retention: 1–2 years
    Technology: S3 Glacier, Azure Archive, GCP Nearline, Azure Cold Blob
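    Performance: retrieval can take hours for archive classes (e.g., 3–12 hours for S3 Glacier standard and bulk retrievals); GCP Nearline and Azure Cold Blob remain online with per-GB retrieval fees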
    Cost: $0.01–$0.05 per GB/month
    Content: Compliance archives, long-term trend data

  Tier 4 — Legal Hold (indefinite):
    Retention: 7+ years or per regulation
    Technology: Immutable storage (S3 Object Lock, WORM compliance)
    Cost: $0.01–$0.03 per GB/month
    Content: Financial audit logs, security incident evidence, regulatory data

Retention by log type (minimum requirements):

  Security/audit logs:         1 year minimum (hot + warm), 7 years archive (compliance)
  Application error logs:      90 days (covers release cycles + support period)
  Application access logs:     30 days hot, 1 year archive
  System/OS logs:              30 days (unless needed for specific compliance)
  Network logs:                90 days (for security investigation window)
  Database audit logs:         1 year (PCI-DSS requires 1 year minimum)
  Cloud audit logs:            90 days minimum (CloudTrail/Activity Log)
  Debug/trace logs:            7 days (high volume, short-term troubleshooting only)

Storage cost estimation:

  Medium company (500 servers, 50 services):
    Daily log volume: 500GB–2TB/day
    Monthly volume: 15TB–60TB/month
    Annual volume: 180TB–720TB/year

  Estimates below use the low end of the range (500GB/day) at steady state:

  Hot storage (days 1–30, 15TB): $45K–$120K/month → $540K–$1.44M/year
  Warm storage (days 31–90, 30TB): $15K–$60K/month → $180K–$720K/year
  Cold storage (days 91–365, ~137.5TB): $1.4K–$6.9K/month → $16.5K–$82.5K/year

  Total annual log storage: $737K–$2.24M (varies significantly by platform)

  Cost optimization:
    - Log sampling for DEBUG/INFO levels (1 in 100) → 99% volume reduction for those levels
    - Field pruning: extract and index only needed fields
    - Compression: Zstd (roughly 5:1–10:1 on text logs) or Snappy (faster, lower ratio)
    - Deduplication: collapse repeated log messages
    - Auto-delete non-production logs after 7 days
```
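
The tier boundaries and per-GB prices above can be turned into a small calculator. The sketch below makes several simplifying assumptions that are not in the original text: volumes are uncompressed, the system is at steady state, each tier holds only the days in its own age window (hot 1 to 30, warm 31 to 90, cold 91 to 365), and the names `Tier`, `tier_for_age`, and `monthly_cost_by_tier` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    last_day: int            # logs leave this tier after this age in days
    low_usd_per_gb: float    # low end of the monthly price range
    high_usd_per_gb: float   # high end of the monthly price range


# Age windows and prices from the tier list; cold is capped at 365 days here.
TIERS = [
    Tier("hot", 30, 3.00, 8.00),
    Tier("warm", 90, 0.50, 2.00),
    Tier("cold", 365, 0.01, 0.05),
]


def tier_for_age(age_days: int) -> str:
    """Return the tier that should hold a log of the given age."""
    for tier in TIERS:
        if age_days <= tier.last_day:
            return tier.name
    return "archive"


def monthly_cost_by_tier(daily_gb: float) -> dict:
    """Steady-state monthly cost range (low, high) in USD for each tier."""
    costs = {}
    first_day = 0
    for tier in TIERS:
        # Each tier stores only the days that fall inside its own age window.
        stored_gb = daily_gb * (tier.last_day - first_day)
        costs[tier.name] = (
            round(stored_gb * tier.low_usd_per_gb, 2),
            round(stored_gb * tier.high_usd_per_gb, 2),
        )
        first_day = tier.last_day
    return costs


costs = monthly_cost_by_tier(daily_gb=500)  # low end of the 500GB-2TB/day range
for name, (low, high) in costs.items():
    print(f"{name}: ${low:,.0f}-${high:,.0f} per month")
```

Because hot storage dominates the total, shortening the hot window or sampling before ingestion usually moves the bill far more than cheaper archive storage does.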

## Log Analysis and Investigation

```
LOG-BASED ANALYSIS FRAMEWORK
==============================

Operational analysis (daily monitoring):

  Error rate dashboard:
    - Application errors by service (group by service, level)
    - Error rate trend (hourly, daily, weekly)
    - Error correlation with deployments (overlay deployment timeline)
    - Top error messages (frequency analysis)
    - Alert: error rate > 1% of total requests, or > 10 errors/minute

  Performance analysis:
    - Response time percentiles (p50, p95, p99) from access logs
    - Slow query analysis from database logs (queries > 1 second)
    - Connection pool exhaustion events
    - Memory usage correlation with error patterns
    - Alert: p99 response time > 2 seconds, or 50% increase vs baseline

  Availability monitoring:
    - Service health from application heartbeat logs
    - Upstream dependency failures from error logs
    - Database connection failures
    - DNS resolution errors
    - Alert: service health check failures > 3 consecutive

Security analysis (continuous monitoring):

  Authentication anomalies:
    - Failed login attempts > 5 per minute per IP (brute force detection)
    - Login from unusual geographic location
    - Login at unusual time (outside business hours)
    - Privilege escalation attempts (sudo failures, admin access from non-admin)
    - Alert: any of above patterns

  Data access patterns:
    - Unusual data download volumes (potential exfiltration)
    - Access to sensitive endpoints from unauthorized IPs
    - Database queries accessing sensitive tables (PII, financial data)
    - API key usage anomalies (new IP, volume spike)
    - Alert: data access volume > 3 standard deviations from baseline

  Network security:
    - Port scanning patterns (multiple ports from single IP in short time)
    - DNS tunneling detection (excessive DNS queries with encoded data)
    - Unencrypted protocol usage (HTTP, FTP, Telnet)
    - Firewall block rate increase
    - Alert: any port scan pattern, DNS tunneling signature

Incident investigation workflow:

  Step 1: Identify the scope
    - What time range? (when was issue first reported)
    - Which systems affected? (affected services, servers, regions)
    - What is the impact? (users affected, revenue impact)

  Step 2: Gather evidence
    - Search logs from affected systems for error patterns
    - Follow correlation IDs/trace IDs across services
    - Check deployment history (was there a recent change?)
    - Review infrastructure changes (scaling, configuration)

  Step 3: Root cause analysis
    - Identify the first error in the chain
    - Trace dependency failures
    - Correlate with external events (DDoS, cloud outage, certificate expiry)
    - Review code changes (git log, PR history)

  Step 4: Documentation
    - Document timeline with timestamps
    - Capture key log snippets as evidence
    - Create incident report with RCA
    - Update runbooks and detection rules

Common log queries (saved searches; generic pseudo-syntax, adapt to your platform's query language):

  # Find all errors for a service in last hour
  service:payment-service level:error | stats count by message | sort -count

  # Find failed logins in last 15 minutes
  source:auth level:error "failed login" | stats count by ip, username | sort -count

  # Track deployment correlation with errors
  (deployment OR error) service:api | timespan 5m | stats count by type

  # Identify slow queries (assumes query_time is in milliseconds, so > 1 second)
  source:database query_time:>1000 | stats avg(query_time), count by query_hash | sort -avg(query_time)

  # Detect data exfiltration pattern
  source:firewall action:allow bytes_out:>100000000 | stats sum(bytes_out) by dst_ip | sort -sum(bytes_out)
```
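
The brute-force rule above (more than 5 failed logins per minute per IP) can be sketched as a sliding window over parsed authentication events. The function name `find_brute_force_ips` and the event shape, a `(timestamp, ip)` tuple per failed login, are assumptions for illustration; a log platform would normally express this as a scheduled query or streaming rule.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def find_brute_force_ips(events, threshold=5, window=timedelta(minutes=1)):
    """Return IPs with more than `threshold` failed logins inside any window.

    `events` is an iterable of (timestamp, ip) tuples sorted by timestamp.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for timestamp, ip in events:
        timestamps = recent[ip]
        timestamps.append(timestamp)
        # Drop failures that have aged out of the window.
        while timestamp - timestamps[0] >= window:
            timestamps.popleft()
        if len(timestamps) > threshold:
            flagged.add(ip)
    return flagged


start = datetime(2024, 1, 1, 9, 0, 0)
# Hypothetical data: one IP fails 6 times in 25 seconds, another fails every 30 seconds.
noisy = [(start + timedelta(seconds=5 * i), "198.51.100.7") for i in range(6)]
slow = [(start + timedelta(seconds=30 * i), "203.0.113.9") for i in range(6)]
flagged = find_brute_force_ips(sorted(noisy + slow))
print(flagged)
```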

## Log-Based Alerting

```
LOG-BASED ALERTING CONFIGURATION
===================================

Alert rules (examples by priority):

  P1 — Critical (page on-call immediately):
    - Production service error rate > 5% for 2 minutes
    - Authentication breach detected (successful login after 50+ failures)
    - Data loss risk detected (database corruption, unplanned RDS failover)
    - Ransomware indicators (mass file encryption in logs)
    - Channel: PagerDuty/Opsgenie → phone call + SMS + email
    - Response SLA: 5 minutes

  P2 — High (notify team within 15 minutes):
    - Production service error rate > 1% for 5 minutes
    - Database connection pool at 90% capacity
    - Disk usage > 85% on production servers
    - SSL certificate expiring within 7 days
    - Cloud cost anomaly (> 20% above baseline)
    - Channel: Slack/Teams #incidents channel + email
    - Response SLA: 15 minutes

  P3 — Medium (notify team within 1 hour):
    - Staging environment errors
    - Non-critical service degraded performance
    - Unusual but not alarming security events
    - Log pipeline lag > 5 minutes
    - Channel: Slack/Teams team channel
    - Response SLA: 1 hour

  P4 — Low (daily summary):
    - Non-production warning logs
    - Informational compliance events
    - Capacity trends approaching thresholds
    - Channel: Daily digest email
    - Response SLA: Next business day

Alert tuning best practices:
  - Require minimum duration before alerting (avoid flapping)
  - Implement alert deduplication (same root cause = single alert)
  - Use runbook links in every alert
  - Track the share of alerts that lead to action (goal: > 80% true positives)
  - Review and tune alerts monthly (disable stale alerts)
  - Never alert on metrics that cannot be acted upon
  - Implement alert escalation (P1: escalate at 5, 15, and 30 minutes without acknowledgement)
```
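
The "minimum duration before alerting" practice can be illustrated with the two error-rate rules above (P1: above 5% for 2 minutes, P2: above 1% for 5 minutes). This is a sketch under assumptions: error rates arrive as one sample per minute, and the `RULES` table and `classify` function are hypothetical names.

```python
from typing import Optional, Sequence

# (priority, error-rate threshold, consecutive minutes required), most severe first.
RULES = [
    ("P1", 0.05, 2),
    ("P2", 0.01, 5),
]


def classify(error_rates_per_minute: Sequence[float]) -> Optional[str]:
    """Return the highest priority whose threshold held for its full duration.

    Only the most recent samples are checked, so a single spike that has
    not lasted long enough does not raise an alert.
    """
    for priority, threshold, minutes in RULES:
        tail = error_rates_per_minute[-minutes:]
        if len(tail) == minutes and all(rate > threshold for rate in tail):
            return priority
    return None


print(classify([0.002, 0.06, 0.07]))          # sustained > 5% for 2 minutes
print(classify([0.02, 0.02, 0.03, 0.02, 0.02]))  # sustained > 1% for 5 minutes
print(classify([0.0, 0.0, 0.0, 0.0, 0.09]))   # one-minute spike only
```

Requiring every sample in the window to exceed the threshold is the simplest way to avoid flapping; averaging over the window is a common alternative that reacts faster but tolerates dips.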

## Compliance and Audit

```
LOG COMPLIANCE REQUIREMENTS
=============================

PCI-DSS (Payment Card Industry Data Security Standard):

  Requirement 10: Log and monitor all access to system components and cardholder data
  - Maintain audit trail for all system components
  - Log all access to system components and cardholder data
  - Log all actions taken by any individual with root or administrative privileges
  - Log all access to the audit trails themselves (detect log tampering)
  - Review logs for anomalies daily
  - Retain audit trail history for at least 1 year (3 months immediately available)
  - Implement automated log review tools
  - Synchronize clocks on all system components (e.g., NTP); treat 100ms accuracy as an internal target, since the standard does not set a numeric tolerance
  - Protect audit trails from modification

  Estimated log volume: 500MB–5GB/day for PCI scope systems

HIPAA (Health Insurance Portability and Accountability Act):

  Audit Controls (45 CFR 164.312(b)):
  - Implement hardware, software, and/or procedural mechanisms to record and examine activity in information systems
  - Log: access, creation, modification, deletion, and transmission of ePHI
  - Retain logs for 6 years minimum (the documentation retention period in 45 CFR 164.316(b)(2)(i), commonly applied to audit logs)
  - Ensure logs are tamper-evident or tamper-proof
  - Monitor and review logs regularly

  Estimated log volume: 1GB–10GB/day for healthcare systems

SOX (Sarbanes-Oxley Act):

  Section 404: Internal controls over financial reporting
  - Log all access to financial systems
  - Track data changes in financial databases
  - Maintain immutable audit trails
  - Review access permissions quarterly
  - Retain logs for 7 years minimum

GDPR (General Data Protection Regulation):

  Article 30: Records of processing activities
  - Log data processing activities
  - Track data subject access requests
  - Document all personal data breaches (Article 33); notify the supervisory authority within 72 hours of becoming aware
  - Retain processing records as required by member state law

SOC 2 Type II:

  - Trust Services Criteria: Security (the Common Criteria), Availability, Processing Integrity, Confidentiality, Privacy
  - Log all access to systems (successful and unsuccessful)
  - Retain logs for minimum 1 year
  - Demonstrate log review procedures to auditors
  - Show evidence of anomaly detection and response
```
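
When a system falls under several frameworks, a common policy is to apply the longest retention period among them. The sketch below encodes the retention figures listed above as a lookup. The "longest wins" rule, the dictionary, and the function name are assumptions for illustration, and the figures should be confirmed against the regulation text and auditor guidance before use.

```python
# Minimum retention in days per framework, as listed in the section above.
RETENTION_DAYS = {
    "PCI-DSS": 365,
    "SOC 2": 365,
    "HIPAA": 6 * 365,
    "SOX": 7 * 365,
}


def required_retention_days(frameworks) -> int:
    """Return the longest retention period among the applicable frameworks."""
    frameworks = list(frameworks)
    unknown = set(frameworks).difference(RETENTION_DAYS)
    if unknown:
        raise ValueError(f"no retention rule for: {sorted(unknown)}")
    return max((RETENTION_DAYS[name] for name in frameworks), default=0)


# A hypothetical payment system that also stores health data.
print(required_retention_days(["PCI-DSS", "HIPAA"]))
```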

## Integration Points

- **ELK Stack** (Elasticsearch, Logstash, Kibana): Open-source log management; Elasticsearch for indexing/search, Logstash for parsing/transforming, Kibana for visualization
- **Splunk**: Enterprise log management; powerful search language (SPL), machine learning toolkit, security apps (Splunk ES)
- **Datadog Logs**: Cloud-native log management; integrated with APM, infrastructure monitoring, synthetic monitoring
- **Sumo Logic**: Cloud-native log analytics; pre-built parsers, ML-powered anomaly detection, security analytics
- **Grafana Loki**: Lightweight log aggregation; designed to work with Prometheus metrics; low storage cost
- **Cloud-native** (CloudWatch Logs, Azure Monitor Logs, GCP Logging): Platform-specific log management; no agent needed for managed services
- **Fluentd/Fluent Bit**: Universal log collectors; 1000+ plugins; widely used in Kubernetes
- **Kafka**: Message queue for log pipeline; handles high throughput; enables multiple consumers
- **Graylog**: Open-source log management; built on Elasticsearch or OpenSearch, with MongoDB for configuration; enterprise features available
- **PagerDuty/Opsgenie**: Alerting and incident management; integrate with log platforms for automated alerting

## Edge Cases

- **High-volume log sources** (load balancers, CDN, API gateways generating 100GB+/day): Sample routine traffic (e.g., 1 in 100 access log entries) while keeping 100% of errors; pre-aggregate at the source; use streaming analytics for real-time metrics; archive raw logs to S3 Glacier if forensic access may be needed; budget: $500–$5,000/month for high-volume log storage
  - Nginx access logs: sample at 1% for trend analysis, 100% for errors
  - API gateway logs: sample at 10% for normal traffic, 100% for 4xx/5xx responses
  - CloudTrail: use event selectors to log only the management and data events you need (can cut management event volume by 80% or more)

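- **Multi-region/global log aggregation**: Deploy regional log collectors to avoid cross-region data transfer costs; aggregate to a central index for unified search; account for data residency requirements (GDPR restricts transferring personal data outside the EU/EEA, so EU logs containing personal data usually stay in-region); compress and batch logs before cross-region shipping to reduce transfer cost and latency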
  - Regional collectors: 3–5 regions, each with Filebeat/Fluent Bit → regional Logstash
  - Central index: Elasticsearch cluster in primary region with cross-cluster search
  - Cross-region transfer cost: AWS charges roughly $0.02–$0.06/GB between regions, depending on the region pair

- **Log tampering detection**: Implement WORM (write-once-read-many) storage for audit logs; use a hash chain (blockchain-style) or Merkle tree for tamper-evident verification; alert on any log gap, sequence number skip, or backward timestamp; separate the log management system from production systems (air-gapped log server for critical compliance)
  - Sysmon with log forwarding to isolated SIEM
  - S3 Object Lock for cloud-based immutable log storage
  - Regular integrity checks: hash verification of log files

- **Log schema evolution**: As applications change, log formats change and break parsers; implement a schema registry for log events; use flexible parsing (regex with fallback); version log schemas; alert on parse failure rate increases (> 5% unparseable warrants investigation)
  - Use the OpenTelemetry logs data model for standardization
  - Maintain backward-compatible log formats
  - Test parser changes in staging before production deployment

- **Cost runaway prevention**: Cloud log ingestion can spike unexpectedly (debug logging enabled in production, log loop in application); implement ingestion budgets with hard limits; alert on daily spend exceeding 110% of forecast; use auto-sampling to reduce volume automatically when cost thresholds are hit
  - Set Datadog/Splunk ingestion cap at 120% of monthly budget
  - Implement log level governance: ERROR and WARN always, INFO sampled, DEBUG only in dev/staging
  - Monthly cost review: top 10 log sources by volume and cost

- **Real-time vs. batch log processing**: Security incidents need real-time log analysis (< 10 seconds); compliance reporting can use batch processing (daily); implement dual pipeline: hot path for real-time (Kafka → Elasticsearch) and cold path for archival (Kafka → S3); allocate resources accordingly
  - Real-time pipeline: process security, error, and P1 alert logs
  - Batch pipeline: aggregate metrics, generate daily reports, compliance evidence collection
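
The hash verification idea from the log tampering case can be sketched as a simple hash chain, where each entry's digest covers the previous digest plus the current line. This is a minimal illustration under assumptions: SHA-256 is chosen for the example, and `chain_logs` and `first_tampered_index` are hypothetical helper names, not part of any listed product.

```python
import hashlib

GENESIS = "0" * 64  # placeholder digest used before the first entry


def chain_logs(lines):
    """Return (line, digest) pairs where each digest covers the previous digest."""
    entries = []
    previous = GENESIS
    for line in lines:
        digest = hashlib.sha256((previous + line).encode("utf-8")).hexdigest()
        entries.append((line, digest))
        previous = digest
    return entries


def first_tampered_index(entries):
    """Return the index of the first entry that fails verification, or None."""
    previous = GENESIS
    for index, (line, digest) in enumerate(entries):
        expected = hashlib.sha256((previous + line).encode("utf-8")).hexdigest()
        if digest != expected:
            return index
        previous = digest
    return None


# Hypothetical audit log lines.
entries = chain_logs(["user=alice action=login", "user=alice action=export", "user=alice action=logout"])
print(first_tampered_index(entries))   # intact chain

tampered = list(entries)
tampered[1] = ("user=alice action=view", entries[1][1])  # edit a line, keep the old digest
print(first_tampered_index(tampered))  # modification detected at index 1
```

A chain like this only detects edits if the latest digest is stored somewhere the attacker cannot rewrite, for example in WORM storage or S3 Object Lock, since anyone able to replace the whole file could recompute every digest.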
