Log Management Analysis
Centralize, collect, store, analyze, and derive operational intelligence from logs across the entire IT infrastructure. Use when implementing log management strategy, setting up centralized logging, investigating incidents through log analysis, creating log...
Log Management & Analysis
Centralize and analyze logs across all systems for operations, security, and compliance.
Workflow
- Inventory all log sources across infrastructure (servers, applications, databases, network devices, cloud services, security tools).
- Design log collection architecture: agents, protocols, formats, and pipeline reliability.
- Deploy collection agents (Filebeat, Fluentd, Logstash, CloudWatch Agent) on all systems.
- Configure log parsing, enrichment, and normalization into structured format (JSON).
- Ingest logs into centralized platform (ELK, Splunk, Datadog, cloud-native).
- Implement retention policies: hot (30 days), warm (90 days), cold (1 year), archive (7 years).
- Build dashboards, alerts, and saved searches for operations and security teams.
- Establish log-based SLAs: ingestion latency < 10 seconds, query response < 5 seconds.
- Conduct regular log audit: coverage verification, parser accuracy, cost optimization.
- Optimize storage: log sampling, compression, tiered storage, field pruning.
Log Sources and Collection
Infrastructure Log Sources
LOG SOURCE INVENTORY
=====================
Operating System Logs:
Linux (RHEL/CentOS/Ubuntu):
/var/log/syslog or /var/log/messages — General system messages
/var/log/auth.log or /var/log/secure — Authentication events (sshd, sudo)
/var/log/kern.log — Kernel messages
/var/log/dmesg — Boot-time kernel messages
/var/log/cron — Scheduled task execution
/var/log/daemon.log — Background service logs
/var/log/boot.log — Boot process log
/var/log/faillog — Failed login attempts
/var/log/wtmp and /var/log/btmp — Login/logout records (wtmp) and failed login attempts (btmp); binary files read with last and lastb
Windows:
Application Event Log — Application-generated events
Security Event Log — Login attempts, access control (SIEM critical)
System Event Log — OS-level events, driver issues
Setup Event Log — Installation/uninstallation events
Forwarded Events — Centralized from other Windows servers
PowerShell transcript logs — Script execution auditing
IIS logs — Web server access/error logs (W3C format)
Log volume estimate:
- Typical Linux server: 50MB–500MB/day
- Typical Windows server: 100MB–1GB/day
- With verbose debugging: 5GB–20GB/day (avoid in production)
Application Log Sources
APPLICATION LOG SOURCES
========================
Web Applications:
Access logs (Nginx/Apache/IIS):
Format: Combined Log Format or CLF
Fields: IP, timestamp, method, URL, status code, response size, referer, user-agent
Volume: 10,000–1,000,000+ entries per hour depending on traffic
Use cases: Traffic analysis, error rate tracking, geographic distribution
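The access-log fields listed above can be extracted with a single regular expression. The following is a minimal sketch, assuming the standard Combined Log Format layout; the function name `parse_access_line` and the sample line are illustrative and not part of any particular tool.

```python
import re

# Pattern for the Combined Log Format as commonly emitted by Nginx/Apache.
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_access_line(line):
    """Return a dict of access-log fields, or None if the line does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" size means no response body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

sample = ('203.0.113.7 - alice [10/Oct/2024:13:55:36 +0000] '
          '"GET /api/orders HTTP/1.1" 200 2326 '
          '"https://example.com/" "Mozilla/5.0"')
print(parse_access_line(sample))
```

Lines that return None are candidates for a dead letter queue rather than silent drops.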
Application framework logs:
Java (Log4j/Logback): DEBUG, INFO, WARN, ERROR levels; structured JSON output
.NET (NLog/Serilog): Structured logging with correlation IDs
Python (logging, structlog): JSON-formatted with context fields
Node.js (Winston, Pino): High-performance structured logging
Volume: 100MB–10GB/day depending on log level and traffic
API Gateways:
AWS API Gateway: CloudWatch Logs (request/response metrics + full logs)
Kong/Envoy: Access logs with latency, status, upstream service
Azure API Management: Diagnostics logs
Volume: 50MB–5GB/day per gateway
Database Logs:
PostgreSQL: Slow query log, checkpoint, connection, replication
MySQL: General query log, slow query log, error log, binary log
SQL Server: Error log, default trace, extended events, agent job history
MongoDB: Diagnostic log, slow query profiler, replication log
Volume: 100MB–20GB/day (slow query logs especially valuable)
Microservices/Container Logs:
stdout/stderr from containers (collected by Docker/containerd logging driver)
Sidecar containers (Fluentd, Fluent Bit) in Kubernetes
EFK stack (Elasticsearch, Fluentd, Kibana) or ELK (Logstash) pattern
Volume: 10MB–500MB/day per service instance
Security and Compliance Log Sources
SECURITY LOG SOURCES
=====================
Firewall Logs:
Cisco ASA/Firepower: Connection logs, threat prevention, malware detection
Palo Alto Networks: Traffic, threat, URL filtering, user identification logs
Fortinet: Traffic, event, security logs
Volume: 100MB–10GB/day; critical for SOC operations
IDS/IPS:
Snort/Suricata: Alert logs with signature matches
Zeek: Connection, DNS, HTTP, SSL/TLS, and X.509 certificate metadata
Volume: 500MB–20GB/day on perimeter
Endpoint Security:
CrowdStrike Falcon: Process execution, network connections, file modifications
SentinelOne: EDR events, threat detection, behavioral analysis
Defender for Endpoint: Attack surface reduction, device events
Volume: 10MB–500MB/day per endpoint
Identity/Authentication:
Active Directory: Logon/logoff, group membership changes, password resets
Okta/Azure AD: Sign-in logs, token issuance, conditional access evaluations
Volume: 1–50MB/day per authentication source
Cloud Audit Logs:
AWS CloudTrail: API calls, management events, data events (S3 access)
Azure Activity Log: Resource management operations
GCP Audit Logs: Admin, data, system, policy activity
Volume: 50MB–5GB/day (S3 data events can be massive — use selective logging)
Log Collection Architecture
LOG COLLECTION ARCHITECTURE
============================
Collection Pattern 1: Agent-Based (Recommended for servers)
Server → Log Agent (Filebeat/Fluent Bit) → Log Shipper (Logstash/Fluentd) → Broker (Kafka/Redis) → Index (Elasticsearch/S3)
Pros: Reliable, handles network issues, buffer locally, can enrich at source
Cons: Agent maintenance on each server, version management
Best for: On-prem servers, EC2 instances, VMs, hybrid environments
Collection Pattern 2: Cloud-Native (Recommended for cloud environments)
Cloud Service → CloudWatch Logs / Azure Monitor / Cloud Logging → Kinesis / Event Hub → Index
Pros: No agents needed, integrated billing, automatic scaling
Cons: Vendor lock-in, limited parsing capabilities, expensive at scale
Best for: Fully cloud-native workloads
Collection Pattern 3: Sidecar Pattern (Recommended for containers/K8s)
Pod [App + Log Sidecar] → DaemonSet Collector → Message Queue → Index
Pros: Container-aware, survives pod restarts, no app changes needed
Cons: Additional resource usage per pod, complexity
Best for: Kubernetes environments
Collection Pattern 4: Syslog Forwarding (For network devices)
Router/Switch/Firewall → Syslog Server (rsyslog) → Log Processor → Index
Pros: Standard protocol, no agent installation needed
Cons: Unencrypted by default (use syslog over TLS, RFC 5425); UDP transport is lossy and provides no buffering
Best for: Network devices, legacy systems
Log pipeline reliability requirements:
- Zero data loss: persistent queue between collector and shipper
- Retry logic: exponential backoff for temporary failures
- Dead letter queue: capture unparseable logs for manual review
- Pipeline monitoring: track ingestion rate, lag, error rate
- Alerting: pipeline down, lag > 5 minutes, error rate > 1%
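The retry and dead-letter requirements above can be sketched as follows. This is an illustration under stated assumptions, not a production shipper: `process_line`, the `parse` and `send` callables, and the in-memory `dead_letters` list are hypothetical stand-ins for a real parser, broker client, and durable queue.

```python
import json
import time

def process_line(raw, parse, send, dead_letters,
                 max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Parse one raw line and ship it, retrying transient send failures."""
    try:
        record = parse(raw)
    except ValueError:
        # Unparseable input is kept for manual review instead of being dropped.
        dead_letters.append(raw)
        return "dead_letter"
    for attempt in range(max_attempts):
        try:
            send(record)
            return "sent"
        except ConnectionError:
            if attempt < max_attempts - 1:
                # Exponential backoff: base_delay, then 2x, 4x, 8x ...
                sleep(base_delay * 2 ** attempt)
    return "retry_exhausted"

# Example: a line that is not valid JSON lands in the dead letter list.
dead = []
print(process_line("not json", json.loads, print, dead), dead)
```

A caller that receives "retry_exhausted" would typically leave the record in the persistent queue so it is retried after the outage ends.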
Log format standardization:
Recommended fields for all logs (normalized):
timestamp (ISO 8601)
level (DEBUG, INFO, WARN, ERROR, FATAL)
message (free-text description)
source (hostname, service name, IP)
service (application/service identifier)
request_id / trace_id (correlation across services)
user_id / username (if applicable)
environment (prod, staging, dev)
version (application version)
metadata (key-value pairs for context)
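As an illustration of the normalized field set above, the sketch below builds one JSON log line. The function name `normalize` and the example values are assumptions for demonstration; the field names follow the recommended list.

```python
import json
from datetime import datetime, timezone

def normalize(message, level, service, source, environment, version,
              trace_id=None, user_id=None, now=None, **metadata):
    """Build one JSON log line carrying the recommended normalized fields."""
    now = now or datetime.now(timezone.utc)
    record = {
        "timestamp": now.isoformat(),   # ISO 8601, UTC
        "level": level.upper(),
        "message": message,
        "source": source,
        "service": service,
        "trace_id": trace_id,
        "user_id": user_id,
        "environment": environment,
        "version": version,
        "metadata": metadata,           # free-form key-value context
    }
    return json.dumps(record)

print(normalize("payment declined", "warn", "payment-service", "web-01",
                "prod", "2.4.1", trace_id="abc123", gateway="example-gateway"))
```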
Log Storage and Retention
LOG STORAGE AND RETENTION POLICY
==================================
Tiered storage architecture:
Tier 1 — Hot Storage (immediate access, indexed):
Retention: 30 days
Technology: Elasticsearch cluster, Splunk Hot DB, Datadog Logs
Performance: < 1 second query response
Cost: $3–$8 per GB/month
Content: Current production logs, active investigation data
Tier 2 — Warm Storage (near-line, compressed):
Retention: 30–90 days
Technology: Elasticsearch warm nodes, Splunk Warm/Cold DB, S3 + Athena
Performance: 5–30 second query response
Cost: $0.50–$2 per GB/month
Content: Recent historical data, trend analysis
Tier 3 — Cold Storage (long-term, archived):
Retention: 1–2 years
Technology: S3 Glacier, Azure Archive, GCP Nearline, Azure Cold Blob
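Performance: 3–12 hour retrieval time for offline archive classes such as S3 Glacier Flexible Retrieval (Azure Archive can take up to 15 hours); GCP Nearline and Azure Cold Blob are online tiers with millisecond access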
Cost: $0.01–$0.05 per GB/month
Content: Compliance archives, long-term trend data
Tier 4 — Legal Hold (indefinite):
Retention: 7+ years or per regulation
Technology: Immutable storage (S3 Object Lock, WORM compliance)
Cost: $0.01–$0.03 per GB/month
Content: Financial audit logs, security incident evidence, regulatory data
Retention by log type (minimum requirements):
Security/audit logs: 1 year minimum (hot + warm), 7 years archive (compliance)
Application error logs: 90 days (covers release cycles + support period)
Application access logs: 30 days hot, 1 year archive
System/OS logs: 30 days (unless needed for specific compliance)
Network logs: 90 days (for security investigation window)
Database audit logs: 1 year (PCI-DSS requires 1 year minimum)
Cloud audit logs: 90 days minimum (CloudTrail/Activity Log)
Debug/trace logs: 7 days (high volume, short-term troubleshooting only)
Storage cost estimation:
Medium company (500 servers, 50 services):
Daily log volume: 500GB–2TB/day
Monthly volume: 15TB–60TB/month
Annual volume: 180TB–720TB/year
Hot storage (30 days, 15TB): $45K–$120K/month → $540K–$1.44M/year
Warm storage (60 days, 30TB): $15K–$60K/month → $180K–$720K/year
Cold storage (365 days): $1.5K–$6K/month → $18K–$72K/year
Total annual log storage: $738K–$2.23M (varies significantly by platform)
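The arithmetic behind this estimate can be reproduced in a few lines. The sketch assumes decimal units (1TB = 1,000GB) and takes the cold-storage monthly figures as given, since the estimate does not state the cold volume; `monthly_cost` is an illustrative helper.

```python
def monthly_cost(volume_gb, rate_low, rate_high):
    """Return (low, high) monthly cost in dollars for a stored volume."""
    return volume_gb * rate_low, volume_gb * rate_high

hot = monthly_cost(15_000, 3.00, 8.00)    # 15TB at $3-$8 per GB/month
warm = monthly_cost(30_000, 0.50, 2.00)   # 30TB at $0.50-$2 per GB/month
cold = (1_500, 6_000)                     # monthly figures taken from the estimate

annual_low = 12 * (hot[0] + warm[0] + cold[0])
annual_high = 12 * (hot[1] + warm[1] + cold[1])
print(f"${annual_low:,.0f} to ${annual_high:,.0f} per year")
# prints "$738,000 to $2,232,000 per year"
```

Note that the estimate uses the low end of the volume range (15TB/month); at 60TB/month the hot and warm figures scale by four.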
Cost optimization:
- Log sampling for DEBUG/INFO levels (1 in 100) → 99% volume reduction for the sampled levels (WARN and ERROR are kept in full)
- Field pruning: extract and index only needed fields
- Compression: Snappy, Zstd (5:1–10:1 ratio on text logs)
- Deduplication: collapse repeated log messages
- Auto-delete non-production logs after 7 days
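One way to implement the 1-in-100 sampling above is deterministic hashing, so that every event sharing a trace ID is kept or dropped together. This is a sketch under assumptions: the record layout and the `keep` function are hypothetical, and hashing the trace ID is a design choice rather than something a specific platform mandates.

```python
import hashlib

ALWAYS_KEEP = {"WARN", "ERROR", "FATAL"}

def keep(record, sample_rate=100):
    """Keep all WARN and above; keep roughly 1 in sample_rate of the rest."""
    if record["level"] in ALWAYS_KEEP:
        return True
    # Hash the trace ID (or the message if there is none) so the decision is
    # stable across collectors. Falling back to the message means identical
    # messages are either all kept or all dropped.
    key = str(record.get("trace_id") or record["message"]).encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return bucket % sample_rate == 0

events = [{"level": "DEBUG", "message": "cache hit", "trace_id": f"t-{i}"}
          for i in range(10_000)]
print(sum(keep(e) for e in events), "of", len(events), "DEBUG events kept")
```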
Log Analysis and Investigation
LOG-BASED ANALYSIS FRAMEWORK
==============================
Operational analysis (daily monitoring):
Error rate dashboard:
- Application errors by service (group by service, level)
- Error rate trend (hourly, daily, weekly)
- Error correlation with deployments (overlay deployment timeline)
- Top error messages (frequency analysis)
- Alert: error rate > 1% of total requests, or > 10 errors/minute
Performance analysis:
- Response time percentiles (p50, p95, p99) from access logs
- Slow query analysis from database logs (queries > 1 second)
- Connection pool exhaustion events
- Memory usage correlation with error patterns
- Alert: p99 response time > 2 seconds, or 50% increase vs baseline
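The percentile figures above can be computed from response times parsed out of access logs. The sketch below uses the nearest-rank method, which is one of several percentile definitions; the sample latencies are made up for illustration.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [120, 95, 110, 2300, 130, 105, 98, 140, 101, 99]
summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
print(summary)  # {'p50': 105, 'p95': 2300, 'p99': 2300}

# Threshold from the alert rule above: p99 above 2 seconds.
p99_alert = summary["p99"] > 2000
```

With only ten samples, p95 and p99 both land on the single outlier; meaningful tail percentiles need far larger windows.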
Availability monitoring:
- Service health from application heartbeat logs
- Upstream dependency failures from error logs
- Database connection failures
- DNS resolution errors
- Alert: service health check failures > 3 consecutive
Security analysis (continuous monitoring):
Authentication anomalies:
- Failed login attempts > 5 per minute per IP (brute force detection)
- Login from unusual geographic location
- Login at unusual time (outside business hours)
- Privilege escalation attempts (sudo failures, admin access from non-admin)
- Alert: any of above patterns
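The brute-force rule above (more than 5 failed logins per minute per IP) can be sketched with a sliding window. The input shape, a time-sorted list of `(epoch_seconds, ip)` tuples for failed logins, and the function name are assumptions for illustration.

```python
from collections import defaultdict, deque

def brute_force_ips(events, threshold=5, window_seconds=60):
    """Return IPs with more than `threshold` failures inside any sliding window.

    events: iterable of (epoch_seconds, ip) for failed logins, sorted by time.
    """
    recent = defaultdict(deque)
    flagged = set()
    for ts, ip in events:
        window = recent[ip]
        window.append(ts)
        # Drop failures that fell out of the window.
        while ts - window[0] >= window_seconds:
            window.popleft()
        if len(window) > threshold:
            flagged.add(ip)
    return flagged

fast = [(t, "198.51.100.9") for t in range(0, 60, 10)]   # 6 failures in 50s
slow = [(t, "203.0.113.5") for t in range(0, 180, 30)]   # 1 failure per 30s
print(brute_force_ips(sorted(fast + slow)))
```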
Data access patterns:
- Unusual data download volumes (potential exfiltration)
- Access to sensitive endpoints from unauthorized IPs
- Database queries accessing sensitive tables (PII, financial data)
- API key usage anomalies (new IP, volume spike)
- Alert: data access volume > 3 standard deviations from baseline
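The three-standard-deviation rule above reduces to a short check. This sketch assumes a baseline of per-day download volumes for one user or API key; the numbers are invented, and a simple mean/stdev baseline is only one of several ways to model normal behavior.

```python
import statistics

def is_anomalous(baseline, observed, sigmas=3.0):
    """True if observed exceeds the baseline mean by more than `sigmas` sample stdevs."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline points")
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)
    return observed > mean + sigmas * stdev

daily_download_mb = [100, 110, 95, 105, 90, 100, 100]
print(is_anomalous(daily_download_mb, 150))  # True
print(is_anomalous(daily_download_mb, 115))  # False
```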
Network security:
- Port scanning patterns (multiple ports from single IP in short time)
- DNS tunneling detection (excessive DNS queries with encoded data)
- Unencrypted protocol usage (HTTP, FTP, Telnet)
- Firewall block rate increase
- Alert: any port scan pattern, DNS tunneling signature
Incident investigation workflow:
Step 1: Identify the scope
- What time range? (when was issue first reported)
- Which systems affected? (affected services, servers, regions)
- What is the impact? (users affected, revenue impact)
Step 2: Gather evidence
- Search logs from affected systems for error patterns
- Follow correlation IDs/trace IDs across services
- Check deployment history (was there a recent change?)
- Review infrastructure changes (scaling, configuration)
Step 3: Root cause analysis
- Identify the first error in the chain
- Trace dependency failures
- Correlate with external events (DDoS, cloud outage, certificate expiry)
- Review code changes (git log, PR history)
Step 4: Documentation
- Document timeline with timestamps
- Capture key log snippets as evidence
- Create incident report with RCA
- Update runbooks and detection rules
Common log queries (saved searches):
# Find all errors for a service in last hour
service:payment-service level:error | stats count by message | sort -count
# Find failed logins in last 15 minutes
source:auth level:error "failed login" | stats count by ip, username | sort -count
# Track deployment correlation with errors
(deployment OR error) service:api | timespan 5m | stats count by type
# Identify slow queries
source:database query_time:>1000 | stats avg(query_time), count by query_hash | sort -avg(query_time)
# Detect data exfiltration pattern
source:firewall action:allow bytes_out:>100000000 | stats sum(bytes_out) by dst_ip | sort -sum(bytes_out)
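The saved searches above are written in a generic pipe-style query syntax. For readers without such a platform at hand, the first query (count errors by message, sorted descending) can be approximated over already-parsed records as follows; the record keys are assumed to match the normalized fields, and `top_errors` is an illustrative name.

```python
from collections import Counter

def top_errors(records, service, limit=10):
    """Equivalent of: service:<service> level:error | stats count by message | sort -count"""
    counts = Counter(
        r["message"] for r in records
        if r.get("service") == service and r.get("level") == "error"
    )
    return counts.most_common(limit)

records = [
    {"service": "payment-service", "level": "error", "message": "timeout"},
    {"service": "payment-service", "level": "error", "message": "timeout"},
    {"service": "payment-service", "level": "error", "message": "card declined"},
    {"service": "payment-service", "level": "info", "message": "ok"},
    {"service": "api", "level": "error", "message": "timeout"},
]
print(top_errors(records, "payment-service"))
# [('timeout', 2), ('card declined', 1)]
```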
Log-Based Alerting
LOG-BASED ALERTING CONFIGURATION
===================================
Alert rules (examples by priority):
P1 — Critical (page on-call immediately):
- Production service error rate > 5% for 2 minutes
- Authentication breach detected (successful login after 50+ failures)
- Data loss event detected (database corruption, RDS failover)
- Ransomware indicators (mass file encryption in logs)
- Channel: PagerDuty/Opsgenie → phone call + SMS + email
- Response SLA: 5 minutes
P2 — High (notify team within 15 minutes):
- Production service error rate > 1% for 5 minutes
- Database connection pool at 90% capacity
- Disk usage > 85% on production servers
- SSL certificate expiring within 7 days
- Cloud cost anomaly (> 20% above baseline)
- Channel: Slack/Teams #incidents channel + email
- Response SLA: 15 minutes
P3 — Medium (notify team within 1 hour):
- Staging environment errors
- Non-critical service degraded performance
- Unusual but not alarming security events
- Log pipeline lag > 5 minutes
- Channel: Slack/Teams team channel
- Response SLA: 1 hour
P4 — Low (daily summary):
- Non-production warning logs
- Informational compliance events
- Capacity trends approaching thresholds
- Channel: Daily digest email
- Response SLA: Next business day
Alert tuning best practices:
- Require minimum duration before alerting (avoid flapping)
- Implement alert deduplication (same root cause = single alert)
- Use runbook links in every alert
- Track alert-to-action ratio (goal: > 80% true positive)
- Review and tune alerts monthly (disable stale alerts)
- Never alert on metrics that cannot be acted upon
- Implement alert escalation (P1: 5min → 15min → 30min escalation)
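The "minimum duration before alerting" practice above can be expressed as a check over the most recent samples. This is a sketch assuming one error-rate sample per minute; the function name is hypothetical.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """True only if the last `min_consecutive` samples all exceed the threshold."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])

# P2 rule from the list above: error rate > 1% for 5 minutes.
error_rate_pct = [0.5, 1.2, 1.3, 1.1, 1.4, 1.6]
print(sustained_breach(error_rate_pct, threshold=1.0, min_consecutive=5))  # True
```

A single dip below the threshold resets the condition, which is what suppresses flapping alerts.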
Compliance and Audit
LOG COMPLIANCE REQUIREMENTS
=============================
PCI-DSS (Payment Card Industry Data Security Standard):
Requirement 10: Track and monitor all access
- Maintain audit trail for all system components
- Log all access to system components and cardholder data
- Log all actions taken by any individual with root or administrative privileges
- Log all access to the audit trails themselves (detect log tampering)
- Review logs for anomalies daily
- Retain audit trail history for at least 1 year (3 months immediately available)
- Implement automated log review tools
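- Synchronize clocks across all systems (e.g., NTP); the standard requires time synchronization but does not prescribe an accuracy, so treat 100ms as an internal target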
- Protect audit trails from modification
Estimated log volume: 500MB–5GB/day for PCI scope systems
HIPAA (Health Insurance Portability and Accountability Act):
Audit Controls (45 CFR 164.312(b)):
- Implement hardware, software, and/or procedural mechanisms to record and examine activity in information systems
- Log: access, creation, modification, deletion, and transmission of ePHI
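- Retain logs for 6 years minimum (the HIPAA documentation retention period in 45 CFR 164.316(b)(2)(i), commonly applied to audit logs)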
- Ensure logs are tamper-evident or tamper-proof
- Monitor and review logs regularly
Estimated log volume: 1GB–10GB/day for healthcare systems
SOX (Sarbanes-Oxley Act):
Section 404: Internal controls over financial reporting
- Log all access to financial systems
- Track data changes in financial databases
- Maintain immutable audit trails
- Review access permissions quarterly
- Retain logs for 7 years minimum
GDPR (General Data Protection Regulation):
Article 30: Records of processing activities
- Log data processing activities
- Track data subject access requests
- Document all personal data breaches and notify the supervisory authority within 72 hours of becoming aware (Article 33)
- Retain processing records as required by member state law
SOC 2 Type II:
- Trust Services Criteria: Security (the Common Criteria), Availability, Processing Integrity, Confidentiality, Privacy
- Log all access to systems (successful and unsuccessful)
- Retain logs for minimum 1 year
- Demonstrate log review procedures to auditors
- Show evidence of anomaly detection and response
Integration Points
- ELK Stack (Elasticsearch, Logstash, Kibana): Open-source log management; Elasticsearch for indexing/search, Logstash for parsing/transforming, Kibana for visualization
- Splunk: Enterprise log management; powerful search language (SPL), machine learning toolkit, security apps (Splunk ES)
- Datadog Logs: Cloud-native log management; integrated with APM, infrastructure monitoring, synthetic monitoring
- Sumo Logic: Cloud-native log analytics; pre-built parsers, ML-powered anomaly detection, security analytics
- Grafana Loki: Lightweight log aggregation; designed to work with Prometheus metrics; low storage cost
- Cloud-native (CloudWatch Logs, Azure Monitor Logs, GCP Logging): Platform-specific log management; no agent needed for managed services
- Fluentd/Fluent Bit: Universal log collectors; 1000+ plugins; widely used in Kubernetes
- Kafka: Message queue for log pipeline; handles high throughput; enables multiple consumers
- Graylog: Open-source log management; built on Elasticsearch; enterprise features available
- PagerDuty/Opsgenie: Alerting and incident management; integrate with log platforms for automated alerting
Edge Cases
- High-volume log sources (load balancers, CDN, API gateways generating 100GB+/day): Use log sampling (e.g., 1 in 100 for access logs while keeping 100% of errors); implement pre-aggregation at source; use streaming analytics for real-time metrics; archive raw logs to S3 Glacier for forensic access if needed; budget: $500–$5,000/month for high-volume log storage
- Nginx access logs: sample at 1% for trend analysis, 100% for errors
- API gateway logs: sample at 10% for normal traffic, 100% for 4xx/5xx responses
- CloudTrail: use Event Selectors to log only specific APIs (reduces 80%+ of management events)
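- Multi-region/global log aggregation: Deploy regional log collectors to avoid cross-region data transfer costs; aggregate to central index for unified search; consider data residency requirements (GDPR restricts transfers of personal data outside the EU, so EU logs containing personal data are usually kept in-region); compress and batch logs before shipping across regions to reduce transfer cost and latency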
- Regional collectors: 3–5 regions, each with Filebeat/Fluent Bit → regional Logstash
- Central index: Elasticsearch cluster in primary region with cross-cluster search
- Cross-region transfer cost: AWS charges $0.02–$0.06/GB between regions
- Log tampering detection: Implement WORM (write-once-read-many) storage for audit logs; use blockchain or Merkle tree for tamper-evident verification; alert on any log gap, sequence number skip, or backward timestamp; separate log management system from production systems (air-gapped log server for critical compliance)
- Sysmon with log forwarding to isolated SIEM
- S3 Object Lock for cloud-based immutable log storage
- Regular integrity checks: hash verification of log files
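Hash verification of log files can be made order-sensitive by chaining: each line's hash covers the previous hash, so an edit, insertion, or deletion changes every later value. The sketch below is a simplified hash chain, not a full Merkle tree, and assumes the recorded hashes are stored somewhere the attacker cannot rewrite (for example, WORM storage). Function names are illustrative.

```python
import hashlib

GENESIS = "0" * 64  # fixed starting value for the chain

def chain_hashes(lines):
    """Return the running SHA-256 chain, one hex digest per log line."""
    prev = GENESIS
    hashes = []
    for line in lines:
        prev = hashlib.sha256((prev + line).encode("utf-8")).hexdigest()
        hashes.append(prev)
    return hashes

def first_tampered_index(lines, recorded_hashes):
    """Index of the first line that disagrees with the recorded chain, else None."""
    for i, (expected, actual) in enumerate(zip(recorded_hashes, chain_hashes(lines))):
        if expected != actual:
            return i
    if len(lines) != len(recorded_hashes):
        # Lines were appended or truncated after the hashes were recorded.
        return min(len(lines), len(recorded_hashes))
    return None

original = ["user=alice action=login", "user=alice action=export", "user=alice action=logout"]
recorded = chain_hashes(original)
tampered = [original[0], "user=alice action=view", original[2]]
print(first_tampered_index(tampered, recorded))  # 1
```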
- Log schema evolution: As applications change, log formats change and break parsers; implement schema registry for log events; use flexible parsing (regex with fallback); version log schemas; alert on parse failure rate increase (> 5% unparseable = investigation needed)
- Use OpenTelemetry log schema for standardization
- Maintain backward-compatible log formats
- Test parser changes in staging before production deployment
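Flexible parsing with a fallback, plus the parse-failure-rate signal mentioned above, might look like the following. This is a sketch under assumptions: JSON is tried first and `key=value` pairs second; real pipelines usually have more formats, and the function names are hypothetical.

```python
import json
import re

KV_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_line(line):
    """Try JSON first, then key=value pairs; return None if both fail."""
    try:
        parsed = json.loads(line)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass
    pairs = KV_RE.findall(line)
    if pairs:
        return {key: value.strip('"') for key, value in pairs}
    return None

def parse_failure_rate(lines):
    """Fraction of lines that no parser could handle."""
    if not lines:
        return 0.0
    failures = sum(1 for line in lines if parse_line(line) is None)
    return failures / len(lines)

batch = ['{"level": "error"}', 'level=warn msg="disk full"', "???"]
rate = parse_failure_rate(batch)
print(f"{rate:.0%} unparseable; investigate: {rate > 0.05}")
```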
- Cost runaway prevention: Cloud log ingestion can spike unexpectedly (debug logging enabled in production, log loop in application); implement ingestion budgets with hard limits; alert on daily spend exceeding 110% of forecast; use auto-sampling to reduce volume automatically when cost thresholds are hit
- Set Datadog/Splunk ingestion cap at 120% of monthly budget
- Implement log level governance: ERROR and WARN always, INFO sampled, DEBUG only in dev/staging
- Monthly cost review: top 10 log sources by volume and cost
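The two thresholds above (alert at 110% of the daily forecast, cap at 120% of the monthly budget) can be combined into one decision function. This is an illustrative sketch; how the cap is actually enforced depends on the platform, and `ingestion_action` is a hypothetical name.

```python
def ingestion_action(spend_today, forecast_today, month_to_date, monthly_budget):
    """Return 'cap', 'alert', or 'ok' for the current ingestion spend."""
    if month_to_date >= 1.20 * monthly_budget:
        return "cap"      # hard limit: stop or heavily sample ingestion
    if spend_today > 1.10 * forecast_today:
        return "alert"    # daily spend is running ahead of forecast
    return "ok"

print(ingestion_action(spend_today=150, forecast_today=100,
                       month_to_date=500, monthly_budget=3000))  # alert
```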
- Real-time vs. batch log processing: Security incidents need real-time log analysis (< 10 seconds); compliance reporting can use batch processing (daily); implement dual pipeline: hot path for real-time (Kafka → Elasticsearch) and cold path for archival (Kafka → S3); allocate resources accordingly
- Real-time pipeline: process security, error, and P1 alert logs
- Batch pipeline: aggregate metrics, generate daily reports, compliance evidence collection