---
name: automation-orchestration
description: Design, build, and manage IT automation workflows and orchestration pipelines for infrastructure provisioning, incident response, configuration management, and operational tasks. Use when creating automation playbooks, building orchestration workflows, implementing infrastructure as code, automating operational tasks, managing configuration drift, or designing self-healing systems. Triggers on phrases like "automation", "orchestration", "infrastructure as code", "IaC", "automated remediation", "self-healing", "configuration management", "playbook", "Ansible", "Terraform automation".
---

# Automation & Orchestration

Design and implement automated workflows that reduce manual operations and enable self-healing infrastructure.

## Workflow

### 1. Automation Strategy & Framework

1. **Automation opportunity identification**:
   - Audit current manual processes (IT operations, security response, deployment, provisioning)
   - Score automation candidates by: frequency, time savings, error reduction, complexity
   - Prioritize: repetitive tasks (performed weekly or more often), high-error processes, critical-path operations
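   - Estimate ROI over a fixed period (for example, 12 months): (hours saved × hourly rate) / implementation cost; a ratio above 1 means the automation pays for itself within that period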

2. **Automation framework design**:
   - Define automation standards: coding standards, testing requirements, version control
   - Select automation tools by use case (Ansible for config, Terraform for infra, Python for custom)
   - Design automation library: reusable modules, shared templates, common patterns
   - Establish change management process for automation code

3. **Automation governance**:
   - Code review process for all automation playbooks
   - Testing requirements: unit tests, integration tests, dry-run validation
   - Rollback capability for all automation workflows
   - Monitoring and alerting for automation execution
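
The scoring and ROI steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed model: the `Candidate` fields, the 1 to 5 complexity scale, and the weighting in `priority_score` are all assumptions chosen for the example; only the `roi` ratio comes directly from the formula in step 1.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    runs_per_week: float    # how often the manual task is performed
    minutes_per_run: float  # manual effort per execution
    error_rate: float       # fraction of manual runs that go wrong (0..1)
    complexity: int         # assumed scale: 1 (trivial) .. 5 (very hard to automate)


def weekly_hours_saved(c: Candidate) -> float:
    """Manual hours per week that automation would remove."""
    return c.runs_per_week * c.minutes_per_run / 60.0


def priority_score(c: Candidate) -> float:
    """Hypothetical weighting: time saved, boosted by error rate, divided by complexity."""
    return weekly_hours_saved(c) * (1.0 + c.error_rate) / c.complexity


def roi(hours_saved: float, hourly_rate: float, implementation_cost: float) -> float:
    """Savings-to-cost ratio: (hours saved x hourly rate) / implementation cost."""
    return (hours_saved * hourly_rate) / implementation_cost
```

Sorting candidates with `sorted(candidates, key=priority_score, reverse=True)` gives a first-pass backlog; the weights should be tuned to local priorities before the ranking is trusted.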

### 2. Infrastructure as Code (IaC) Automation

1. **IaC development and management**:
   - Write infrastructure definitions (Terraform, CloudFormation, Pulumi)
   - Modular code structure: reusable modules for common patterns
   - Environment-specific variable files (dev, staging, production)
   - State management with locking and backup

2. **IaC CI/CD pipeline**:
   - Automated validation: `terraform plan`, syntax checking, policy compliance (OPA/Sentinel)
   - Security scanning: tfsec, Checkov
   - Cost estimation for proposed changes
   - Automated apply for approved changes
   - State drift detection and alerting

3. **Drift management**:
   - Scheduled drift detection scans (at least daily)
   - Auto-remediation for pre-approved drift categories (for example, configuration drift that can safely be reverted to the declared state)
   - Alert on unauthorized changes (manual modifications)
   - Drift resolution workflow with approval
   - Drift trend analysis for recurring issues
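
One way to separate auto-remediable drift from drift that needs approval is to inspect the JSON form of a plan (as produced by `terraform show -json <planfile>`), where each `resource_changes` entry carries an `address`, a `type`, and a `change.actions` list. The sketch below assumes that format; the `AUTO_REMEDIATE_TYPES` allowlist and the rule that destructive actions always require approval are hypothetical policy choices, not part of Terraform.

```python
import json

# Hypothetical allowlist: resource types whose drift may be reverted without approval.
AUTO_REMEDIATE_TYPES = frozenset({"aws_security_group_rule"})


def classify_drift(plan_json, auto_types=AUTO_REMEDIATE_TYPES):
    """Split changed resources in a Terraform JSON plan into two buckets."""
    plan = json.loads(plan_json)
    result = {"auto_remediate": [], "needs_approval": []}
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # "no-op" and "read" mean the live resource already matches the code.
        if not actions or actions in (["no-op"], ["read"]):
            continue
        # Assumed policy: never auto-apply anything that would destroy a resource.
        destructive = "delete" in actions
        if rc.get("type") in auto_types and not destructive:
            result["auto_remediate"].append(rc["address"])
        else:
            result["needs_approval"].append(rc["address"])
    return result
```

Addresses in `needs_approval` would feed the drift resolution workflow, while `auto_remediate` entries can be re-applied by the scheduled job.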

### 3. Configuration Management Automation

1. **Configuration baseline enforcement**:
   - Define desired state for all systems (OS, applications, security settings)
   - Automated configuration assessment (agent-based or agentless)
   - Auto-remediation for configuration drift
   - Compliance reporting against baselines

2. **Patch and update automation**:
   - Automated patch scanning and assessment
   - Staged deployment: test → staging → production
   - Maintenance window scheduling
   - Automated rollback on failed patches
   - Patch compliance reporting

3. **Application deployment automation**:
   - Zero-downtime deployment strategies (blue-green, canary, rolling)
   - Health check integration with deployment pipeline
   - Automated rollback on post-deployment failures
   - Configuration injection per environment
   - Deployment notification and logging
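
The staged deployment and automated rollback steps above can be sketched as a small promotion loop. The `deploy`, `is_healthy`, and `rollback` callables are hypothetical hooks that a real pipeline would back with its own tooling; the sketch also assumes that a failed health check rolls back only the failing stage and halts promotion, which is one reasonable policy among several.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    stages: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> Tuple[bool, List[str]]:
    """Promote a change stage by stage; stop and roll back on the first failed check."""
    log: List[str] = []
    for stage in stages:
        deploy(stage)
        log.append(f"deployed:{stage}")
        if not is_healthy(stage):
            log.append(f"unhealthy:{stage}")
            rollback(stage)  # revert only the stage that failed
            log.append(f"rolled_back:{stage}")
            return False, log  # later stages are never touched
    return True, log
```

Calling `staged_rollout(["test", "staging", "production"], ...)` returns a success flag plus an ordered event log that can be attached to the deployment notification.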

### 4. Incident Response Automation

1. **Automated incident playbooks**:
   - Define playbooks for common incident types (DDoS, malware, outage, data breach)
   - Automated containment: isolate systems, block IPs, disable accounts
   - Automated evidence collection: logs, screenshots, memory dumps
   - Auto-create incident tickets with full context
   - Escalation based on incident severity and resolution status

2. **Self-healing workflows**:
   - Auto-restart failed services after health check failure
   - Auto-scale during resource exhaustion
   - Auto-clear disk space (log rotation, temp file cleanup)
   - Auto-failover to standby systems
   - Auto-reprovision failed containers/pods

3. **Security response automation (SOAR)**:
   - Phishing response: auto-quarantine email, extract IOCs, scan organization
   - Malware response: isolate endpoint, scan connected systems, block C2 domains
   - Credential compromise: force password reset, revoke sessions, alert user
   - DDoS response: enable rate limiting, engage DDoS mitigation service
   - Alert fatigue reduction: auto-correlation and suppression of related alerts
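
Alert correlation and suppression can be illustrated with a simple time-window grouping. The alert shape here (a dict with a correlation `key` such as `"host:check"` and an epoch-seconds `ts`) and the 300-second default window are assumptions for the example; SOAR platforms typically offer richer correlation rules.

```python
from typing import Dict, List


def correlate(alerts: List[dict], window_seconds: int = 300) -> List[dict]:
    """Collapse alerts that share a key and arrive within one window of the first."""
    incidents: List[dict] = []
    open_by_key: Dict[str, dict] = {}
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        current = open_by_key.get(alert["key"])
        if current is not None and alert["ts"] - current["first_ts"] <= window_seconds:
            current["count"] += 1  # suppressed: folded into the existing incident
            continue
        # First alert for this key, or the previous window has expired.
        current = {"key": alert["key"], "first_ts": alert["ts"], "count": 1}
        open_by_key[alert["key"]] = current
        incidents.append(current)
    return incidents
```

Each returned incident represents one notification; its `count` records how many raw alerts were folded into it.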

### 5. Operational Task Automation

1. **Provisioning and deprovisioning**:
   - Employee onboarding: automated account creation, access assignment, equipment ordering
   - Employee offboarding: account disable, access revocation, data archiving
   - System provisioning: server builds, container deployments, database creation
   - Service decommissioning: data backup, DNS removal, resource cleanup

2. **Routine maintenance automation**:
   - Log rotation and archival (scheduled daily)
   - Database maintenance: vacuum, index rebuild, statistics update
   - Backup verification and integrity testing
   - Certificate renewal and deployment
   - Health check runs and reporting

3. **Report and notification automation**:
   - Scheduled report generation and distribution
   - Automated summary emails for operations teams
   - Dashboard refresh and data pipeline execution
   - Alert escalation and notification routing
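
As an example of a routine maintenance check, the sketch below selects certificates that are due for renewal. It assumes expiry timestamps have already been collected into a dict by some inventory step; the function name and the 30-day renewal window are hypothetical defaults, not values from this workflow.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def certs_due_for_renewal(
    expiries: Dict[str, datetime],
    now: datetime,
    renew_before: timedelta = timedelta(days=30),  # assumed renewal window
) -> List[str]:
    """Return certificate names expiring within the window, soonest first."""
    due = [
        (not_after, name)
        for name, not_after in expiries.items()
        if not_after - now <= renew_before  # already-expired certs are included
    ]
    return [name for _, name in sorted(due)]
```

A scheduled job can pass the result to the renewal and deployment steps and include it in the health check report.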

## Templates & Frameworks

### Automation Pipeline Structure

```
AUTOMATION PIPELINE — Infrastructure Change
============================================

Stage 1: Code Commit
  → Git push triggers CI pipeline
  → Linting and formatting validation
  → Unit tests execution

Stage 2: Plan & Validate
  → terraform plan (dry run)
  → Security policy scan (tfsec + OPA)
  → Cost impact estimation
  → Approval workflow (if production change)

Stage 3: Test Environment
  → Apply to test environment
  → Integration tests
  → Smoke tests against live system
  → Performance validation

Stage 4: Staging Environment
  → Apply to staging
  → Full regression test suite
  → Security scan of deployed infrastructure
  → Stakeholder approval

Stage 5: Production Deployment
  → Apply to production (maintenance window)
  → Health check validation
  → Monitoring alert review (30 min observation)
  → Success notification or auto-rollback

Stage 6: Post-Deployment
  → Update documentation
  → Verify drift detection baseline
  → Archive deployment artifacts
  → Generate change report
```
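
The gate between Stage 2 and the apply stages can be sketched around `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when changes are pending. The decision labels returned by `next_step` and the rule that only production requires approval are assumptions for illustration; `run_plan` needs the `terraform` binary on the PATH.

```python
import subprocess


def run_plan(workdir: str, plan_file: str = "tfplan") -> int:
    """Run `terraform plan` in workdir and return its exit code."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", f"-out={plan_file}"],
        cwd=workdir,
    )
    return proc.returncode


def next_step(plan_exit_code: int, environment: str, approved: bool) -> str:
    """Map a -detailed-exitcode result to a pipeline decision (labels are hypothetical)."""
    if plan_exit_code == 0:
        return "skip"  # plan succeeded, nothing to change
    if plan_exit_code == 2:
        if environment == "production" and not approved:
            return "await_approval"
        return "apply"  # plan succeeded with pending changes
    return "fail"  # 1, or any unexpected code, is treated as an error
```

Saving the plan with `-out` and applying that same file later keeps the applied change identical to the one that was reviewed.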

### Self-Healing Workflow Examples

```
SELF-HEALING WORKFLOWS
=======================

Service Down → Auto-Restart:
  Trigger: Health check failure (3 consecutive failures)
  Action: Restart service via systemd/container runtime
  Validation: Health check passes within 60 seconds
  Escalation: If restart fails after 3 attempts → page on-call engineer

Disk Space Critical (>90%):
  Trigger: Disk usage alert
  Action: Rotate logs → Clear temp files → Archive old data
  Validation: Disk usage drops below 80%
  Escalation: If disk remains >90% after cleanup → page on-call

Memory Pressure (>85% for 5 min):
  Trigger: Memory utilization alert
  Action: Identify top memory consumers → Gracefully restart non-critical services
  Validation: Memory drops below 75%
  Escalation: If OOM killer triggered → page on-call immediately

Database Connection Pool Exhaustion:
  Trigger: Connection pool at 95% capacity
  Action: Kill idle connections (>30 min) → Increase pool size by 20%
  Validation: Pool utilization drops below 70%
  Escalation: If pool exhaustion persists → page DBA on-call
```
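
The "Service Down → Auto-Restart" workflow above can be sketched as two small functions: one for the trigger and one for the bounded restart loop with escalation. The `restart`, `is_healthy`, and `page_oncall` callables are hypothetical hooks; `is_healthy` is assumed to perform its own wait (for example, polling for up to 60 seconds) before returning.

```python
from typing import Callable, Sequence


def should_trigger(recent_checks: Sequence[bool], threshold: int = 3) -> bool:
    """True when the most recent `threshold` health checks all failed."""
    tail = list(recent_checks)[-threshold:]
    return len(tail) == threshold and not any(tail)


def restart_with_escalation(
    restart: Callable[[], None],
    is_healthy: Callable[[], bool],
    page_oncall: Callable[[str], None],
    max_attempts: int = 3,
) -> bool:
    """Restart up to max_attempts times; page on-call if the service never recovers."""
    for _ in range(max_attempts):
        restart()
        if is_healthy():  # validation step after each restart
            return True
    page_oncall(f"service still unhealthy after {max_attempts} restart attempts")
    return False
```

Capping the attempts keeps a crash-looping service from being restarted indefinitely and hands the problem to a human once automation has run out of safe options.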

## Integration Points

- IaC tools (Terraform, CloudFormation, Pulumi, Crossplane): Infrastructure provisioning
- Configuration management (Ansible, Puppet, Chef, SaltStack): System configuration
- CI/CD platforms (Jenkins, GitLab CI, GitHub Actions): Pipeline execution
- SOAR platforms (Cortex XSOAR, Splunk SOAR (formerly Phantom), Tines): Security automation
- Orchestration (Kubernetes, Apache Airflow, Prefect, Temporal): Workflow orchestration
- Monitoring (Prometheus, Datadog, Nagios): Trigger sources for automation
- ITSM (ServiceNow, Jira): Change management, incident tracking
- Version control (GitHub, GitLab, Bitbucket): IaC and automation code storage
- Policy engines (OPA, Sentinel): Governance and compliance automation

## Edge Cases

- **Automation failure causing incident**: Immediate halt of automation pipeline; manual intervention; post-incident review to improve safeguards; add pre-flight checks
- **Complex multi-dependency changes**: Dependency graph analysis before execution; staged deployment with validation between stages; longer rollback window
- **Legacy system automation**: API wrapper development for legacy systems; screen scraping as last resort; gradual modernization plan
- **Compliance-constrained automation**: All automation actions logged and auditable; approval gates for compliance-sensitive changes; evidence generation
- **Cascading automation failures**: Circuit breaker patterns; maximum retry limits; human-in-the-loop for critical paths; isolation between automation domains
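
The circuit breaker pattern mentioned for cascading failures can be sketched as follows. This is a minimal, single-threaded illustration: the class name, the default threshold of 3 failures, and the 60-second reset timeout are assumptions, and production use would also need locking and metrics.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("automation halted: too many recent failures")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Trip on reaching the threshold, or immediately if a half-open trial fails.
            if self._failures >= self.failure_threshold or self._opened_at is not None:
                self._opened_at = self._clock()
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each automation domain in its own breaker instance gives the isolation between domains described above: a failing patch workflow trips its own breaker without blocking unrelated workflows.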

## Output

### Automation Dashboard

```
AUTOMATION OPS — Real-Time
===========================

ACTIVE WORKFLOWS:
  Running: 23
  Completed (last 24h): 187
  Failed (last 24h): 3 (1.6% failure rate)
  Self-healing triggers (last 24h): 12

IaC STATUS:
  Drift detected: 2 resources (auto-remediation in progress)
  Pending changes: 4 (in approval)
  Last full sync: 2025-04-15 06:00 UTC

CONFIGURATION COMPLIANCE:
  Systems compliant: 94% (1,028/1,093)
  Drift events (24h): 7
  Auto-remediated: 5
  Manual review needed: 2

AUTOMATION COVERAGE:
  Processes automated: 47/72 (65%)
  Estimated hours saved/week: 142
  MTTR reduction: 67%

FAILED WORKFLOWS REQUIRING ATTENTION:
  🔴 Production DB migration — failed at step 3 (rollback executed)
  ⚠  Certificate deployment to staging — retry scheduled
  ⚠  Log archival for archive server — disk full, needs manual cleanup
```

## Trigger Phrases

"automation", "orchestration", "infrastructure as code", "IaC", "Terraform", "Ansible", "automated remediation", "self-healing", "configuration management", "playbook", "drift detection", "auto-provisioning", "automated deployment", "SOAR", "incident automation", "patch automation", "self-service provisioning", "workflow automation"
