Automation Orchestration
Design, build, and manage IT automation workflows and orchestration pipelines for infrastructure provisioning, incident response, configuration management, and operational tasks. Use when creating automation playbooks, building orchestration workflows, impl...
Automation & Orchestration
Design and implement automated workflows that reduce manual operations and enable self-healing infrastructure.
Workflow
1. Automation Strategy & Framework
- Automation opportunity identification:
- Audit current manual processes (IT operations, security response, deployment, provisioning)
- Score automation candidates by: frequency, time savings, error reduction, complexity
- Prioritize: repetitive tasks (run more than once a week), high-error processes, critical-path operations
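- Estimate ROI as a benefit-to-cost ratio over a fixed period: (hours saved × hourly rate) / implementation cost; a ratio above 1 means the automation pays for itself within that period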
- Automation framework design:
- Define automation standards: coding standards, testing requirements, version control
- Select automation tools by use case (Ansible for config, Terraform for infra, Python for custom)
- Design automation library: reusable modules, shared templates, common patterns
- Establish change management process for automation code
- Automation governance:
- Code review process for all automation playbooks
- Testing requirements: unit tests, integration tests, dry-run validation
- Rollback capability for all automation workflows
- Monitoring and alerting for automation execution
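The scoring and ROI steps above can be sketched in a few lines of Python. This is an illustration only: the `Candidate` fields, the weighting inside `score`, the 52-week period, and the sample numbers are all assumptions, not part of any standard model.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A manual process considered for automation (hypothetical fields)."""
    name: str
    runs_per_week: float        # frequency
    hours_per_run: float        # time saved per run once automated
    error_rate: float           # fraction of manual runs that go wrong, 0..1
    complexity: int             # 1 (trivial) .. 5 (very hard to automate)
    implementation_cost: float  # same currency as hourly_rate


def weekly_hours_saved(c: Candidate) -> float:
    return c.runs_per_week * c.hours_per_run


def roi(c: Candidate, hourly_rate: float, weeks: int = 52) -> float:
    """(hours saved x hourly rate) / implementation cost over `weeks` weeks."""
    return weekly_hours_saved(c) * weeks * hourly_rate / c.implementation_cost


def score(c: Candidate) -> float:
    """Assumed weighting: time saved and error rate raise the score,
    complexity lowers it."""
    return weekly_hours_saved(c) * (1 + c.error_rate) / c.complexity


# Made-up sample data for illustration.
candidates = [
    Candidate("password resets", 40, 0.25, 0.02, 1, 4000),
    Candidate("quarterly DR test", 0.08, 16, 0.30, 5, 30000),
]
ranked = sorted(candidates, key=score, reverse=True)
```

With these sample values, `ranked[0]` is the password-reset process, and `roi(candidates[0], hourly_rate=50)` evaluates to 6.5. Teams will likely want to tune the weighting in `score` to their own priorities.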
2. Infrastructure as Code (IaC) Automation
- IaC development and management:
- Write infrastructure definitions (Terraform, CloudFormation, Pulumi)
- Modular code structure: reusable modules for common patterns
- Environment-specific variable files (dev, staging, production)
- State management with locking and backup
- IaC CI/CD pipeline:
- Automated validation: terraform plan, syntax checking, policy compliance (OPA/Sentinel)
- Security scanning: tfsec, checkov, terrascan
- Cost estimation for proposed changes
- Automated apply for approved changes
- State drift detection and alerting
- Drift management:
- Scheduled drift detection scans (daily minimum)
- Auto-remediation for pre-approved drift types (e.g., configuration drift)
- Alert on unauthorized changes (manual modifications)
- Drift resolution workflow with approval
- Drift trend analysis for recurring issues
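One possible way to implement a scheduled drift scan is to run `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The sketch below assumes Terraform is installed, the working directory is already initialised, and the checked-out code matches what was last applied (otherwise a non-empty plan may reflect unapplied code rather than drift). `DriftStatus`, `detect_drift`, and `next_action` are hypothetical names.

```python
import subprocess
from enum import Enum


class DriftStatus(Enum):
    IN_SYNC = "in_sync"
    DRIFTED = "drifted"
    ERROR = "error"


def interpret_plan_exit_code(code: int) -> DriftStatus:
    """Map `terraform plan -detailed-exitcode` exit codes to a status.

    0 = no changes, 2 = changes present, anything else = error.
    """
    if code == 0:
        return DriftStatus.IN_SYNC
    if code == 2:
        return DriftStatus.DRIFTED
    return DriftStatus.ERROR


def detect_drift(workdir: str) -> DriftStatus:
    """Run a plan in `workdir` without applying anything."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)


def next_action(status: DriftStatus, pre_approved: bool) -> str:
    """Assumed policy: remediate pre-approved drift, otherwise ask a human."""
    if status is DriftStatus.IN_SYNC:
        return "none"
    if status is DriftStatus.ERROR:
        return "alert: scan failed"
    return "auto-remediate" if pre_approved else "alert: approval required"
```

A scheduler (cron, a CI schedule, or an orchestrator) could call `detect_drift` per workspace at least daily and route the result through `next_action`.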
3. Configuration Management Automation
- Configuration baseline enforcement:
- Define desired state for all systems (OS, applications, security settings)
- Automated configuration assessment (agent-based or agentless)
- Auto-remediation for configuration drift
- Compliance reporting against baselines
- Patch and update automation:
- Automated patch scanning and assessment
- Staged deployment: test → staging → production
- Maintenance window scheduling
- Automated rollback on failed patches
- Patch compliance reporting
- Application deployment automation:
- Zero-downtime deployment strategies (blue-green, canary, rolling)
- Health check integration with deployment pipeline
- Automated rollback on post-deployment failures
- Configuration injection per environment
- Deployment notification and logging
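Baseline enforcement and compliance reporting boil down to comparing a desired state with what each system actually reports. A minimal sketch, assuming system state has already been collected into plain dictionaries (the setting names and hosts are made up for illustration):

```python
def find_drift(desired: dict, actual: dict) -> dict:
    """Return {setting: (desired_value, actual_value)} for each mismatch."""
    return {
        key: (want, actual.get(key))
        for key, want in desired.items()
        if actual.get(key) != want
    }


def compliance_report(baseline: dict, fleet: dict) -> dict:
    """Summarise compliance for a fleet given as {host: actual_state}."""
    drift = {host: find_drift(baseline, state) for host, state in fleet.items()}
    non_compliant = {host: d for host, d in drift.items() if d}
    compliant = len(fleet) - len(non_compliant)
    percent = 100.0 * compliant / len(fleet) if fleet else 100.0
    return {
        "compliant": compliant,
        "total": len(fleet),
        "percent": percent,
        "drift": non_compliant,
    }


# Hypothetical baseline and collected state.
baseline = {"ssh_root_login": "no", "ntp_enabled": True}
fleet = {
    "web-01": {"ssh_root_login": "no", "ntp_enabled": True},
    "web-02": {"ssh_root_login": "yes", "ntp_enabled": True},
}
report = compliance_report(baseline, fleet)
```

Here `report["drift"]` lists only `web-02`, with the desired and actual values of `ssh_root_login`, which is the input an auto-remediation step would need.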
4. Incident Response Automation
- Automated incident playbooks:
- Define playbooks for common incident types (DDoS, malware, outage, data breach)
- Automated containment: isolate systems, block IPs, disable accounts
- Automated evidence collection: logs, screenshots, memory dumps
- Auto-create incident tickets with full context
- Escalation based on incident severity and resolution status
- Self-healing workflows:
- Auto-restart failed services after health check failure
- Auto-scale during resource exhaustion
- Auto-clear disk space (log rotation, temp file cleanup)
- Auto-failover to standby systems
- Auto-reprovision failed containers/pods
- Security response automation (SOAR):
- Phishing response: auto-quarantine email, extract IOCs, search the rest of the organization's mailboxes for the same IOCs
- Malware response: isolate endpoint, scan connected systems, block C2 domains
- Credential compromise: force password reset, revoke sessions, alert user
- DDoS response: enable rate limiting, engage DDoS mitigation service
- Alert fatigue reduction: auto-correlation and suppression of related alerts
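Alert correlation can take many forms; one simple approach is to group alerts that share a host and rule and arrive close together, then notify only on the first alert of each group. The sketch below assumes that grouping key and a 5-minute window, both of which are illustrative choices rather than a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    host: str
    rule: str
    timestamp: float  # seconds since epoch


def correlate(alerts, window_seconds=300):
    """Group alerts with the same (host, rule) whose gap to the previous
    alert in the group is within the window. Returns a list of groups;
    the first alert of each group is the one worth notifying on."""
    groups = []
    open_group = {}  # (host, rule) -> index of the latest group for that key
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.host, alert.rule)
        idx = open_group.get(key)
        if idx is not None and alert.timestamp - groups[idx][-1].timestamp <= window_seconds:
            groups[idx].append(alert)  # suppressed: part of an existing group
        else:
            open_group[key] = len(groups)
            groups.append([alert])
    return groups


alerts = [
    Alert("db-01", "disk", 0),
    Alert("db-01", "disk", 120),
    Alert("web-01", "cpu", 60),
    Alert("db-01", "disk", 1000),
]
groups = correlate(alerts)
```

The four sample alerts collapse into three groups: the two early `db-01` disk alerts are merged, while the one at 1000 seconds starts a new group because it falls outside the window.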
5. Operational Task Automation
- Provisioning and deprovisioning:
- Employee onboarding: automated account creation, access assignment, equipment ordering
- Employee offboarding: account disable, access revocation, data archiving
- System provisioning: server builds, container deployments, database creation
- Service decommissioning: data backup, DNS removal, resource cleanup
- Routine maintenance automation:
- Log rotation and archival (scheduled daily)
- Database maintenance: vacuum, index rebuild, statistics update
- Backup verification and integrity testing
- Certificate renewal and deployment
- Health check runs and reporting
- Report and notification automation:
- Scheduled report generation and distribution
- Automated summary emails for operations teams
- Dashboard refresh and data pipeline execution
- Alert escalation and notification routing
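As an example of a routine maintenance check, certificate renewal usually starts with finding certificates that expire soon. The sketch below assumes expiry dates have already been collected as timezone-aware datetimes and uses a 30-day renewal threshold; both the threshold and the certificate names are assumptions.

```python
from datetime import datetime, timedelta, timezone


def certificates_to_renew(certs: dict, now: datetime, renew_within_days: int = 30) -> list:
    """Return certificate names that expire within the threshold
    (or have already expired), soonest expiry first.

    `certs` maps a certificate name to its not-after datetime.
    """
    threshold = timedelta(days=renew_within_days)
    due = [
        (not_after, name)
        for name, not_after in certs.items()
        if not_after - now <= threshold
    ]
    return [name for _, name in sorted(due)]


# Hypothetical inventory.
now = datetime(2025, 4, 15, tzinfo=timezone.utc)
certs = {
    "api": now + timedelta(days=10),
    "intranet": now + timedelta(days=90),
    "vpn": now - timedelta(days=1),
}
due = certificates_to_renew(certs, now)
```

For this inventory `due` is `["vpn", "api"]`: the already-expired certificate comes first and the one with 90 days left is skipped.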
Templates & Frameworks
Automation Pipeline Structure
AUTOMATION PIPELINE — Infrastructure Change
============================================
Stage 1: Code Commit
→ Git push triggers CI pipeline
→ Linting and formatting validation
→ Unit tests execution
Stage 2: Plan & Validate
→ terraform plan (dry run)
→ Security policy scan (tfsec + OPA)
→ Cost impact estimation
→ Approval workflow (if production change)
Stage 3: Test Environment
→ Apply to test environment
→ Integration tests
→ Smoke tests against live system
→ Performance validation
Stage 4: Staging Environment
→ Apply to staging
→ Full regression test suite
→ Security scan of deployed infrastructure
→ Stakeholder approval
Stage 5: Production Deployment
→ Apply to production (maintenance window)
→ Health check validation
→ Monitoring alert review (30 min observation)
→ Success notification or auto-rollback
Stage 6: Post-Deployment
→ Update documentation
→ Verify drift detection baseline
→ Archive deployment artifacts
→ Generate change report
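The stage sequence above can be modelled as an ordered list of steps, each with an optional undo action. The sketch below is one possible shape, not a replacement for a CI/CD platform: `Stage` and `run_pipeline` are hypothetical names, and the assumption is that a failure should undo the failed stage and every completed stage in reverse order.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]                     # returns True on success
    undo: Optional[Callable[[], None]] = None   # rollback for this stage


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order; on the first failure, roll back in reverse."""
    log: List[str] = []
    attempted: List[Stage] = []
    for stage in stages:
        attempted.append(stage)
        if stage.run():
            log.append(f"{stage.name}: ok")
            continue
        log.append(f"{stage.name}: FAILED")
        # Include the failed stage: a failed apply may be partially applied.
        for done in reversed(attempted):
            if done.undo is not None:
                done.undo()
                log.append(f"{done.name}: rolled back")
        return False, log
    log.append("success notification sent")
    return True, log
```

A production stage would typically wrap its `run` callable in an approval check and a post-apply health check, so that a failed health check returns `False` and triggers the rollback path.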
Self-Healing Workflow Examples
SELF-HEALING WORKFLOWS
=======================
Service Down → Auto-Restart:
Trigger: Health check failure (3 consecutive failures)
Action: Restart service via systemd/container runtime
Validation: Health check passes within 60 seconds
Escalation: If restart fails after 3 attempts → page on-call engineer
Disk Space Critical (>90%):
Trigger: Disk usage alert
Action: Rotate logs → Clear temp files → Archive old data
Validation: Disk usage drops below 80%
Escalation: If disk remains >90% after cleanup → page on-call
Memory Pressure (>85% for 5 min):
Trigger: Memory utilization alert
Action: Identify top memory consumers → Gracefully restart non-critical services
Validation: Memory drops below 75%
Escalation: If OOM killer triggered → page on-call immediately
Database Connection Pool Exhaustion:
Trigger: Connection pool at 95% capacity
Action: Kill idle connections (>30 min) → Increase pool size by 20%
Validation: Pool utilization drops below 70%
Escalation: If pool exhaustion persists → page DBA on-call
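The "Service Down → Auto-Restart" workflow can be sketched as a small state machine driven by periodic health checks. `RestartHealer` is a hypothetical class; the `check`, `restart`, and `page` callables are assumed to be supplied by the monitoring and paging integrations, and the 60-second validation is left to the interval at which `tick` is scheduled.

```python
from typing import Callable


class RestartHealer:
    """Restart after N consecutive failed checks; page after M restarts."""

    def __init__(self, check: Callable[[], bool], restart: Callable[[], None],
                 page: Callable[[], None], failure_threshold: int = 3,
                 max_restarts: int = 3):
        self.check = check
        self.restart = restart
        self.page = page
        self.failure_threshold = failure_threshold
        self.max_restarts = max_restarts
        self.consecutive_failures = 0
        self.restarts = 0
        self.paged = False

    def tick(self) -> str:
        """Run one health-check cycle and return the action taken."""
        if self.check():
            self.consecutive_failures = 0
            self.restarts = 0
            self.paged = False
            return "healthy"
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return "waiting"
        if self.restarts >= self.max_restarts:
            if not self.paged:  # page the on-call engineer only once
                self.page()
                self.paged = True
            return "escalated"
        self.restart()
        self.restarts += 1
        self.consecutive_failures = 0
        return "restarted"
```

Resetting the counters on a healthy check means a service that recovers gets a fresh restart budget, which is a design choice; some teams may prefer a budget per time window to catch flapping services.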
Integration Points
- IaC tools (Terraform, CloudFormation, Pulumi, Crossplane): Infrastructure provisioning
- Configuration management (Ansible, Puppet, Chef, SaltStack): System configuration
- CI/CD platforms (Jenkins, GitLab CI, GitHub Actions): Pipeline execution
- SOAR platforms (Cortex XSOAR, Phantom, Tines): Security automation
- Orchestration (Kubernetes, Apache Airflow, Prefect, Temporal): Workflow orchestration
- Monitoring (Prometheus, Datadog, Nagios): Trigger sources for automation
- ITSM (ServiceNow, Jira): Change management, incident tracking
- Version control (GitHub, GitLab, Bitbucket): IaC and automation code storage
- Policy engines (OPA, Sentinel): Governance and compliance automation
Edge Cases
- Automation failure causing incident: Immediate halt of automation pipeline; manual intervention; post-incident review to improve safeguards; add pre-flight checks
- Complex multi-dependency changes: Dependency graph analysis before execution; staged deployment with validation between stages; longer rollback window
- Legacy system automation: API wrapper development for legacy systems; screen scraping as last resort; gradual modernization plan
- Compliance-constrained automation: All automation actions logged and auditable; approval gates for compliance-sensitive changes; evidence generation
- Cascading automation failures: Circuit breaker patterns; maximum retry limits; human-in-the-loop for critical paths; isolation between automation domains
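The circuit breaker and retry-limit safeguards mentioned for cascading failures can be illustrated with a small wrapper around any automation action. This is a generic sketch of the pattern under assumed defaults (3 failures, 5-minute cool-down); `CircuitBreaker` and `CircuitOpenError` are hypothetical names, not a specific library's API.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the action is not attempted."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=300.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, action, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("automation halted; manual review required")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = action(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open the circuit
            raise
        self.failures = 0  # success closes the circuit fully
        return result
```

Using one breaker per automation domain keeps a failing workflow in one area from blocking unrelated workflows, which matches the isolation point above.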
Output
Automation Dashboard
AUTOMATION OPS — Real-Time
===========================
ACTIVE WORKFLOWS:
Running: 23
Completed (last 24h): 187
Failed (last 24h): 3 (1.6% failure rate)
Self-healing triggers (last 24h): 12
IaC STATUS:
Drift detected: 2 resources (auto-remediation in progress)
Pending changes: 4 (in approval)
Last full sync: 2025-04-15 06:00 UTC
CONFIGURATION COMPLIANCE:
Systems compliant: 94% (1,028/1,093)
Drift events (24h): 7
Auto-remediated: 5
Manual review needed: 2
AUTOMATION COVERAGE:
Processes automated: 47/72 (65%)
Estimated hours saved/week: 142
MTTR reduction: 67%
FAILED WORKFLOWS REQUIRING ATTENTION:
🔴 Production DB migration — failed at step 3 (rollback executed)
⚠ Certificate deployment to staging — retry scheduled
⚠ Log archival for archive server — disk full, needs manual cleanup
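The percentages on the dashboard are simple ratios of the raw counts shown next to them. A small helper makes the arithmetic explicit; it assumes the failure rate is computed as failed runs over completed runs, which is one plausible reading of the sample figures.

```python
def percent(part: int, whole: int, digits: int = 0) -> float:
    """Return part/whole as a percentage, rounded to `digits` decimals."""
    if whole == 0:
        return 0.0
    return round(100 * part / whole, digits)


failure_rate = percent(3, 187, digits=1)   # → 1.6
compliance = percent(1028, 1093)           # → 94.0
coverage = percent(47, 72)                 # → 65.0
```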
Trigger Phrases
"automation", "orchestration", "infrastructure as code", "IaC", "Terraform", "Ansible", "automated remediation", "self-healing", "configuration management", "playbook", "drift detection", "auto-provisioning", "automated deployment", "SOAR", "incident automation", "patch automation", "self-service provisioning", "workflow automation"