IT AI Skill

Automation Orchestration

Design, build, and manage IT automation workflows and orchestration pipelines for infrastructure provisioning, incident response, configuration management, and operational tasks. Use when creating automation playbooks, building orchestration workflows, implementing infrastructure as code, automating operational tasks, managing configuration drift, or designing self-healing systems. Triggers on phrases like "automation", "orchestration", "infrastructure as code", "IaC", "automated remediation", "self-healing", "configuration management", "playbook", "Ansible", "Terraform automation".

Automation & Orchestration

Design and implement automated workflows that reduce manual operations and enable self-healing infrastructure.

Workflow

1. Automation Strategy & Framework

Automation opportunity identification:

Audit current manual processes (IT operations, security response, deployment, provisioning)
Score automation candidates by: frequency, time savings, error reduction, complexity
Prioritize: repetitive tasks (performed more often than weekly), high-error processes, critical-path operations
Estimate ROI: (hours saved × hourly rate) / implementation cost, with hours saved measured over a fixed period such as one year
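
The scoring and ROI steps above can be sketched in a few lines of Python. This is an illustrative sketch only: the `Candidate` fields, the one-year savings period, and the weighting in `priority_score` are assumptions, not a prescribed model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    runs_per_week: float        # how often the manual task is performed
    minutes_per_run: float      # manual effort per run
    error_rate: float           # fraction of manual runs that go wrong (0..1)
    complexity: int             # 1 (trivial) to 5 (hard to automate)
    implementation_cost: float  # one-off cost to build the automation


def annual_hours_saved(c: Candidate) -> float:
    # Assumption: savings are counted over 52 weeks.
    return c.runs_per_week * 52 * c.minutes_per_run / 60


def roi(c: Candidate, hourly_rate: float) -> float:
    # (hours saved x hourly rate) / implementation cost
    return annual_hours_saved(c) * hourly_rate / c.implementation_cost


def priority_score(c: Candidate, hourly_rate: float) -> float:
    # Hypothetical weighting: reward ROI and error reduction, penalize complexity.
    return roi(c, hourly_rate) * (1 + c.error_rate) / c.complexity


def rank(candidates: List[Candidate], hourly_rate: float) -> List[str]:
    ordered = sorted(candidates, key=lambda c: priority_score(c, hourly_rate), reverse=True)
    return [c.name for c in ordered]
```

A ratio above 1 from `roi` means the automation pays for itself within the chosen period.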

Automation framework design:

Define automation standards: coding standards, testing requirements, version control
Select automation tools by use case (Ansible for configuration, Terraform for infrastructure, Python for custom logic)
Design automation library: reusable modules, shared templates, common patterns
Establish change management process for automation code

Automation governance:

Code review process for all automation playbooks
Testing requirements: unit tests, integration tests, dry-run validation
Rollback capability for all automation workflows
Monitoring and alerting for automation execution

2. Infrastructure as Code (IaC) Automation

IaC development and management:

Write infrastructure definitions (Terraform, CloudFormation, Pulumi)
Modular code structure: reusable modules for common patterns
Environment-specific variable files (dev, staging, production)
State management with locking and backup

IaC CI/CD pipeline:

Automated validation: terraform plan, syntax checking, policy compliance (OPA/Sentinel)
Security scanning: tfsec, checkov, terrascan
Cost estimation for proposed changes
Automated apply for approved changes
State drift detection and alerting
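
One way to wire the validation and approval steps together is to branch on the exit code of `terraform plan -detailed-exitcode`, which is 0 when there are no changes, 1 on error, and 2 when changes are present. The sketch below assumes the Terraform CLI is on PATH; `plan_gate`, its return labels, and the injectable `runner` are hypothetical names introduced here for illustration.

```python
import subprocess
from typing import Callable, Sequence

# Exit codes of `terraform plan -detailed-exitcode`.
PLAN_OUTCOMES = {0: "no_changes", 1: "error", 2: "changes_present"}


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code (needs the CLI installed)."""
    return subprocess.run(list(cmd), check=False).returncode


def plan_gate(workdir: str, approved: bool,
              runner: Callable[[Sequence[str]], int] = run_command) -> str:
    """Decide the next pipeline step from the plan result."""
    code = runner(["terraform", f"-chdir={workdir}", "plan",
                   "-detailed-exitcode", "-input=false", "-out=tfplan"])
    outcome = PLAN_OUTCOMES.get(code, "error")
    if outcome == "no_changes":
        return "skip"            # nothing to apply
    if outcome == "error":
        return "fail"            # stop the pipeline
    # Changes are present: apply only once the change has been approved.
    return "apply" if approved else "await_approval"
```

Passing a fake `runner` keeps the gate logic testable in CI without touching real infrastructure.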

Drift management:

Scheduled drift detection scans (daily minimum)
Auto-remediation for pre-approved categories of drift (for example, known configuration drift)
Alert on unauthorized changes (manual modifications)
Drift resolution workflow with approval
Drift trend analysis for recurring issues
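
The triage rules above could look roughly like this. The `source` labels, the `AUTO_REMEDIABLE` allow-list, and the recurrence threshold are hypothetical values chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DriftEvent:
    resource: str
    attribute: str
    source: str  # hypothetical labels: "manual" or "automation"


# Hypothetical allow-list of attributes that may be reverted without approval.
AUTO_REMEDIABLE = {"tags", "monitoring"}


def triage(event: DriftEvent) -> str:
    if event.source == "manual":
        return "alert"             # unauthorized manual modification
    if event.attribute in AUTO_REMEDIABLE:
        return "auto_remediate"    # pre-approved drift category
    return "needs_approval"        # goes through the resolution workflow


def recurring(events: List[DriftEvent], threshold: int = 3) -> List[str]:
    """Resources that drifted at least `threshold` times (trend analysis)."""
    counts = Counter(e.resource for e in events)
    return sorted(r for r, n in counts.items() if n >= threshold)
```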

3. Configuration Management Automation

Configuration baseline enforcement:

Define desired state for all systems (OS, applications, security settings)
Automated configuration assessment (agent-based or agentless)
Auto-remediation for configuration drift
Compliance reporting against baselines
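
At its core, baseline enforcement is a comparison of desired state against observed state. The following minimal sketch assumes both are available as flat dictionaries; the setting names used in the usage note are made up.

```python
from typing import Dict, Tuple

MISSING = "<missing>"  # marker for settings absent on the system


def find_drift(desired: Dict[str, str], actual: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Map each drifted setting to (desired value, observed value)."""
    drift = {}
    for key, want in desired.items():
        have = actual.get(key, MISSING)
        if have != want:
            drift[key] = (want, have)
    return drift


def compliance_percent(desired: Dict[str, str], fleet: Dict[str, Dict[str, str]]) -> float:
    """Share of systems in `fleet` with no drift against the baseline."""
    if not fleet:
        return 100.0
    compliant = sum(1 for state in fleet.values() if not find_drift(desired, state))
    return 100.0 * compliant / len(fleet)
```

For example, with a baseline of `{"ssh_root_login": "no"}`, a host reporting `"yes"` yields `{"ssh_root_login": ("no", "yes")}`, which can feed either auto-remediation or the compliance report.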

Patch and update automation:

Automated patch scanning and assessment
Staged deployment: test → staging → production
Maintenance window scheduling
Automated rollback on failed patches
Patch compliance reporting
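
Staged promotion and maintenance windows reduce to two small checks. This sketch assumes a daily window given as start and end times, which may wrap past midnight; the function names are illustrative.

```python
from datetime import datetime, time
from typing import Optional

STAGES = ("test", "staging", "production")


def next_stage(current: str) -> Optional[str]:
    """Next environment in the promotion order, or None after production."""
    index = STAGES.index(current)
    return STAGES[index + 1] if index + 1 < len(STAGES) else None


def in_window(now: datetime, start: time, end: time) -> bool:
    """True if `now` falls inside the daily maintenance window [start, end)."""
    current = now.time()
    if start <= end:
        return start <= current < end
    # Window wraps past midnight, e.g. 22:00 to 02:00.
    return current >= start or current < end
```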

Application deployment automation:

Zero-downtime deployment strategies (blue-green, canary, rolling)
Health check integration with deployment pipeline
Automated rollback on post-deployment failures
Configuration injection per environment
Deployment notification and logging
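
A canary rollout with health-check integration can be sketched as a loop over traffic steps. The step sizes, the 1% error threshold, and the `error_rate_at` callback are assumptions; a real pipeline would read the error rate from its monitoring system.

```python
from typing import Callable, Sequence, Tuple


def canary_rollout(error_rate_at: Callable[[int], float],
                   steps: Sequence[int] = (5, 25, 50, 100),
                   max_error_rate: float = 0.01) -> Tuple[int, str]:
    """Shift traffic step by step; roll back at the first unhealthy step.

    Returns (traffic percent reached, decision).
    """
    for percent in steps:
        # error_rate_at(percent) stands in for "route traffic, then measure".
        if error_rate_at(percent) > max_error_rate:
            return percent, "rollback"
    return steps[-1], "promoted"
```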

4. Incident Response Automation

Automated incident playbooks:

Define playbooks for common incident types (DDoS, malware, outage, data breach)
Automated containment: isolate systems, block IPs, disable accounts
Automated evidence collection: logs, screenshots, memory dumps
Auto-create incident tickets with full context
Escalation based on incident severity and resolution status

Self-healing workflows:

Auto-restart failed services after health check failure
Auto-scale during resource exhaustion
Auto-clear disk space (log rotation, temp file cleanup)
Auto-failover to standby systems
Auto-reprovision failed containers/pods

Security response automation (SOAR):

Phishing response: auto-quarantine email, extract IOCs, scan organization
Malware response: isolate endpoint, scan connected systems, block C2 domains
Credential compromise: force password reset, revoke sessions, alert user
DDoS response: enable rate limiting, engage DDoS mitigation service
Alert fatigue reduction: auto-correlation and suppression of related alerts
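
Alert correlation can be approximated by grouping alerts that share a host and type and arrive close together, so that one ticket is opened per group. The grouping key and the five-minute window below are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since epoch
    host: str
    kind: str


def correlate(alerts: List[Alert], window_seconds: float = 300) -> List[List[Alert]]:
    """Group alerts with the same host and kind that arrive within
    `window_seconds` of the previous alert in the same group."""
    groups: List[List[Alert]] = []
    latest: Dict[Tuple[str, str], int] = {}  # key -> index of newest group
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.host, alert.kind)
        index = latest.get(key)
        if index is not None and alert.timestamp - groups[index][-1].timestamp <= window_seconds:
            groups[index].append(alert)   # suppressed into an existing group
        else:
            latest[key] = len(groups)
            groups.append([alert])        # starts a new incident
    return groups
```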

5. Operational Task Automation

Provisioning and deprovisioning:

Employee onboarding: automated account creation, access assignment, equipment ordering
Employee offboarding: account disable, access revocation, data archiving
System provisioning: server builds, container deployments, database creation
Service decommissioning: data backup, DNS removal, resource cleanup

Routine maintenance automation:

Log rotation and archival (scheduled daily)
Database maintenance: vacuum, index rebuild, statistics update
Backup verification and integrity testing
Certificate renewal and deployment
Health check runs and reporting
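
As an example of a routine maintenance check, certificate renewal can be driven by a simple expiry comparison. The 30-day lead time and the certificate names in the usage note are assumptions; expiry dates would come from the certificate inventory.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def needs_renewal(not_after: datetime, now: datetime, renew_before_days: int = 30) -> bool:
    """True if the certificate expires within the renewal lead time."""
    return not_after - now <= timedelta(days=renew_before_days)


def renewal_queue(certs: Dict[str, datetime], now: datetime,
                  renew_before_days: int = 30) -> List[str]:
    """Names of certificates due for renewal, soonest expiry first."""
    due = [name for name, expiry in certs.items()
           if needs_renewal(expiry, now, renew_before_days)]
    return sorted(due, key=lambda name: certs[name])
```

Given certificates expiring in 3, 10, and 90 days, the queue contains the first two, with the 3-day certificate first.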

Report and notification automation:

Scheduled report generation and distribution
Automated summary emails for operations teams
Dashboard refresh and data pipeline execution
Alert escalation and notification routing

Templates & Frameworks

Automation Pipeline Structure

AUTOMATION PIPELINE — Infrastructure Change
============================================

Stage 1: Code Commit
  → Git push triggers CI pipeline
  → Linting and formatting validation
  → Unit tests execution

Stage 2: Plan & Validate
  → terraform plan (dry run)
  → Security policy scan (tfsec + OPA)
  → Cost impact estimation
  → Approval workflow (if production change)

Stage 3: Test Environment
  → Apply to test environment
  → Integration tests
  → Smoke tests against live system
  → Performance validation

Stage 4: Staging Environment
  → Apply to staging
  → Full regression test suite
  → Security scan of deployed infrastructure
  → Stakeholder approval

Stage 5: Production Deployment
  → Apply to production (maintenance window)
  → Health check validation
  → Monitoring alert review (30 min observation)
  → Success notification or auto-rollback

Stage 6: Post-Deployment
  → Update documentation
  → Verify drift detection baseline
  → Archive deployment artifacts
  → Generate change report
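
The stage sequence above, with its approval gates and rollback, can be modeled as a small runner. This is a sketch under stated assumptions: `Stage`, `run_pipeline`, and the log messages are hypothetical, and each stage's `run` callable stands in for the real plan, apply, or test step.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]       # returns True on success
    needs_approval: bool = False  # e.g. production changes


def run_pipeline(stages: List[Stage], approvals: Set[str],
                 rollback: Callable[[], None]) -> List[str]:
    """Run stages in order; pause at unapproved gates, roll back on failure."""
    log: List[str] = []
    for stage in stages:
        if stage.needs_approval and stage.name not in approvals:
            log.append(f"{stage.name}: waiting for approval")
            return log
        if not stage.run():
            log.append(f"{stage.name}: failed")
            rollback()
            log.append("rollback executed")
            return log
        log.append(f"{stage.name}: ok")
    log.append("pipeline complete")
    return log
```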

Self-Healing Workflow Examples

SELF-HEALING WORKFLOWS
=======================

Service Down → Auto-Restart:
  Trigger: Health check failure (3 consecutive failures)
  Action: Restart service via systemd/container runtime
  Validation: Health check passes within 60 seconds
  Escalation: If restart fails after 3 attempts → page on-call engineer

Disk Space Critical (>90%):
  Trigger: Disk usage alert
  Action: Rotate logs → Clear temp files → Archive old data
  Validation: Disk usage drops below 80%
  Escalation: If disk usage is still above 80% after cleanup → page on-call

Memory Pressure (>85% for 5 min):
  Trigger: Memory utilization alert
  Action: Identify top memory consumers → Gracefully restart non-critical services
  Validation: Memory drops below 75%
  Escalation: If OOM killer triggered → page on-call immediately

Database Connection Pool Exhaustion:
  Trigger: Connection pool at 95% capacity
  Action: Kill idle connections (>30 min) → Increase pool size by 20%
  Validation: Pool utilization drops below 70%
  Escalation: If pool exhaustion persists → page DBA on-call
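
Each workflow above follows the same trigger, action, validation, escalation shape. A minimal sketch of the Service Down case is shown below; `restart`, `is_healthy`, and `page_on_call` are hypothetical callables that would wrap systemd, the container runtime, and the paging tool. The defaults mirror the values above (3 attempts, 60 second validation window), while the 5 second poll interval is an assumption.

```python
import time
from typing import Callable, Tuple


def heal_service(restart: Callable[[], None],
                 is_healthy: Callable[[], bool],
                 page_on_call: Callable[[], None],
                 max_attempts: int = 3,
                 wait_seconds: float = 60.0,
                 poll_seconds: float = 5.0,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> Tuple[str, int]:
    """Restart, wait for the health check to pass, escalate if it never does."""
    for attempt in range(1, max_attempts + 1):
        restart()
        deadline = clock() + wait_seconds
        while clock() < deadline:
            if is_healthy():
                return "recovered", attempt
            sleep(poll_seconds)
    page_on_call()  # all attempts failed validation
    return "escalated", max_attempts
```

The `sleep` and `clock` parameters are injectable so the workflow can be exercised in tests without real waiting.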

Integration Points

IaC tools (Terraform, CloudFormation, Pulumi, Crossplane): Infrastructure provisioning
Configuration management (Ansible, Puppet, Chef, SaltStack): System configuration
CI/CD platforms (Jenkins, GitLab CI, GitHub Actions): Pipeline execution
SOAR platforms (Cortex XSOAR, Splunk SOAR (formerly Phantom), Tines): Security automation
Orchestration (Kubernetes, Apache Airflow, Prefect, Temporal): Workflow orchestration
Monitoring (Prometheus, Datadog, Nagios): Trigger sources for automation
ITSM (ServiceNow, Jira): Change management, incident tracking
Version control (GitHub, GitLab, Bitbucket): IaC and automation code storage
Policy engines (OPA, Sentinel): Governance and compliance automation

Edge Cases

Automation failure causing incident: Immediate halt of automation pipeline; manual intervention; post-incident review to improve safeguards; add pre-flight checks
Complex multi-dependency changes: Dependency graph analysis before execution; staged deployment with validation between stages; longer rollback window
Legacy system automation: API wrapper development for legacy systems; screen scraping as last resort; gradual modernization plan
Compliance-constrained automation: All automation actions logged and auditable; approval gates for compliance-sensitive changes; evidence generation
Cascading automation failures: Circuit breaker patterns; maximum retry limits; human-in-the-loop for critical paths; isolation between automation domains
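
The circuit breaker pattern mentioned for cascading failures can be sketched as follows. The thresholds and the class interface are assumptions; the idea is that after repeated failures the automation stops acting until a cool-down has passed, then allows a single trial run.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """May the automation run right now?"""
        if self.opened_at is None:
            return True  # closed: normal operation
        # Open: block until the cool-down has passed, then allow a trial run.
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, success: bool) -> None:
        """Report the outcome of a run."""
        if success:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()  # open (or re-open) the breaker
```

A workflow would call `allow()` before acting and `record()` afterwards; when `allow()` returns False, the task is handed to a human instead of retried.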

Output

Automation Dashboard

AUTOMATION OPS — Real-Time
===========================

ACTIVE WORKFLOWS:
  Running: 23
  Completed (last 24h): 187
  Failed (last 24h): 3 (1.6% failure rate)
  Self-healing triggers (last 24h): 12

IaC STATUS:
  Drift detected: 2 resources (auto-remediation in progress)
  Pending changes: 4 (in approval)
  Last full sync: 2025-04-15 06:00 UTC

CONFIGURATION COMPLIANCE:
  Systems compliant: 94% (1,028/1,093)
  Drift events (24h): 7
  Auto-remediated: 5
  Manual review needed: 2

AUTOMATION COVERAGE:
  Processes automated: 47/72 (65%)
  Estimated hours saved/week: 142
  MTTR reduction: 67%

FAILED WORKFLOWS REQUIRING ATTENTION:
  🔴 Production DB migration — failed at step 3 (rollback executed)
  ⚠  Certificate deployment to staging — retry scheduled
  ⚠  Log archival for archive server — disk full, needs manual cleanup

Trigger Phrases

"automation", "orchestration", "infrastructure as code", "IaC", "Terraform", "Ansible", "automated remediation", "self-healing", "configuration management", "playbook", "drift detection", "auto-provisioning", "automated deployment", "SOAR", "incident automation", "patch automation", "self-service provisioning", "workflow automation"

Disclaimer: All rights reserved by Circulos AI. These skills are specifically designed for Claude Code, Claude Cowork, Codex, and OpenClaw. When using or referencing any skill, please provide proper attribution to Circulos AI.