Build and maintain IT documentation systems including Configuration Management Database (CMDB), technical documentation, runbooks, knowledge bases, and system architecture diagrams. Use when setting up CMDB, creating IT documentation standards, building sys...

IT Documentation & CMDB

Maintain comprehensive IT documentation, configuration management, and operational knowledge bases.

Workflow

  1. Define documentation standards: templates, formats, review cycles, approval processes, and ownership.
  2. Implement CMDB: discover configuration items (CIs), establish relationships, automate data collection.
  3. Create system documentation: architecture diagrams, deployment topology, network diagrams, data flow maps.
  4. Build operational runbooks: step-by-step procedures for common operations, incident response, and maintenance.
  5. Develop knowledge base: troubleshooting guides, FAQs, known errors, workarounds, best practices.
  6. Establish documentation governance: review cadence, change triggers, version control, access management.
  7. Automate documentation where possible: IaC-generated docs, auto-discovered CMDB, API-generated diagrams.
  8. Train team on documentation practices: when to document, how to write, quality standards.
  9. Audit documentation quarterly: completeness, accuracy, accessibility, relevance.
  10. Continuously improve: feedback loops, usage analytics, documentation quality metrics.
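
As one illustration of step 7, a component inventory can be generated from CI records instead of being maintained by hand. The sketch below is a minimal example, assuming the records arrive as plain dictionaries; the function name `render_inventory_table` and the sample records are hypothetical and not part of any particular CMDB product.

```python
from typing import Iterable, Mapping


def render_inventory_table(
    cis: Iterable[Mapping[str, str]],
    columns: tuple[str, ...] = ("CI_ID", "Name", "Type", "Status"),
) -> str:
    """Render CI records as a Markdown table, one row per CI, sorted by the first column."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    rows = [
        "| " + " | ".join(str(ci.get(col, "")) for col in columns) + " |"
        for ci in sorted(cis, key=lambda ci: str(ci.get(columns[0], "")))
    ]
    return "\n".join([header, divider, *rows])


# Hypothetical records standing in for a CMDB export.
sample_cis = [
    {"CI_ID": "SW-0001", "Name": "billing-api", "Type": "Application", "Status": "In Operation"},
    {"CI_ID": "HW-0001", "Name": "db-host-01", "Type": "Server", "Status": "In Operation"},
]

print(render_inventory_table(sample_cis))
```

Regenerating the table on every CMDB change keeps the inventory section of an architecture document from drifting away from the source records.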

CMDB Implementation

CMDB FRAMEWORK
================

Configuration Item (CI) Categories:

  1. Hardware CIs:
     - Servers (physical): model, serial, location, specs, warranty, owner
     - Network devices: router, switch, firewall, load balancer, WAP
     - Storage devices: SAN, NAS, tape library, backup appliance
     - Endpoints: laptops, desktops, phones, tablets, printers
     - Data center: racks, PDUs, UPS, CRAC units

  2. Software CIs:
     - Operating systems: version, patch level, license
     - Applications: version, configuration, dependencies, owner
     - Middleware: web server, app server, message queue
     - Databases: type, version, size, owner, backup status
     - Libraries and frameworks: version, license, known vulnerabilities

  3. Virtual CIs:
     - Virtual machines: host, specs, OS, applications, snapshots
     - Containers: image, version, registry, deployment target
     - Kubernetes: clusters, namespaces, deployments, services
     - Serverless: functions, triggers, dependencies

  4. Cloud CIs:
     - Compute: EC2, VMs, instances (type, size, region, AZ)
     - Storage: S3 buckets, blobs, disks (size, encryption, access)
     - Network: VPCs, subnets, security groups, load balancers
     - Database: RDS, DynamoDB, CosmosDB (type, size, Multi-AZ)
     - Managed services: Lambda, SQS, SNS, EKS (configuration)

  5. Service CIs:
     - Business services: payroll, CRM, email, website, API
     - IT services: DNS, DHCP, NTP, proxy, VPN
     - Sub-services: components that support parent services
     - Dependencies: service-to-service relationships

  6. Document CIs:
     - Policies and procedures
     - Contracts and licenses
     - Certifications and compliance evidence
     - Architecture diagrams and documentation

CI Attributes (standard fields):

  Identity:
    CI_ID: Unique identifier (auto-generated, format: HW-0001, SW-0001, SVC-0001)
    Name: Human-readable name
    Type: Category and sub-category
    Status: Planned, Staged, In Operation, Retired, Decommissioned

  Technical:
    Model/Version: Manufacturer and version
    Serial Number: Manufacturer serial (for hardware)
    IP Address: Primary IP (for networked devices)
    Hostname: Network name
    Location: Data center, rack, U-position (for hardware)
    Specifications: CPU, RAM, storage, network (for compute)

  Ownership:
    Technical Owner: Person responsible for operation
    Business Owner: Person responsible for business value
    Support Group: Team that provides L2/L3 support
    Cost Center: Budget allocation code

  Relationships:
    Depends On: List of CIs this CI depends on
    Supports: List of services this CI supports
    Contains: List of CIs contained within this CI
    Connected To: Network connections and dependencies

  Lifecycle:
    Date Acquired: Purchase or deployment date
    Expected End of Life: Planned retirement date
    Warranty Expiration: Manufacturer warranty end date
    Last Updated: Last modification to CI record
    Created By: Person who created the record
    Approved By: Person who approved the record
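
The standard fields above can be modeled as a small record type. The following is a minimal sketch, assuming an in-memory store and covering only a subset of the attributes; the class names `CIIdGenerator` and `ConfigurationItem` and the sample CIs are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count

# Lifecycle statuses listed under "Identity".
ALLOWED_STATUSES = {"Planned", "Staged", "In Operation", "Retired", "Decommissioned"}


class CIIdGenerator:
    """Issue sequential IDs per prefix, e.g. HW-0001, HW-0002, SW-0001."""

    def __init__(self) -> None:
        self._counters: dict[str, count] = {}

    def next_id(self, prefix: str) -> str:
        counter = self._counters.setdefault(prefix, count(1))
        return f"{prefix}-{next(counter):04d}"


@dataclass
class ConfigurationItem:
    ci_id: str
    name: str
    ci_type: str
    status: str = "Planned"
    technical_owner: str = ""
    business_owner: str = ""
    depends_on: list[str] = field(default_factory=list)  # CI_IDs this CI depends on
    supports: list[str] = field(default_factory=list)    # CI_IDs of services supported

    def __post_init__(self) -> None:
        # Reject statuses outside the documented lifecycle.
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"Unknown status: {self.status!r}")


ids = CIIdGenerator()
db_host = ConfigurationItem(ids.next_id("HW"), "db-host-01", "Server (physical)", "In Operation")
payroll = ConfigurationItem(
    ids.next_id("SVC"), "payroll", "Business service", "In Operation",
    depends_on=[db_host.ci_id],
)
print(db_host.ci_id, payroll.ci_id, payroll.depends_on)
```

Storing relationships as lists of CI_IDs rather than nested objects keeps each record flat, which tends to make reconciliation against discovery data simpler.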

CMDB Accuracy Metrics:

  Target: > 95% accuracy for critical CIs; > 85% for all CIs

  Measurement methods:
    1. Automated discovery reconciliation (quarterly)
       - Run discovery tool; compare against CMDB
       - Identify: missing CIs, stale CIs, incorrect attributes
       - Remediation: update CMDB within 5 business days

    2. Audit sample (quarterly)
       - Random sample of 50 CIs; verify each attribute
       - Accuracy = correct attributes / total attributes checked
       - Report results to IT management

    3. Change management correlation (ongoing)
       - Every change request must reference affected CIs
       - Post-change: verify CI attributes updated
       - Alert on changes without CMDB update
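
The first two measurement methods can be sketched in a few lines. This is an illustrative example, assuming both the discovery output and the CMDB export are dictionaries keyed by CI_ID; the function names and sample data are hypothetical.

```python
def reconcile(discovered: dict[str, dict], cmdb: dict[str, dict]) -> dict[str, list]:
    """Compare discovery output with CMDB records, both keyed by CI_ID."""
    missing = sorted(discovered.keys() - cmdb.keys())  # discovered but not recorded
    stale = sorted(cmdb.keys() - discovered.keys())    # recorded but not discovered
    mismatched = []
    for ci_id in sorted(discovered.keys() & cmdb.keys()):
        for attr, found in discovered[ci_id].items():
            recorded = cmdb[ci_id].get(attr)
            if recorded != found:
                mismatched.append((ci_id, attr, recorded, found))
    return {"missing": missing, "stale": stale, "mismatched": mismatched}


def audit_accuracy(correct_attributes: int, total_attributes: int) -> float:
    """Accuracy = correct attributes / total attributes checked."""
    if total_attributes <= 0:
        raise ValueError("total_attributes must be positive")
    return correct_attributes / total_attributes


# Hypothetical data for illustration only.
discovered = {
    "HW-0001": {"Hostname": "db-host-01", "IP Address": "10.0.0.5"},
    "HW-0003": {"Hostname": "web-03"},
}
cmdb = {
    "HW-0001": {"Hostname": "db-host-01", "IP Address": "10.0.0.4"},
    "HW-0002": {"Hostname": "old-host"},
}

report = reconcile(discovered, cmdb)
print(report)

accuracy = audit_accuracy(correct_attributes=478, total_attributes=500)
print(f"accuracy={accuracy:.1%}, meets critical target: {accuracy > 0.95}")
```

Only attributes that the discovery tool reports are compared, so manually maintained fields such as Business Owner still need the sampled audit.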

Documentation Standards

DOCUMENTATION STANDARDS
=========================

Document types and templates:

  1. System Architecture Document (SAD):
     Purpose: Describe system architecture, components, and interactions
     Audience: Engineers, architects, new team members
     Format: Markdown or Confluence page
     Sections:
       - Overview (1 paragraph)
       - Architecture diagram (visual)
       - Component inventory (table)
       - Data flow description
       - Integration points (APIs, message queues)
       - Security architecture
       - Scaling and performance characteristics
       - Known limitations and technical debt
     Review cycle: Quarterly or on significant change
     Owner: System architect or lead engineer

  2. Operational Runbook:
     Purpose: Step-by-step procedures for routine operations
     Audience: On-call engineers, operations team
     Format: Markdown with numbered steps; copy-paste commands
     Sections:
       - Procedure name and description
       - Prerequisites (access, tools, approvals needed)
       - Estimated time to complete
       - Step-by-step procedure (numbered, with commands)
       - Expected output for each step
       - Verification steps (confirm success)
       - Rollback procedure (if something goes wrong)
       - Escalation path (when to call for help)
     Review cycle: Semi-annually or after each incident
     Owner: Operations team lead

  3. Incident Response Playbook:
     Purpose: Procedures for specific incident types
     Audience: Incident responders, on-call team
     Format: Markdown; link from alert definitions
     Sections:
       - Incident type and trigger conditions
       - Severity classification
       - Immediate actions (first 15 minutes)
       - Diagnosis steps (structured investigation)
       - Resolution options (prioritized by speed/risk)
       - Verification (confirm resolution)
       - Post-incident actions (PIR, monitoring updates)
     Review cycle: After each incident; minimum annually
     Owner: SRE team lead

  4. Knowledge Base Article:
     Purpose: Troubleshooting guide for common issues
     Audience: Support team, end users, engineers
     Format: Structured FAQ with search-friendly content
     Sections:
       - Title (searchable, includes error codes/symptoms)
       - Symptoms (what the user sees)
       - Root cause (why it happens)
       - Resolution (step-by-step fix)
       - Prevention (how to avoid recurrence)
       - Related articles (links)
     Review cycle: Annually or when issue pattern changes
     Owner: Subject matter expert

  5. Standard Operating Procedure (SOP):
     Purpose: Formal process documentation for compliance
     Audience: Auditors, management, process owners
     Format: Document with version control and approval
     Sections:
       - Document ID, version, date
       - Purpose and scope
       - Roles and responsibilities (RACI matrix)
       - Procedure (detailed steps)
       - References (related policies, regulations)
       - Revision history
       - Approval signatures
     Review cycle: Annually or on regulatory change
     Owner: Process owner

  6. Project Documentation:
     Purpose: Track project scope, decisions, and outcomes
     Audience: Project team, stakeholders, auditors
     Format: Project folder with structured documents
     Sections:
       - Project charter (scope, objectives, timeline, budget)
       - Design documents (technical specifications)
       - Meeting notes and decisions
       - Test plans and results
       - Go-live checklist
       - Post-implementation review
     Review cycle: At project closure; archive after 1 year
     Owner: Project manager
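
The review cycles above can be enforced with a simple overdue check. The sketch below assumes each document carries a type and a last-reviewed date; the day counts approximate the stated cycles (quarterly as 91 days, semi-annually as 182), and the function name and sample documents are hypothetical. Project documentation is left out because it is reviewed at closure rather than on a recurring cycle.

```python
from datetime import date, timedelta

# Maximum time between reviews per document type, approximated in days.
REVIEW_INTERVAL_DAYS = {
    "System Architecture Document": 91,    # quarterly
    "Operational Runbook": 182,            # semi-annually
    "Incident Response Playbook": 365,     # minimum annually
    "Knowledge Base Article": 365,         # annually
    "Standard Operating Procedure": 365,   # annually
}


def overdue_documents(docs: list[dict], today: date) -> list[tuple[str, date]]:
    """Return (title, due date) for documents whose review date has passed, oldest first."""
    overdue = []
    for doc in docs:
        interval = timedelta(days=REVIEW_INTERVAL_DAYS[doc["type"]])
        due = doc["last_reviewed"] + interval
        if due < today:
            overdue.append((doc["title"], due))
    return sorted(overdue, key=lambda item: item[1])


# Hypothetical documents for illustration only.
docs = [
    {"title": "PostgreSQL failover runbook", "type": "Operational Runbook",
     "last_reviewed": date(2024, 1, 10)},
    {"title": "Billing platform SAD", "type": "System Architecture Document",
     "last_reviewed": date(2024, 6, 1)},
]

for title, due in overdue_documents(docs, today=date(2024, 8, 1)):
    print(f"{title}: review was due {due.isoformat()}")
```

Event-driven triggers such as "after each incident" or "on significant change" are not covered by a date check and would need a hook in the incident or change process.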

Documentation quality checklist:

  ☐ Clear and concise language (avoid jargon where possible)
  ☐ Step-by-step instructions with expected outcomes
  ☐ Screenshots or diagrams where helpful
  ☐ Commands are copy-paste ready (tested and verified)
  ☐ Version number and last updated date visible
  ☐ Owner/contact person identified
  ☐ Links to related documentation
  ☐ Search-friendly title and tags
  ☐ Accessible to intended audience (appropriate technical level)
  ☐ Reviewed and approved by subject matter expert
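
Some checklist items can be checked automatically before a human review. The following sketch lints a Markdown runbook for the sections from the runbook template and for the version, date, and owner items above; the exact heading names, the `Key: value` metadata convention, and the function name `lint_runbook` are assumptions made for illustration.

```python
import re

# Section names assumed to appear as Markdown headings in a runbook.
REQUIRED_RUNBOOK_SECTIONS = (
    "Prerequisites", "Estimated time", "Procedure",
    "Verification", "Rollback", "Escalation",
)
# Metadata assumed to appear as "Key: value" lines.
REQUIRED_METADATA = ("Version", "Last updated", "Owner")


def lint_runbook(text: str) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^#+[ \t]+(.+)$", text, re.MULTILINE)
    }
    for section in REQUIRED_RUNBOOK_SECTIONS:
        if not any(heading.startswith(section.lower()) for heading in headings):
            problems.append(f"missing section: {section}")
    for key in REQUIRED_METADATA:
        pattern = rf"^{re.escape(key)}[ \t]*:[ \t]*\S"
        if not re.search(pattern, text, re.MULTILINE | re.IGNORECASE):
            problems.append(f"missing metadata: {key}")
    return problems


# Hypothetical draft that is missing several required parts.
draft = """# Restart web tier
Version: 1.2
Owner: ops-team

## Prerequisites
## Procedure
## Verification
"""

for problem in lint_runbook(draft):
    print(problem)
```

A lint like this only confirms that the structure is present; whether commands are tested and the language is clear still requires a subject matter expert.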

Runbook Examples

RUNBOOK: Database Failover (PostgreSQL)
=========================================

Trigger: Primary database unresponsive or degraded for > 60 seconds
Severity: SEV-1 (if production)

Prerequisites:
  - SSH access to database servers
  - psql client installed
  - Patroni cluster access
  - Notification channels available (Slack, PagerDuty)

Estimated time: 5–15 minutes

Step 1: Verify Failure

  # Check if primary is responding
  psql -h primary-db.internal -U monitoring -c "SELECT 1;"

  # Check Patroni status
  patronictl -c /etc/patroni/patroni.yml list

  # Check the last 100 lines of the PostgreSQL log
  tail -n 100 /var/log/postgresql/postgresql-14-main.log

  Expected: Connection refused or timeout; Patroni shows primary as stopped

Step 2: Initiate Failover

  # Automatic failover (Patroni)
  # Patroni should auto-failover within 30 seconds

  # If Patroni has not failed over automatically, trigger it manually
  # (the positional argument is the Patroni cluster name):
  patronictl -c /etc/patroni/patroni.yml failover primary-db

  # Verify new primary
  patronictl -c /etc/patroni/patroni.yml list

  Expected: New primary elected; cluster shows one leader

Step 3: Update Connection Strings

  # If using virtual IP / service discovery, no action needed
  # If using static connection strings, update:
  #   - Application configuration
  #   - Connection poolers (PgBouncer)
  #   - ETL jobs and scheduled tasks

  # Verify application connectivity
  curl -s http://app-health-endpoint/health

Step 4: Verify Data Integrity

  # Connect to new primary
  psql -h new-primary.internal -U monitoring

  -- Check replication status (empty until a replica reconnects to the new primary)
  SELECT * FROM pg_stat_replication;

  -- Check database size (compare with pre-failover)
  SELECT pg_database_size('production_db');

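Step 5: Restore Old Primary as Replica

  # Once old primary is recovered, restart it through Patroni
  # (restart takes the cluster name first, then the member name;
  # primary-db is assumed to be the cluster name, as in Step 2):
  patronictl -c /etc/patroni/patroni.yml restart primary-db old-primary

  # Verify it joins as replica
  patronictl -c /etc/patroni/patroni.yml list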

Step 6: Verification
  - Application health check passes
  - Error rate returned to baseline (< 0.1%)
  - Database queries executing normally
  - Monitoring alerts cleared

Rollback:
  If the failover causes issues, switch back manually to the original primary:
  patronictl -c /etc/patroni/patroni.yml switchover --leader new-primary --candidate old-primary

Escalation:
  If failover fails after 15 minutes: escalate to DBA team lead
  If data loss suspected: escalate to VP Engineering immediately
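
The "cluster shows one leader" check in Steps 2 and 5 can be scripted. The sketch below assumes that `patronictl list -f json` emits a list of objects with `Member`, `Role`, and `State` keys and that healthy members report `running` or `streaming`; field names and values can differ between Patroni versions, so verify them against your installation. The function name and sample output are hypothetical.

```python
import json


def summarize_cluster(members: list[dict]) -> dict:
    """Summarize member roles and states from parsed `patronictl list -f json` output."""
    leaders = [m["Member"] for m in members if m.get("Role") == "Leader"]
    not_running = [
        m["Member"] for m in members
        if m.get("State") not in ("running", "streaming")
    ]
    return {
        "leaders": leaders,
        "not_running": not_running,
        # Healthy means exactly one leader and every member up.
        "healthy": len(leaders) == 1 and not not_running,
    }


# Hypothetical output captured after failover, before the old primary is restored.
sample_output = """[
  {"Member": "pg-node-2", "Role": "Leader", "State": "running"},
  {"Member": "pg-node-1", "Role": "Replica", "State": "stopped"}
]"""

summary = summarize_cluster(json.loads(sample_output))
print(summary)
```

In practice the JSON would come from running the command with `subprocess.run` on a cluster node; the parsing is kept separate here so the check can be exercised without a live cluster.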

Integration Points

Edge Cases