Support AI Skill
Proactive Service Notifications
Proactively notify customers about outages, maintenance, and known issues before they report them. Manage incident communications, status pages, and post-mortem sharing to maintain trust and reduce inbound support volume. Use when setting up outage alerts,...
Workflow
Incident Communication Process
Trigger: Service degradation detected, scheduled maintenance, or a customer-impacting incident.
- Incident detection: Engineering or monitoring system detects issue; classify severity (P1–P4); identify affected services and customer segments.
- Initial assessment: Within 15 minutes — root cause hypothesis; estimated resolution time; impact scope (% of customers affected).
- Notification drafting: Use approved template; include: what's affected, impact description, ETA, what customer should do; maintain transparent but reassuring tone.
- Multi-channel broadcast (channels by severity):
  - P1 (Critical): Email + in-app banner + status page + SMS (enterprise customers only) + social media
  - P2 (High): Email + in-app banner + status page + social media
  - P3 (Medium): In-app banner + status page
  - P4 (Low): Status page only
- Update cadence: P1 — every 30 minutes; P2 — every 60 minutes; P3 — every 2 hours; all updates include progress, revised ETA, next update time.
- Resolution notification: "All clear" message — what was affected, root cause (brief), what was fixed, prevention steps, any compensation (if SLA breached).
- Post-incident review: Detailed post-mortem (timeline, root cause, impact, action items); publish publicly for P1 (within 48 hours) and P2 (within 72 hours); keep internal for P3 (within 1 week); none required for P4.
- Support coordination: Alert support team of incident; prepare canned responses; flag incoming related tickets as "known issue"; auto-close tickets when resolved.
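The broadcast and cadence rules above can be expressed as a small lookup. The following is a minimal sketch, not a definitive implementation: the channel identifiers (such as `sms_enterprise`) and the function names `channels_for` and `next_update_at` are hypothetical, and only the severity-to-channel mapping and update intervals come from the workflow itself.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

# Channels per severity, taken from the multi-channel broadcast step.
# The channel identifiers are illustrative names, not a real API.
CHANNELS = {
    "P1": ["email", "in_app_banner", "status_page", "sms_enterprise", "social_media"],
    "P2": ["email", "in_app_banner", "status_page", "social_media"],
    "P3": ["in_app_banner", "status_page"],
    "P4": ["status_page"],
}

# Minutes between customer updates; P4 has no scheduled cadence.
UPDATE_INTERVAL_MIN = {"P1": 30, "P2": 60, "P3": 120, "P4": None}


def channels_for(severity: str, is_enterprise: bool) -> list[str]:
    """Return the channels for one customer; SMS is for enterprise customers only."""
    channels = CHANNELS[severity]
    if not is_enterprise:
        channels = [c for c in channels if c != "sms_enterprise"]
    return list(channels)


def next_update_at(severity: str, last_update: datetime) -> datetime | None:
    """Return when the next update is due, or None if no cadence applies."""
    interval = UPDATE_INTERVAL_MIN[severity]
    if interval is None:
        return None
    return last_update + timedelta(minutes=interval)


if __name__ == "__main__":
    sent = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(channels_for("P1", is_enterprise=False))
    print(next_update_at("P1", sent))
```

Keeping the mapping in data rather than in branching logic makes it easier to review against the severity matrix when the policy changes.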
Notification Templates
INCIDENT NOTIFICATION TEMPLATES
=================================
Template 1: Initial Alert (P1 Critical)
Subject: [Service Alert] We're experiencing issues with [Service Name]
Body:
Hi [Customer Name],
We're currently experiencing an issue with [Service Name] that is affecting [describe impact — e.g., "the ability to process payments"].
What we know:
- Started: [Time, timezone]
- Affected: [X]% of users / [Specific region/product]
- Status: Investigating
What we're doing:
Our engineering team is actively investigating and working on a resolution.
Next update: Within 30 minutes, by [Time + 30 min]
You can track progress on our status page: [link]
We apologize for the inconvenience and appreciate your patience.
— [Company] Support Team
Template 2: Progress Update
Subject: [Update] [Service Name] — Investigation in Progress
Body:
Hi [Customer Name],
Here's an update on the [Service Name] issue:
What changed:
- Root cause: [brief explanation, or "Still investigating"]
- Impact: [updated scope]
- ETA for resolution: [Time] or "Still investigating"
What you can do:
[Workaround if available, or "No action needed — we're working on it"]
Next update: [Time]
Status page: [link]
— [Company] Support Team
Template 3: Resolution Notification
Subject: [Resolved] [Service Name] is back to normal
Body:
Hi [Customer Name],
The issue with [Service Name] has been resolved. Service is back to normal as of [Time].
What happened:
[Brief, non-technical explanation of root cause]
What we fixed:
[Brief description of fix]
What we're doing to prevent this:
[1–2 action items from post-mortem]
[If SLA breached]:
Your SLA credit of [amount] will be applied to your account within [X] business days.
We appreciate your patience and understanding. If you experience any ongoing issues, please reply to this email or contact support.
— [Company] Support Team
Template 4: Scheduled Maintenance
Subject: [Notice] Scheduled maintenance on [Date] — [Service Name]
Body:
Hi [Customer Name],
We'll be performing scheduled maintenance on [Service Name] on:
Date: [Date]
Time: [Start time] – [End time] [timezone]
Expected impact:
[Service will be unavailable / Degraded performance / No impact]
What you should do:
[Save work before start time / No action needed / Alternative process]
If this maintenance window doesn't work for you, please contact us by [deadline] to discuss options.
Details: [link to maintenance page]
— [Company] Support Team
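One practical risk with bracketed templates like these is sending a message with a placeholder left unfilled. The sketch below shows one way to guard against that, assuming every `[...]` span is a placeholder unless explicitly listed as a literal tag (such as the `[Service Alert]` subject prefix). The function name `fill`, the `keep` parameter, and the sample value "Payments API" are hypothetical.

```python
from __future__ import annotations

import re

# Subject line copied from Template 1.
INITIAL_ALERT_SUBJECT = "[Service Alert] We're experiencing issues with [Service Name]"

# Matches one bracketed span such as [Service Name].
PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")


def fill(template: str, values: dict[str, str], keep: frozenset[str] = frozenset()) -> str:
    """Replace [Name] placeholders; raise if any placeholder is left unfilled.

    Names listed in `keep` are literal tags and are left in the output unchanged.
    """
    missing: list[str] = []

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name in values:
            return values[name]
        if name not in keep:
            missing.append(name)
        return match.group(0)

    rendered = PLACEHOLDER.sub(substitute, template)
    if missing:
        raise KeyError(f"unfilled placeholders: {missing}")
    return rendered


if __name__ == "__main__":
    subject = fill(
        INITIAL_ALERT_SUBJECT,
        {"Service Name": "Payments API"},
        keep=frozenset({"Service Alert"}),
    )
    print(subject)
```

Failing loudly on a missing value is a deliberate choice here: during an incident it is safer for a send to be blocked than for customers to receive "Hi [Customer Name]".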
Incident Severity Classification
INCIDENT SEVERITY MATRIX
==========================
P1 — Critical
Criteria: Complete service outage; data loss risk; security breach; >25% of customers affected
Response: War room assembled within 15 minutes; CEO/CTO notified; hourly executive updates
Communication: Email + in-app + status page + SMS (enterprise) + social media
Update frequency: Every 30 minutes
Target resolution: 2 hours
Post-mortem: Public; within 48 hours
P2 — High
Criteria: Major feature unavailable; degraded performance; 10–25% of customers affected
Response: Engineering on-call engaged within 30 minutes; VP notified
Communication: Email + in-app + status page + social media
Update frequency: Every 60 minutes
Target resolution: 4 hours
Post-mortem: Public; within 72 hours
P3 — Medium
Criteria: Minor feature issue; limited customer impact; <10% affected
Response: Engineering team triaged within 2 hours
Communication: In-app banner + status page
Update frequency: Every 2 hours
Target resolution: 8 hours
Post-mortem: Internal; within 1 week
P4 — Low
Criteria: Cosmetic issue; very limited impact; workarounds available
Response: Normal triage; bug logged
Communication: Status page only
Update frequency: None (resolved in next deployment)
Target resolution: Next release cycle
Post-mortem: None required
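The criteria in the matrix can be sketched as a classification function. This is an illustration under stated assumptions: the matrix does not say how criteria combine, so the sketch assumes any single P1 or P2 criterion is enough to assign that level, and that a cosmetic-only issue below the 10% threshold is P4. The function and parameter names are hypothetical.

```python
def classify_severity(
    pct_customers_affected: float,
    *,
    full_outage: bool = False,
    data_loss_risk: bool = False,
    security_breach: bool = False,
    major_feature_unavailable: bool = False,
    degraded_performance: bool = False,
    cosmetic_only: bool = False,
) -> str:
    """Map incident facts to P1..P4 using the thresholds in the severity matrix."""
    # P1: complete outage, data loss risk, security breach, or >25% affected.
    if full_outage or data_loss_risk or security_breach or pct_customers_affected > 25:
        return "P1"
    # P2: major feature unavailable, degraded performance, or 10-25% affected.
    if major_feature_unavailable or degraded_performance or pct_customers_affected >= 10:
        return "P2"
    # P4: cosmetic issue with very limited impact (assumed to require <10% affected).
    if cosmetic_only:
        return "P4"
    # P3: minor feature issue with <10% affected.
    return "P3"


if __name__ == "__main__":
    print(classify_severity(30))                       # more than 25% affected
    print(classify_severity(5))                        # limited impact
    print(classify_severity(2, security_breach=True))  # breach overrides percentage
```

Checking the most severe criteria first means a small-percentage incident with a security breach is still escalated to P1.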
Edge Cases
- Extended outage (no ETA after 4+ hours):
  - Communication: Shift from "we're working on it" to "here's what we know and don't know"; be transparent about uncertainty
  - Escalation: Executive involvement (CEO/CTO sends personal note to enterprise customers)
  - Compensation: Pre-approve SLA credits; consider proactive credits (don't wait for customers to ask)
  - Alternatives: Provide a workaround or alternative service; in extreme cases, temporary access to a competitor service (rare)
  - Cadence: Increase update frequency to every 15 minutes to show active engagement
- Cascading incidents (one incident triggers multiple system failures):
  - Communication: Single incident thread with sub-issues; avoid multiple separate emails
  - Priority: Address customer-facing impact first, even if the root cause is in a backend system
  - Coordination: Single incident commander; unified communication channel
  - Example: Database migration fails → API down → payments fail → reporting dashboard errors
  - Customer message: "We're experiencing issues with our payment processing system" (not technical details)
- Maintenance window overrun (maintenance takes longer than expected):
  - Communication: Alert 30 minutes before planned end time if overrun expected
  - Transparency: Explain why the overrun occurred; provide a revised ETA
  - Compensation: Consider credit if significant overrun (> 2 hours) even for P3
  - Prevention: Add 50% buffer to maintenance window estimates; schedule during lowest-traffic hours
- Regional outage (only certain geographies affected):
  - Targeting: Use customer data to segment notifications (only notify affected regions)
  - Accuracy: Verify affected regions before broadcasting; notifying unaffected customers creates unnecessary concern
  - Time zone: State times in the recipient's local time zone when possible; the status page is always available
  - Example: "This issue affects users in the EU region"
- False positive detection (monitoring alerts but no actual customer impact):
  - Verification: Confirm customer impact before sending notifications (avoid the "boy who cried wolf" effect)
  - Threshold: Send notifications only when confirmed customer-facing impact exists
  - Internal vs. external: Use internal alerts for potential issues; external notifications only for confirmed impact
  - Recovery: If a notification was sent but impact was minimal, send a brief "false alarm" message and apologize
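The maintenance-overrun rules above involve some simple time arithmetic: a 50% buffer on the estimate, a warning 30 minutes before the planned end, and a credit review when the overrun exceeds 2 hours. A minimal sketch follows, with hypothetical function names; only the three numeric values come from the text.

```python
from datetime import datetime, timedelta


def padded_window(start: datetime, estimated: timedelta, buffer_ratio: float = 0.5):
    """Return (start, end) for the announced window, with the 50% buffer added."""
    return start, start + estimated * (1 + buffer_ratio)


def overrun_warning_time(planned_end: datetime, lead: timedelta = timedelta(minutes=30)) -> datetime:
    """Latest time to alert customers if an overrun is expected."""
    return planned_end - lead


def credit_review_needed(
    planned_end: datetime,
    actual_end: datetime,
    threshold: timedelta = timedelta(hours=2),
) -> bool:
    """True when the overrun is strictly longer than the threshold (> 2 hours)."""
    return actual_end - planned_end > threshold


if __name__ == "__main__":
    start = datetime(2024, 3, 2, 2, 0)
    # A 2-hour engineering estimate is announced as a 3-hour window.
    window_start, window_end = padded_window(start, timedelta(hours=2))
    print(window_start, window_end)
    print(overrun_warning_time(window_end))
    print(credit_review_needed(window_end, window_end + timedelta(hours=3)))
```

The sketch uses naive datetimes for brevity; a real scheduler would likely use timezone-aware values so that the announced window matches the timezone shown in the maintenance notice.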
Integration Points
- Monitoring tools: Datadog, PagerDuty, New Relic, Sentry — incident detection, alerting
- Status page platforms: Atlassian Statuspage, Better Uptime, Statuspal — public status page, subscriber notifications
- Email platforms: SendGrid, Mailgun, Amazon SES — notification delivery
- In-app messaging: Intercom, Customer.io, Iterable — in-app banners, contextual notifications
- Social media: Twitter/X, LinkedIn — public incident updates
- Help desk: Zendesk, Freshdesk — ticket flagging, canned responses, auto-close related tickets
- CRM: Salesforce, HubSpot — customer segmentation, enterprise customer identification
- Collaboration: Slack, Teams — internal incident war room, engineering coordination
- Incident management: Jira Service Management, ServiceNow ITSM — incident tracking, post-mortem
- Data warehouse: Snowflake, BigQuery — incident analytics, MTTR tracking, trend analysis