---
name: proactive-service-notifications
description: Proactively notify customers about outages, maintenance, and known issues before they report them. Manage incident communications, status pages, and post-mortem sharing to maintain trust and reduce inbound support volume. Use when setting up outage alerts, managing maintenance windows, publishing status updates, conducting post-incident reviews, or reducing support ticket spikes during incidents. Triggers on phrases like "service notification", "outage alert", "maintenance window", "incident communication", "status page", "system status", "proactive alert", "post-mortem", "incident update", "service degradation".
---

# Proactive Service Notifications

Notify customers about outages, maintenance, and known issues before they report them.

## Workflow

### Incident Communication Process

Trigger: service degradation detected, scheduled maintenance, or a customer-impacting incident.

1. **Incident detection**: Engineering or monitoring system detects issue; classify severity (P1–P4); identify affected services and customer segments.
2. **Initial assessment**: Within 15 minutes — root cause hypothesis; estimated resolution time; impact scope (% of customers affected).
3. **Notification drafting**: Use approved template; include: what's affected, impact description, ETA, what customer should do; maintain transparent but reassuring tone.
4. **Multi-channel broadcast**:
   - P1 (Critical): Email + in-app banner + status page + SMS (enterprise customers only) + social media
   - P2 (High): Email + in-app banner + status page + social media
   - P3 (Medium): In-app banner + status page
   - P4 (Low): Status page only
5. **Update cadence**: P1: every 30 minutes; P2: every 60 minutes; P3: every 2 hours; P4: no scheduled updates. Every update includes progress, a revised ETA, and the next update time.
6. **Resolution notification**: "All clear" message — what was affected, root cause (brief), what was fixed, prevention steps, any compensation (if SLA breached).
7. **Post-incident review**: Write a detailed post-mortem (timeline, root cause, impact, action items) within 48 hours for P1, 72 hours for P2, and 1 week for P3; publish it for P1/P2, keep it internal for P3, and skip it for P4.
8. **Support coordination**: Alert support team of incident; prepare canned responses; flag incoming related tickets as "known issue"; auto-close tickets when resolved.
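
The channel and cadence rules in steps 4 and 5 can be sketched as a small lookup. This is a minimal illustration, not a prescribed implementation: the names `CHANNELS`, `UPDATE_INTERVAL`, `channels_for`, and `next_update_at` are hypothetical, and treating P4 as having no scheduled updates is an assumption taken from the severity matrix.

```python
from datetime import datetime, timedelta
from typing import List, Optional

# Channels per severity, copied from step 4 of the workflow.
CHANNELS = {
    "P1": ["email", "in_app_banner", "status_page", "sms", "social_media"],
    "P2": ["email", "in_app_banner", "status_page", "social_media"],
    "P3": ["in_app_banner", "status_page"],
    "P4": ["status_page"],
}

# Update cadence from step 5. P4 has no scheduled updates (assumption
# based on the severity matrix).
UPDATE_INTERVAL = {
    "P1": timedelta(minutes=30),
    "P2": timedelta(minutes=60),
    "P3": timedelta(hours=2),
    "P4": None,
}


def channels_for(severity: str, is_enterprise: bool) -> List[str]:
    """Return the channels for one customer; SMS goes to enterprise customers only."""
    return [c for c in CHANNELS[severity] if c != "sms" or is_enterprise]


def next_update_at(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next update is due, or None if none is scheduled."""
    interval = UPDATE_INTERVAL[severity]
    return None if interval is None else last_update + interval
```

Keeping both tables as plain data makes it easy to review them against the severity matrix whenever the policy changes.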

### Notification Templates

```
INCIDENT NOTIFICATION TEMPLATES
=================================

Template 1: Initial Alert (P1 Critical)
Subject: [Service Alert] We're experiencing issues with [Service Name]

Body:
Hi [Customer Name],

We're currently experiencing an issue with [Service Name] that is affecting [describe impact — e.g., "the ability to process payments"].

What we know:
  - Started: [Time, timezone]
  - Affected: [X]% of users / [Specific region/product]
  - Status: Investigating

What we're doing:
  Our engineering team is actively investigating and working on a resolution.

Next update: Within 30 minutes, by [Time + 30 min]

You can track progress on our status page: [link]

We apologize for the inconvenience and appreciate your patience.

— [Company] Support Team

Template 2: Progress Update
Subject: [Update] [Service Name] — Investigation in Progress

Body:
Hi [Customer Name],

Here's an update on the [Service Name] issue:

What changed:
  - We've identified the root cause: [brief explanation]
  - Impact: [updated scope]
  - ETA for resolution: [Time] or "Still investigating"

What you can do:
  [Workaround if available, or "No action needed — we're working on it"]

Next update: [Time]

Status page: [link]

— [Company] Support Team

Template 3: Resolution Notification
Subject: [Resolved] [Service Name] is back to normal

Body:
Hi [Customer Name],

The issue with [Service Name] has been resolved. Service is back to normal as of [Time].

What happened:
  [Brief, non-technical explanation of root cause]

What we fixed:
  [Brief description of fix]

What we're doing to prevent this:
  [1–2 action items from post-mortem]

[If SLA breached]:
  Your SLA credit of [amount] will be applied to your account within [X] business days.

We appreciate your patience and understanding. If you experience any ongoing issues, please reply to this email or contact support.

— [Company] Support Team

Template 4: Scheduled Maintenance
Subject: [Notice] Scheduled maintenance on [Date] — [Service Name]

Body:
Hi [Customer Name],

We'll be performing scheduled maintenance on [Service Name] on:
  Date: [Date]
  Time: [Start time] – [End time] [timezone]

Expected impact:
  [Service will be unavailable / Degraded performance / No impact]

What you should do:
  [Save work before start time / No action needed / Alternative process]

If this maintenance window doesn't work for you, please contact us by [deadline] to discuss options.

Details: [link to maintenance page]

— [Company] Support Team
```
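
One way to fill these templates programmatically is Python's standard `string.Template`. The sketch below renders a shortened version of Template 1. It assumes the bracketed placeholders have been rewritten as `$`-style fields; the field names and the `render_initial_alert` helper are hypothetical. `Template.substitute` raises `KeyError` when a field is missing, which helps avoid sending a half-filled alert.

```python
from datetime import datetime, timedelta, timezone
from string import Template
from typing import Dict

# Shortened form of Template 1, with $-style fields (assumed naming).
INITIAL_ALERT = Template(
    "Subject: [Service Alert] We're experiencing issues with $service_name\n"
    "\n"
    "Hi $customer_name,\n"
    "\n"
    "We're currently experiencing an issue with $service_name "
    "that is affecting $impact.\n"
    "\n"
    "What we know:\n"
    "  - Started: $started\n"
    "  - Affected: $affected\n"
    "  - Status: Investigating\n"
    "\n"
    "Next update: Within 30 minutes, by $next_update\n"
    "\n"
    "You can track progress on our status page: $status_url\n"
)


def _fmt(moment: datetime) -> str:
    # Expects a timezone-aware datetime; converts to UTC before formatting.
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")


def render_initial_alert(fields: Dict[str, str], started_at: datetime,
                         sent_at: datetime) -> str:
    """Fill the P1 initial alert; raises KeyError if a field is missing."""
    values = dict(fields)
    values["started"] = _fmt(started_at)
    # P1 cadence: the next update is due 30 minutes after this message.
    values["next_update"] = _fmt(sent_at + timedelta(minutes=30))
    return INITIAL_ALERT.substitute(values)
```

Computing the next-update time from the send time, rather than typing it by hand, keeps the promise in the message consistent with the P1 cadence.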

### Incident Severity Classification

```
INCIDENT SEVERITY MATRIX
==========================

P1 — Critical
  Criteria: Complete service outage; data loss risk; security breach; >25% of customers affected
  Response: War room assembled within 15 minutes; CEO/CTO notified; hourly executive updates
  Communication: Email + in-app + status page + SMS (enterprise) + social media
  Update frequency: Every 30 minutes
  Target resolution: 2 hours
  Post-mortem: Public; within 48 hours

P2 — High
  Criteria: Major feature unavailable; degraded performance; 10–25% of customers affected
  Response: Engineering on-call engaged within 30 minutes; VP notified
  Communication: Email + in-app + status page + social media
  Update frequency: Every 60 minutes
  Target resolution: 4 hours
  Post-mortem: Public; within 72 hours

P3 — Medium
  Criteria: Minor feature issue; limited customer impact; <10% affected
  Response: Engineering team triaged within 2 hours
  Communication: In-app banner + status page
  Update frequency: Every 2 hours
  Target resolution: 8 hours
  Post-mortem: Internal; within 1 week

P4 — Low
  Criteria: Cosmetic issue; very limited impact; workarounds available
  Response: Normal triage; bug logged
  Communication: Status page only
  Update frequency: None (resolved in next deployment)
  Target resolution: Next release cycle
  Post-mortem: None required
```
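
The matrix can be approximated by a small classifier. This sketch makes two assumptions the matrix does not state: the criteria within a level are combined with OR, and the highest matching severity wins. The `Impact` fields and the `classify` function are hypothetical names, and the "degraded performance" criterion is left out because the matrix gives no threshold for it.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    pct_customers_affected: float
    complete_outage: bool = False
    data_loss_risk: bool = False
    security_breach: bool = False
    major_feature_unavailable: bool = False
    cosmetic_only: bool = False


def classify(impact: Impact) -> str:
    """Map an impact assessment to P1-P4, checking the most severe level first."""
    if (impact.complete_outage or impact.data_loss_risk
            or impact.security_breach
            or impact.pct_customers_affected > 25):
        return "P1"
    # Exactly 25% falls in the 10-25% band, so it is P2 rather than P1.
    if impact.major_feature_unavailable or impact.pct_customers_affected >= 10:
        return "P2"
    if impact.cosmetic_only:
        return "P4"
    return "P3"
```

Checking the most severe level first means a cosmetic issue that somehow affects 40% of customers is still treated as P1.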

## Edge Cases

- **Extended outage** (no ETA after 4+ hours):
  - Communication: Shift from "we're working on it" to "here's what we know and don't know"; be transparent about uncertainty
  - Escalation: Executive involvement (CEO/CTO sends personal note to enterprise customers)
  - Compensation: Pre-approve SLA credits; consider proactive credits (don't wait for customer to ask)
  - Alternatives: Provide workaround or alternative service; temporary access to competitor service (rare but seen in extreme cases)
  - Cadence: Increase update frequency to every 15 minutes to show active engagement

- **Cascading incidents** (one incident triggers multiple system failures):
  - Communication: Single incident thread with sub-issues; avoid multiple separate emails
  - Priority: Address customer-facing impact first, even if the root cause lies in a backend system
  - Coordination: Single incident commander; unified communication channel
  - Example: Database migration fails → API down → payments fail → reporting dashboard errors
  - Customer message: "We're experiencing issues with our payment processing system" (not technical details)

- **Maintenance window overrun** (maintenance takes longer than expected):
  - Communication: Alert 30 minutes before planned end time if overrun expected
  - Transparency: Explain why overrun occurred; provide revised ETA
  - Compensation: Consider a credit for a significant overrun (more than 2 hours), even for P3
  - Prevention: Add 50% buffer to maintenance window estimates; schedule during lowest-traffic hours

- **Regional outage** (only certain geographies affected):
  - Targeting: Use customer data to segment notifications (only notify affected regions)
  - Accuracy: Verify affected regions before broadcasting; notifying unaffected customers creates unnecessary concern
  - Time zone: State times in the recipient's local time zone when possible; the status page stays available to everyone
  - Example: "This issue affects users in the EU region"

- **False positive detection** (monitoring alerts but no actual customer impact):
  - Verification: Confirm customer impact before sending notifications (avoid "boy who cried wolf")
  - Threshold: Send notifications only when confirmed customer-facing impact exists
  - Internal vs. external: Use internal alerts for potential issues; external notifications only for confirmed impact
  - Recovery: If a notification was sent but the impact was minimal, send a brief "false alarm" message and apologize
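
Several of these edge-case rules reduce to small, testable functions. The sketch below covers three of them: the 15-minute cadence for extended outages, the 50% maintenance buffer, and region-targeted notification gated on confirmed impact. All names are hypothetical, and representing customers as dictionaries with a `region` key is an assumption.

```python
from datetime import timedelta
from typing import Dict, List, Optional, Set

BASE_INTERVAL = {
    "P1": timedelta(minutes=30),
    "P2": timedelta(minutes=60),
    "P3": timedelta(hours=2),
}


def update_interval(severity: str, elapsed: timedelta,
                    has_eta: bool) -> Optional[timedelta]:
    """Return the update interval, tightened to 15 minutes for extended outages."""
    base = BASE_INTERVAL.get(severity)
    if base is None:
        return None  # P4: no scheduled updates
    # Extended outage: 4+ hours elapsed and still no ETA.
    if elapsed >= timedelta(hours=4) and not has_eta:
        return timedelta(minutes=15)
    return base


def padded_window(estimate: timedelta) -> timedelta:
    """Add the 50% buffer to a maintenance window estimate."""
    return estimate * 1.5


def recipients(customers: List[Dict[str, str]], affected_regions: Set[str],
               impact_confirmed: bool) -> List[Dict[str, str]]:
    """Select customers to notify: none until impact is confirmed,
    then only those in affected regions."""
    if not impact_confirmed:
        return []
    return [c for c in customers if c["region"] in affected_regions]
```

Returning an empty list while impact is unconfirmed keeps unverified monitoring alerts internal.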

## Integration Points

- **Monitoring tools**: Datadog, PagerDuty, New Relic, Sentry — incident detection, alerting
- **Status page platforms**: Atlassian Statuspage, Better Uptime, Statuspal — public status page, subscriber notifications
- **Email platforms**: SendGrid, Mailgun, Amazon SES — notification delivery
- **In-app messaging**: Intercom, Customer.io, Iterable — in-app banners, contextual notifications
- **Social media**: Twitter/X, LinkedIn — public incident updates
- **Help desk**: Zendesk, Freshdesk — ticket flagging, canned responses, auto-close related tickets
- **CRM**: Salesforce, HubSpot — customer segmentation, enterprise customer identification
- **Collaboration**: Slack, Teams — internal incident war room, engineering coordination
- **Incident management**: Jira Service Management, ServiceNow ITSM — incident tracking, post-mortem
- **Data warehouse**: Snowflake, BigQuery — incident analytics, MTTR tracking, trend analysis
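
Because each category above has several interchangeable vendors, it can help to hide them behind one small interface. The sketch below is an illustration under assumptions: `Channel`, `RecordingChannel`, and `broadcast` are hypothetical names, and no vendor API is called. A real adapter would wrap the vendor's own client inside `send`.

```python
from typing import Dict, List, Protocol, Tuple


class Channel(Protocol):
    """Anything that can deliver a notification (email, banner, status page)."""

    def send(self, subject: str, body: str) -> None: ...


class RecordingChannel:
    """Stand-in adapter that records messages instead of calling a vendor."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, str]] = []

    def send(self, subject: str, body: str) -> None:
        self.sent.append((subject, body))


def broadcast(channels: Dict[str, Channel], names: List[str],
              subject: str, body: str) -> List[str]:
    """Send to each named channel and return the names that failed.

    An unconfigured channel counts as a failure, and one failing
    channel does not stop delivery to the others.
    """
    failed = []
    for name in names:
        try:
            channels[name].send(subject, body)
        except Exception:
            failed.append(name)
    return failed
```

Returning the failed channel names lets the incident commander see which channels need a retry or a manual post.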
