Incident postmortem that prevents blame and drives fixes
Prompt
You are a site reliability engineer and technical writer. Write a blameless postmortem for this incident:
[incident_summary]
Follow this structure:
1. Executive summary (what happened, impact, resolution time)
2. Timeline (minute-by-minute from detection to resolution)
3. Root cause analysis (use the "5 Whys" technique)
4. Impact assessment (users affected, revenue impact, SLA breach)
5. What went well (don't skip this — it matters for morale)
6. What went wrong
7. Action items with owners and due dates
8. Lessons learned
Tone: factual, blameless, focused on systems not people. Replace names with roles.
Example output
# Incident Postmortem: 47-Minute API Outage
**Incident ID:** INC-2026-0412
**Date:** April 10, 2026
**Severity:** SEV-1 (customer-facing full outage)
**Author:** On-call SRE
**Status:** Action items in progress
---
## Executive Summary
On April 10, 2026, the CloudBoard API experienced a full outage lasting 47 minutes (14:23 - 15:10 UTC). During this period, all API requests returned 503 errors. The web application, mobile app, and all integrations (Slack, GitHub, etc.) were non-functional.
**Root cause:** A database migration deployed to production dropped a critical index, causing the primary PostgreSQL database to hit 100% CPU under normal query load. The migration passed all staging tests because the staging database has 2% of production data volume.
**Impact:**
- 100% of users affected (approximately 18,400 active sessions)
- Estimated 2,100 failed webhook deliveries to customer integrations
- 0 data loss
- SLA breach: this outage pushed April uptime to 99.87%, below the 99.9% monthly target
**Resolution:** The missing index was recreated, and the database recovered automatically once CPU pressure was relieved.
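The SLA figures above follow from simple error-budget arithmetic. A minimal sketch (assuming a 30-day month; the 99.87% figure also includes earlier downtime in April):

```python
# Error-budget math behind the SLA breach above (30-day month assumed).
MINUTES_IN_APRIL = 30 * 24 * 60          # 43,200 minutes
SLA_TARGET = 0.999                       # 99.9% monthly uptime target

# Allowed downtime ("error budget") for the month at 99.9%.
error_budget_min = MINUTES_IN_APRIL * (1 - SLA_TARGET)   # 43.2 minutes

outage_min = 47
# This single outage already exceeds the entire monthly budget.
uptime_from_this_outage = 1 - outage_min / MINUTES_IN_APRIL  # ~99.89%
```

In other words, a 99.9% monthly SLA leaves only about 43 minutes of downtime for the whole month, so this 47-minute outage breached it on its own.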
---
## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:03 | Database migration `20260410_optimize_tasks` deployed to production via CI/CD pipeline |
| 14:03 | Migration completes successfully (no errors logged) |
| 14:18 | Database CPU begins climbing (crosses 70% threshold — no alert yet) |
| 14:23 | **Database CPU hits 100%.** Connection pool exhausted. API starts returning 503s. |
| 14:23 | PagerDuty fires SEV-1 alert: "API error rate > 50%" |
| 14:25 | On-call SRE acknowledges the page, begins investigation |
| 14:27 | SRE identifies database CPU at 100%. Checks for long-running queries — finds hundreds of sequential scans on the `tasks` table |
| 14:31 | SRE checks recent deployments. Finds migration `20260410_optimize_tasks` ran 20 minutes ago |
| 14:34 | SRE reads the migration file. Discovers it dropped `idx_tasks_user_status` (composite index on `user_id, status`) to "recreate it with better column order" but the CREATE INDEX statement had a typo and silently failed |
| 14:36 | SRE begins recreating the index: `CREATE INDEX CONCURRENTLY idx_tasks_user_status ON tasks (user_id, status)` |
| 14:36 | **Problem:** `CREATE INDEX CONCURRENTLY` cannot run while the database is under extreme load — it times out |
| 14:40 | SRE escalates. Database lead joins the call |
| 14:42 | Decision: temporarily kill the top 50 long-running queries to reduce CPU pressure, then create the index |
| 14:44 | Queries killed. CPU drops to 85% |
| 14:45 | Index creation started (non-concurrent, with brief table lock accepted given the outage) |
| 14:52 | Index creation completes for `tasks` table (8M rows) |
| 14:55 | CPU drops to 35%. API begins recovering |
| 15:00 | Error rate drops below 1%. Most requests succeeding |
| 15:10 | **All clear.** Error rate at 0%. CPU at 22%. Incident resolved. |
| 15:10 | Status page updated: "Resolved" |
**Total outage:** 47 minutes (14:23 - 15:10)
**Time to detect:** 0 minutes (automated alerting)
**Time to resolve:** 47 minutes (25 minutes to diagnose, 22 minutes to fix)
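The 14:42 mitigation, killing the longest-running queries to relieve CPU pressure, can be sketched as follows. This is a hypothetical illustration, not the exact commands run during the incident; it assumes rows shaped like PostgreSQL's `pg_stat_activity` view (pid plus running time), and uses `pg_terminate_backend`, which is the standard PostgreSQL function for ending a backend process.

```python
# Hypothetical sketch of the 14:42 mitigation: given (pid, running_seconds)
# rows as pg_stat_activity would report them, emit pg_terminate_backend()
# statements for the top N longest-running queries.

def terminate_statements(activity_rows, top_n=50):
    """activity_rows: list of (pid, running_seconds) tuples."""
    worst = sorted(activity_rows, key=lambda r: r[1], reverse=True)[:top_n]
    return [f"SELECT pg_terminate_backend({pid});" for pid, _ in worst]

rows = [(101, 900.0), (102, 12.0), (103, 1500.0)]
stmts = terminate_statements(rows, top_n=2)
# Targets pids 103 and 101 — the two longest-running queries.
```

In a real incident the rows would come from `SELECT pid, now() - query_start FROM pg_stat_activity`, and the statements would be run by the database lead, not blindly scripted.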
---
## Root Cause Analysis (5 Whys)
**Why did the API go down?**
→ The PostgreSQL primary hit 100% CPU and exhausted its connection pool.
**Why did CPU hit 100%?**
→ Every query on the `tasks` table became a full sequential scan (reading all 8M rows) instead of using an index.
**Why were queries doing sequential scans?**
→ The composite index `idx_tasks_user_status` was missing. It had been dropped by a migration and not successfully recreated.
**Why wasn't the index recreated?**
→ The migration's CREATE INDEX statement had a typo in the column name (`stauts` instead of `status`). PostgreSQL raised a "column does not exist" error, but the migration framework's error was caught by a generic try/catch that logged a warning but didn't fail the migration.
**Why didn't this get caught before production?**
→ Staging database has ~150K rows in the `tasks` table (vs. 8M in production). The sequential scan on 150K rows took 200ms — within acceptable response times. The query planner didn't even use the index on staging because the table was small enough for a sequential scan to be efficient.
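The swallowed-error failure mode in the fourth "why" can be reproduced in a few lines. This is a hypothetical sketch of the antipattern, not the actual migration framework code; the fix is simply to let DDL errors propagate and fail the migration.

```python
import logging

def run_migration_permissive(steps):
    # Antipattern: the generic try/except that turned a failed
    # CREATE INDEX into a logged warning and a "successful" migration.
    for step in steps:
        try:
            step()
        except Exception as exc:
            logging.warning("migration step failed: %s", exc)  # swallowed!

def run_migration_strict(steps):
    # Fix: any DDL error is fatal and stops the pipeline.
    for step in steps:
        step()  # exceptions propagate

def bad_create_index():
    # Simulates PostgreSQL rejecting the typo'd column name.
    raise RuntimeError('column "stauts" does not exist')

run_migration_permissive([bad_create_index])  # "succeeds" silently

try:
    run_migration_strict([bad_create_index])
    outcome = "succeeded"
except RuntimeError:
    outcome = "failed loudly"  # what should have happened on April 10
```

With the strict runner, the typo would have failed CI or the deploy step in seconds instead of surfacing as a production outage 20 minutes later.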
---
## Impact Assessment
| Metric | Value |
|--------|-------|
| Duration | 47 minutes |
| Users affected | ~18,400 active sessions |
| Failed API requests | ~340,000 |
| Failed webhook deliveries | ~2,100 |
| Data loss | None |
| Revenue impact | ~$1,200 (estimated from churned trial users) |
| SLA status | Breached (99.87% vs 99.9% target for April) |
| Customer support tickets | 47 filed during outage |
---
## What Went Well
1. **Alerting worked immediately.** PagerDuty fired within seconds of the error rate spike. No detection delay.
2. **On-call response was fast.** SRE acknowledged within 2 minutes and had a working hypothesis within 10 minutes.
3. **Status page was updated promptly.** First status update went out at 14:28 (5 minutes after detection).
4. **No data loss.** The database was overloaded but not corrupted. All in-flight writes were preserved.
5. **Communication was clear.** The #incident Slack channel had real-time updates. No confusion about who was doing what.
---
## What Went Wrong
1. **Migration error handling was too permissive.** The try/catch in the migration framework swallowed a critical error and allowed the migration to "succeed" with a missing index.
2. **No index verification step.** The migration dropped and recreated an index but never verified the new index actually existed.
3. **Staging doesn't catch scale-dependent issues.** A 150K-row table behaves fundamentally differently from an 8M-row table.
4. **No alerting on missing indexes.** We alert on CPU and error rates but not on query plan regressions (index scans → sequential scans).
5. **`CREATE INDEX CONCURRENTLY` failed under load.** Our runbook didn't have a fallback procedure for creating indexes during high-load incidents.
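The missing alert in point 4 could be approximated by scanning `EXPLAIN` output for sequential scans on large tables. A minimal sketch, assuming plan text in PostgreSQL's default `EXPLAIN` format and a pre-built map of table sizes (the real action item would likely build on `pg_stat_user_tables` instead):

```python
import re

def seq_scan_regressions(plan_text, table_rows, min_rows=100_000):
    """Flag tables over min_rows that the planner reads with a Seq Scan."""
    flagged = []
    for match in re.finditer(r"Seq Scan on (\w+)", plan_text):
        table = match.group(1)
        if table_rows.get(table, 0) > min_rows:
            flagged.append(table)
    return flagged

plan = "Seq Scan on tasks  (cost=0.00..310000.00 rows=8000000 width=64)"
sizes = {"tasks": 8_000_000, "labels": 5_000}
flagged = seq_scan_regressions(plan, sizes)
# flagged == ["tasks"]: an 8M-row table being sequentially scanned should alert
```

A seq scan on a 5K-row table is normal planner behavior; the threshold is what separates signal from noise, which is exactly why staging (150K rows) never alerted.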
---
## Action Items
| # | Action | Owner | Due Date | Priority |
|---|--------|-------|----------|----------|
| 1 | Migration framework: make all DDL errors fatal (no swallowed exceptions) | Backend Lead | Apr 17 | P0 |
| 2 | Add post-migration index verification step: compare index list before/after | Backend Lead | Apr 17 | P0 |
| 3 | Create a production-scale staging dataset (anonymized 1M+ rows) | Data Eng Lead | Apr 30 | P1 |
| 4 | Add query plan regression alerting: alert when Seq Scan appears on tables > 100K rows | SRE Lead | Apr 24 | P1 |
| 5 | Update incident runbook: add "index creation under load" procedure with fallback steps | On-call SRE | Apr 18 | P1 |
| 6 | Add required CI check: migration files must be reviewed by a database owner | Engineering Manager | Apr 21 | P2 |
| 7 | Evaluate pg_stat_statements monitoring for query performance regression detection | SRE Lead | May 7 | P2 |
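Action item 2, verifying that indexes survive a migration, can be as simple as diffing the index list before and after the migration runs. A hypothetical sketch (the real check would populate the sets from `pg_indexes`):

```python
def verify_indexes(before, after):
    """Fail the deploy if any pre-migration index is missing afterwards.

    before/after: sets of index names, e.g. gathered via
    SELECT indexname FROM pg_indexes WHERE tablename = 'tasks'.
    """
    missing = before - after
    if missing:
        raise RuntimeError(
            f"migration dropped indexes without recreating: {sorted(missing)}"
        )

before = {"tasks_pkey", "idx_tasks_user_status"}
after_bad = {"tasks_pkey"}  # the typo'd CREATE INDEX never ran

try:
    verify_indexes(before, after_bad)
    check = "passed"
except RuntimeError:
    check = "failed"  # this is what should have stopped the deploy
```

Intentional index removals would need an allowlist, but for drop-and-recreate migrations like `20260410_optimize_tasks` this check alone would have caught the outage before it started.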
---
## Lessons Learned
1. **"It worked on staging" is not proof it works at scale.** Staging environments must represent production data volumes for performance-sensitive operations. We're investing in production-scale staging.
2. **Migrations that modify indexes are high-risk.** They should be treated as carefully as schema changes — separate PR, database team review, and ideally a canary deployment.
3. **Silent failures are worse than loud failures.** The swallowed exception turned a 30-second fix (re-run the CREATE INDEX) into a 47-minute outage. Every error should be loud enough to stop the pipeline.
4. **Preparation beats speed.** Having a runbook entry for "how to create an index during an outage" would have saved 10-15 minutes of live decision-making under pressure.