Incident postmortem that prevents blame and drives fixes
Prompt
You are a site reliability engineer and technical writer. Write a blameless postmortem for this incident:
[incident_summary]
Follow this structure:
1. Executive summary (what happened, impact, resolution time)
2. Timeline (minute-by-minute from detection to resolution)
3. Root cause analysis (use the "5 Whys" technique)
4. Impact assessment (users affected, revenue impact, SLA breach)
5. What went well (don't skip this — it matters for morale)
6. What went wrong
7. Action items with owners and due dates
8. Lessons learned
Tone: factual, blameless, focused on systems not people. Replace names with roles.
Example output
# Incident Postmortem: 47-Minute API Outage
**Incident ID:** INC-2026-0412
**Date:** April 10, 2026
**Severity:** SEV-1 (customer-facing full outage)
**Author:** On-call SRE
**Status:** Action items in progress
---
## Executive Summary
On April 10, 2026, the CloudBoard API experienced a full outage lasting 47 minutes (14:23 - 15:10 UTC). During this period, all API requests returned 503 errors. The web application, mobile app, and all integrations (Slack, GitHub, etc.) were non-functional.
**Root cause:** A database migration deployed to production dropped a critical index, causing the primary PostgreSQL database to hit 100% CPU under normal query load. The migration passed all staging tests because the staging database has 2% of production data volume.
**Impact:**
- 100% of users affected (approximately 18,400 active sessions)
- Estimated 2,100 failed webhook deliveries to customer integrations
- 0 data loss
- SLA breach: this outage pushed April uptime to 99.87%, below the 99.9% monthly target
**Resolution:** The missing index was recreated, and the database recovered automatically once CPU pressure was relieved.
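The SLA figures above follow from simple error-budget arithmetic. A minimal sketch (assuming a 30-day month; the 99.87% figure also includes earlier downtime in April):

```python
# Error-budget math behind the SLA breach above (30-day month assumed).
MINUTES_IN_APRIL = 30 * 24 * 60          # 43,200 minutes
SLA_TARGET = 0.999                       # 99.9% monthly uptime target

# Allowed downtime ("error budget") for the month at 99.9%.
error_budget_min = MINUTES_IN_APRIL * (1 - SLA_TARGET)   # 43.2 minutes

outage_min = 47
# This single outage already exceeds the entire monthly budget.
uptime_from_this_outage = 1 - outage_min / MINUTES_IN_APRIL  # ~99.89%
```

In other words, a 99.9% monthly SLA leaves only about 43 minutes of downtime for the whole month, so this 47-minute outage breached it on its own.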
---
## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:03 | Database migration `20260410_optimize_tasks` deployed to production via CI/CD pipeline |
| 14:03 | Migration completes successfully (no errors logged) |
| 14:18 | Database CPU begins climbing (crosses 70% threshold — no alert yet) |
| 14:23 | **Database CPU hits 100%.** Connection pool exhausted. API starts returning 503s. |
| 14:23 | PagerDuty fires SEV-1 alert: "API error rate > 50%" |
| 14:25 | On-call SRE acknowledges the page, begins investigation |
| 14:27 | SRE identifies database CPU at 100%. Checks for long-running queries — finds hundreds of sequential scans on the `tasks` table |
| 14:31 | SRE checks recent deployments. Finds migration `20260410_optimize_tasks` ran 20 minutes ago |
| 14:34 | SRE reads the migration file. Discovers it dropped `idx_tasks_user_status` (composite index on `user_id, status`) to "recreate it with better column order" but the CREATE INDEX statement had a typo and silently failed |
| 14:36 | SRE begins recreating the index: `CREATE INDEX CONCURRENTLY idx_tasks_user_status ON tasks (user_id, status)` |
| 14:36 | **Problem:** `CREATE INDEX CONCURRENTLY` cannot run while the database is under extreme load — it times out |
| 14:40 | SRE escalates. Database lead joins the call |
| 14:42 | Decision: temporarily kill the top 50 long-running queries to reduce CPU pressure, then create the index |
| 14:44 | Queries killed. CPU drops to 85% |
| 14:45 | Index creation started (non-concurrent, with brief table lock accepted given the outage) |
| 14:52 | Index creation completes for `tasks` table (8M rows) |
| 14:55 | CPU drops to 35%. API begins recovering |
| 15:00 | Error rate drops below 1%. Most requests succeeding |
| 15:10 | **All clear.** Error rate at 0%. CPU at 22%. Incident resolved. |
| 15:10 | Status page updated: "Resolved" |
**Total outage:** 47 minutes (14:23 - 15:10)
**Time to detect:** 0 minutes (automated alerting)
**Time to resolve:** 47 minutes (25 minutes to diagnose, 22 minutes to fix)
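The 14:42 mitigation, killing the longest-running queries to relieve CPU pressure, can be sketched as follows. This is a hypothetical illustration, not the exact commands run during the incident; it assumes rows shaped like PostgreSQL's `pg_stat_activity` view (pid plus running time), and uses `pg_terminate_backend`, which is the standard PostgreSQL function for ending a backend process.

```python
# Hypothetical sketch of the 14:42 mitigation: given (pid, running_seconds)
# rows as pg_stat_activity would report them, emit pg_terminate_backend()
# statements for the top N longest-running queries.

def terminate_statements(activity_rows, top_n=50):
    """activity_rows: list of (pid, running_seconds) tuples."""
    worst = sorted(activity_rows, key=lambda r: r[1], reverse=True)[:top_n]
    return [f"SELECT pg_terminate_backend({pid});" for pid, _ in worst]

rows = [(101, 900.0), (102, 12.0), (103, 1500.0)]
stmts = terminate_statements(rows, top_n=2)
# Targets pids 103 and 101 — the two longest-running queries.
```

In a real incident the rows would come from `SELECT pid, now() - query_start FROM pg_stat_activity`, and the statements would be run by the database lead, not blindly scripted.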
---
## Root Cause Analysis (5 Whys)
**Why did the API go down?**
→ The PostgreSQL primary hit 100% CPU and exhausted its connection pool.
**Why did CPU hit 100%?**
→ Every query on the `tasks` table became a full sequential scan (reading all 8M rows) instead of using an index.
**Why were queries doing sequential scans?**
→ The composite index `idx_tasks_user_status` was missing. It had been dropped by a migration and not successfully recreated.
**Why wasn't the index recreated?**
→ The migration's CREATE INDEX statement had a typo in the column name (`stauts` instead of `status`). PostgreSQL raised a "column does not exist" error, but the migration framework's error was caught by a generic try/catch that logged a warning but didn't fail the migration.
**Why didn't this get caught before production?**
→ Staging database has ~150K rows in the `tasks` table (vs. 8M in production). The sequential scan on 150K rows took 200ms — within acceptable response times. The query planner didn't even use the index on staging because the table was small enough for a sequential scan to be efficient.
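The swallowed-error failure mode in the fourth "why" can be reproduced in a few lines. This is a hypothetical sketch of the antipattern, not the actual migration framework code; the fix is simply to let DDL errors propagate and fail the migration.

```python
import logging

def run_migration_permissive(steps):
    # Antipattern: the generic try/except that turned a failed
    # CREATE INDEX into a logged warning and a "successful" migration.
    for step in steps:
        try:
            step()
        except Exception as exc:
            logging.warning("migration step failed: %s", exc)  # swallowed!

def run_migration_strict(steps):
    # Fix: any DDL error is fatal and stops the pipeline.
    for step in steps:
        step()  # exceptions propagate

def bad_create_index():
    # Simulates PostgreSQL rejecting the typo'd column name.
    raise RuntimeError('column "stauts" does not exist')

run_migration_permissive([bad_create_index])  # "succeeds" silently

try:
    run_migration_strict([bad_create_index])
    outcome = "succeeded"
except RuntimeError:
    outcome = "failed loudly"  # what should have happened on April 10
```

With the strict runner, the typo would have failed CI or the deploy step in seconds instead of surfacing as a production outage 20 minutes later.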
---
## Impact Assessment
| Metric | Value |
|--------|-------|
| Duration | 47 minutes |
| Users affected | ~18,400 active sessions |
| Failed API requests | ~340,000 |
| Failed webhook deliveries | ~2,100 |
| Data loss | None |
| Revenue impact | ~$1,200 (estimated from churned trial users) |
| SLA status | Breached (99.87% vs 99.9% target for April) |
| Customer support tickets | 47 filed during outage |
---
## What Went Well
1. **Alerting worked immediately.** PagerDuty fired within seconds of the error rate spike. No detection delay.
2. **On-call response was fast.** SRE acknowledged within 2 minutes and had a working hypothesis within 10 minutes.
3. **Status page was updated promptly.** First status update went out at 14:28 (5 minutes after detection).
4. **No data loss.** The database was overloaded but not corrupted. All in-flight writes were preserved.
5. **Communication was clear.** The #incident Slack channel had real-time updates. No confusion about who was doing what.
---
## What Went Wrong
1. **Migration error handling was too permissive.** The try/catch in the migration framework swallowed a critical error and allowed the migration to "succeed" with a missing index.
2. **No index verification step.** The migration dropped and recreated an index but never verified the new index actually existed.
3. **Staging doesn't catch scale-dependent issues.** A 150K-row table behaves fundamentally differently from an 8M-row table.
4. **No alerting on missing indexes.** We alert on CPU and error rates but not on query plan regressions (index scans → sequential scans).
5. **`CREATE INDEX CONCURRENTLY` failed under load.** Our runbook didn't have a fallback procedure for creating indexes during high-load incidents.
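The missing alert in point 4 could be approximated by scanning `EXPLAIN` output for sequential scans on large tables. A minimal sketch, assuming plan text in PostgreSQL's default `EXPLAIN` format and a pre-built map of table sizes (the real action item would likely build on `pg_stat_user_tables` instead):

```python
import re

def seq_scan_regressions(plan_text, table_rows, min_rows=100_000):
    """Flag tables over min_rows that the planner reads with a Seq Scan."""
    flagged = []
    for match in re.finditer(r"Seq Scan on (\w+)", plan_text):
        table = match.group(1)
        if table_rows.get(table, 0) > min_rows:
            flagged.append(table)
    return flagged

plan = "Seq Scan on tasks  (cost=0.00..310000.00 rows=8000000 width=64)"
sizes = {"tasks": 8_000_000, "labels": 5_000}
flagged = seq_scan_regressions(plan, sizes)
# flagged == ["tasks"]: an 8M-row table being sequentially scanned should alert
```

A seq scan on a 5K-row table is normal planner behavior; the threshold is what separates signal from noise, which is exactly why staging (150K rows) never alerted.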
---
## Action Items
| # | Action | Owner | Due Date | Priority |
|---|--------|-------|----------|----------|
| 1 | Migration framework: make all DDL errors fatal (no swallowed exceptions) | Backend Lead | Apr 17 | P0 |
| 2 | Add post-migration index verification step: compare index list before/after | Backend Lead | Apr 17 | P0 |
| 3 | Create a production-scale staging dataset (anonymized 1M+ rows) | Data Eng Lead | Apr 30 | P1 |
| 4 | Add query plan regression alerting: alert when Seq Scan appears on tables > 100K rows | SRE Lead | Apr 24 | P1 |
| 5 | Update incident runbook: add "index creation under load" procedure with fallback steps | On-call SRE | Apr 18 | P1 |
| 6 | Add required CI check: migration files must be reviewed by a database owner | Engineering Manager | Apr 21 | P2 |
| 7 | Evaluate pg_stat_statements monitoring for query performance regression detection | SRE Lead | May 7 | P2 |
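Action item 2, verifying that indexes survive a migration, can be as simple as diffing the index list before and after the migration runs. A hypothetical sketch (the real check would populate the sets from `pg_indexes`):

```python
def verify_indexes(before, after):
    """Fail the deploy if any pre-migration index is missing afterwards.

    before/after: sets of index names, e.g. gathered via
    SELECT indexname FROM pg_indexes WHERE tablename = 'tasks'.
    """
    missing = before - after
    if missing:
        raise RuntimeError(
            f"migration dropped indexes without recreating: {sorted(missing)}"
        )

before = {"tasks_pkey", "idx_tasks_user_status"}
after_bad = {"tasks_pkey"}  # the typo'd CREATE INDEX never ran

try:
    verify_indexes(before, after_bad)
    check = "passed"
except RuntimeError:
    check = "failed"  # this is what should have stopped the deploy
```

Intentional index removals would need an allowlist, but for drop-and-recreate migrations like `20260410_optimize_tasks` this check alone would have caught the outage before it started.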
---
## Lessons Learned
1. **"It worked on staging" is not proof it works at scale.** Staging environments must represent production data volumes for performance-sensitive operations. We're investing in production-scale staging.
2. **Migrations that modify indexes are high-risk.** They should be treated as carefully as schema changes — separate PR, database team review, and ideally a canary deployment.
3. **Silent failures are worse than loud failures.** The swallowed exception turned a 30-second fix (re-run the CREATE INDEX) into a 47-minute outage. Every error should be loud enough to stop the pipeline.
4. **Preparation beats speed.** Having a runbook entry for "how to create an index during an outage" would have saved 10-15 minutes of live decision-making under pressure.