Zero-downtime database migration strategy with rollback plan
Prompt
You are a database migration specialist. We need to perform a [migration_type] on a [database_type] database in production. The table has [row_count] rows and the application serves [traffic_level] requests per second.
Design a zero-downtime migration strategy that includes:
1. Pre-migration checklist
2. Step-by-step migration plan with estimated timing
3. Application code changes needed (dual-write/dual-read pattern)
4. Rollback procedure at every step
5. Monitoring and alerting during migration
6. Post-migration cleanup
Assume the application is deployed on Kubernetes with a NestJS backend.
Example output
# Zero-Downtime Migration Plan
## Migration: Rename column `username` → `handle` on users table
**Database:** PostgreSQL 16
**Table size:** 8M rows (~2.4 GB)
**Traffic:** 2,000 req/s peak
**Estimated total time:** 4-6 hours (spread across 3 deploy phases)
---
## Pre-Migration Checklist
- [ ] **Backup verified**: pg_dump of users table completed and restored to staging successfully
- [ ] **Staging test**: Full migration rehearsed on staging with production-size data clone
- [ ] **Timing measured**: Staging migration completed in [X] minutes — extrapolate to production
- [ ] **Disk space**: At least 3x table size free (for index rebuilds + WAL)
- [ ] **Replication lag**: Verify replicas are < 100ms behind primary
- [ ] **Connection pool**: Increase max connections by 20% (migration queries take longer)
- [ ] **Monitoring dashboards open**: Query duration p99, error rate, replication lag, disk I/O
- [ ] **PagerDuty override**: Migration engineer is primary on-call for 6 hours
- [ ] **Rollback branch ready**: Git branch with rollback code changes, tested on staging
- [ ] **Communication**: #engineering Slack channel notified, status page updated
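The "Timing measured" item can be turned into a quick production estimate. A rough sketch, assuming backfill time scales roughly linearly with row count; the function name and the 1.5x safety factor are illustrative, and real runs vary with hardware, indexes, and concurrent load:

```typescript
// Extrapolates a staging backfill duration to production size.
// Treat the result as a floor, not a guarantee.
function extrapolateMinutes(
  stagingMinutes: number,
  stagingRows: number,
  productionRows: number,
  safetyFactor = 1.5, // pad for production load
): number {
  const perRow = stagingMinutes / stagingRows;
  return Math.ceil(perRow * productionRows * safetyFactor);
}
```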
---
## Phase 1: Expand (Add New Column)
**Duration:** ~15 minutes | **Risk:** Low | **Downtime:** None
### Step 1.1: Add the new column
```sql
-- This is instant in PostgreSQL (no table rewrite for nullable column)
ALTER TABLE users ADD COLUMN handle VARCHAR(255);
```
**Time:** < 1 second (metadata-only change)
### Step 1.2: Create index concurrently
```sql
-- CONCURRENTLY: does not lock the table
CREATE UNIQUE INDEX CONCURRENTLY idx_users_handle ON users (handle)
WHERE handle IS NOT NULL;
```
**Time:** 5-10 minutes for 8M rows
### Step 1.3: Backfill data
```sql
-- Batch update to avoid long-running transactions
-- Process in chunks of 10,000 rows
DO $$
-- Requires PostgreSQL 11+ and must run outside an explicit transaction
-- block so that COMMIT between batches is allowed.
DECLARE
  batch_size INT := 10000;
  updated INT;
BEGIN
  LOOP
    UPDATE users
    SET handle = username
    WHERE handle IS NULL
      AND id IN (
        SELECT id FROM users WHERE handle IS NULL LIMIT batch_size
        FOR UPDATE SKIP LOCKED
      );
    GET DIAGNOSTICS updated = ROW_COUNT;
    RAISE NOTICE 'Updated % rows', updated;
    EXIT WHEN updated = 0;
    PERFORM pg_sleep(0.1); -- Brief pause to reduce replication lag
    COMMIT;
  END LOOP;
END $$;
```
**Time:** 15-30 minutes (800 batches × ~2s each with pauses)
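The same chunked backfill can also be driven from the application side, which makes it easier to pause when monitoring alerts fire. A minimal sketch, assuming a `runBatch` callback that issues the batched `UPDATE` from Step 1.3 and returns the number of rows it touched; the names are illustrative, not part of the plan:

```typescript
// Runs backfill batches until one updates zero rows.
// `pauseMs` mirrors the pg_sleep(0.1) pause between batches.
async function backfill(
  runBatch: () => Promise<number>,
  pauseMs = 100,
): Promise<number> {
  let total = 0;
  for (;;) {
    const updated = await runBatch();
    total += updated;
    if (updated === 0) break; // nothing left with handle IS NULL
    await new Promise((r) => setTimeout(r, pauseMs)); // ease replication lag
  }
  return total;
}
```

Driving the loop from the app also means progress survives in ordinary application logs and the loop can be stopped with a normal process signal.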
### Rollback 1: Drop the new column
```sql
ALTER TABLE users DROP COLUMN handle;
```
---
## Phase 2: Migrate Code (Dual-Write)
**Duration:** 1 deploy cycle | **Risk:** Medium | **Downtime:** None
### Step 2.1: Deploy dual-write code
Application writes to BOTH `username` and `handle` on every update:
```typescript
// users.service.ts — Phase 2 (dual-write)
async updateUsername(userId: string, newUsername: string) {
  return this.prisma.user.update({
    where: { id: userId },
    data: {
      username: newUsername,
      handle: newUsername, // dual-write
    },
  });
}

// Read from OLD column (username) — still the source of truth
async findByUsername(username: string) {
  return this.prisma.user.findUnique({
    where: { username },
  });
}
```
### Step 2.2: Verify dual-write is working
```sql
-- After 24 hours, check for mismatches.
-- IS DISTINCT FROM also counts rows where handle is still NULL,
-- which a plain != comparison would silently skip.
SELECT COUNT(*) FROM users WHERE username IS DISTINCT FROM handle;
-- Should be 0 (or very low — only rows updated between backfill and deploy)
```
### Step 2.3: Fix any mismatches
```sql
UPDATE users SET handle = username WHERE username IS DISTINCT FROM handle;
```
### Rollback 2: Remove dual-write code, deploy
Revert the code change. The `handle` column stays but is ignored.
---
## Phase 3: Contract (Switch Reads, Drop Old Column)
**Duration:** 2 deploy cycles | **Risk:** Medium-High | **Downtime:** None
### Step 3.1: Deploy code that reads from `handle`
```typescript
// users.service.ts — Phase 3 (read from new, write to both)
async findByHandle(handle: string) {
  return this.prisma.user.findFirst({
    where: { handle },
  });
}
```
### Step 3.2: Monitor for 24-48 hours
- Error rate should remain flat
- Query performance on `handle` index should match `username` index
- No application errors referencing the old column name
### Step 3.3: Deploy code that ONLY writes to `handle`
Remove all references to `username` in application code.
### Step 3.4: Drop the old column (after 1 week of stability)
```sql
-- Final cleanup
ALTER TABLE users DROP COLUMN username;
-- Rename index if desired
ALTER INDEX idx_users_handle RENAME TO idx_users_handle_unique;
```
### Rollback 3: If issues after switching reads
Deploy the Phase 2 code (reads from `username`, writes to both). This is safe as long as Step 3.3 has not shipped; once writes to `username` stop, the old column goes stale and must be re-synced from `handle` before reads can revert.
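If a full deploy cycle is too slow for Rollback 3, the read switch can sit behind a runtime flag so reads revert instantly. This is optional hardening the plan does not require; the flag source and names below are hypothetical:

```typescript
// Picks which column the lookup filters on at runtime. Flipping the
// flag reverts reads to `username` without a redeploy; writes stay dual.
type UserWhere = { handle: string } | { username: string };

function readTarget(value: string, readFromHandle: boolean): UserWhere {
  return readFromHandle ? { handle: value } : { username: value };
}
```

The returned object can be passed straight into a Prisma `where` clause, so the service method itself does not change between phases.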
---
## Monitoring During Migration
### Alerts to set:
| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| Query p99 latency | > 200ms | > 500ms | Pause backfill batches |
| Error rate (5xx) | > 0.1% | > 1% | Halt migration, investigate |
| Replication lag | > 1s | > 5s | Pause backfill, reduce batch size |
| Disk I/O wait | > 30% | > 60% | Pause backfill |
| Connection pool usage | > 80% | > 95% | Increase pool, pause migration |
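The thresholds above can drive the backfill loop automatically instead of relying on a human watching dashboards. A sketch encoding the table's pause/halt conditions; threshold values are copied from the table, and the metric field names are illustrative:

```typescript
// Decides whether the backfill should continue, pause, or halt,
// using the warning/critical thresholds from the alert table.
interface Metrics {
  p99LatencyMs: number;
  errorRatePct: number; // 5xx rate
  replicationLagS: number;
  diskIoWaitPct: number;
  poolUsagePct: number;
}

type Action = "continue" | "pause" | "halt";

function backfillAction(m: Metrics): Action {
  if (m.errorRatePct > 1) return "halt"; // critical: stop and investigate
  if (
    m.p99LatencyMs > 200 ||
    m.errorRatePct > 0.1 ||
    m.replicationLagS > 1 ||
    m.diskIoWaitPct > 30 ||
    m.poolUsagePct > 80
  ) {
    return "pause"; // any warning threshold crossed: pause batches
  }
  return "continue";
}
```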
### Real-time monitoring commands:
```sql
-- Check migration progress
SELECT COUNT(*) FILTER (WHERE handle IS NOT NULL) AS migrated,
       COUNT(*) AS total,
       ROUND(100.0 * COUNT(*) FILTER (WHERE handle IS NOT NULL) / COUNT(*), 1) AS pct
FROM users;
-- Check for long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '30 seconds';
-- Check replication lag
SELECT client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
```
---
## Post-Migration Cleanup (1 week after Phase 3)
- [ ] Drop old column and unused indexes
- [ ] Update Prisma schema and run `prisma migrate deploy`
- [ ] Remove dual-write code (if still present)
- [ ] Update API docs and OpenAPI spec
- [ ] Archive migration runbook
- [ ] Retrospective: document actual timings vs estimates