Zero-downtime database migration strategy with rollback plan
Prompt
You are a database migration specialist. We need to perform a [migration_type] on a [database_type] database in production. The table has [row_count] rows and the application serves [traffic_level] requests per second.
Design a zero-downtime migration strategy that includes:
1. Pre-migration checklist
2. Step-by-step migration plan with estimated timing
3. Application code changes needed (dual-write/dual-read pattern)
4. Rollback procedure at every step
5. Monitoring and alerting during migration
6. Post-migration cleanup
Assume the application is deployed on Kubernetes with a NestJS backend.
Example output
# Zero-Downtime Migration Plan
## Migration: Rename column `username` → `handle` on users table
**Database:** PostgreSQL 16
**Table size:** 8M rows (~2.4 GB)
**Traffic:** 2,000 req/s peak
**Estimated total time:** 4-6 hours (spread across 3 deploy phases)
---
## Pre-Migration Checklist
- [ ] **Backup verified**: pg_dump of users table completed and restored to staging successfully
- [ ] **Staging test**: Full migration rehearsed on staging with production-size data clone
- [ ] **Timing measured**: Staging migration completed in [X] minutes — extrapolate to production
- [ ] **Disk space**: At least 3x table size free (for index rebuilds + WAL)
- [ ] **Replication lag**: Verify replicas are < 100ms behind primary
- [ ] **Connection pool**: Increase max connections by 20% (migration queries take longer)
- [ ] **Monitoring dashboards open**: Query duration p99, error rate, replication lag, disk I/O
- [ ] **PagerDuty override**: Migration engineer is primary on-call for 6 hours
- [ ] **Rollback branch ready**: Git branch with rollback code changes, tested on staging
- [ ] **Communication**: #engineering Slack channel notified, status page updated
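The "Timing measured" item can be turned into a quick production estimate. A rough sketch, assuming backfill time scales roughly linearly with row count; the function name and the 1.5x safety factor are illustrative, and real runs vary with hardware, indexes, and concurrent load:

```typescript
// Extrapolates a staging backfill duration to production size.
// Treat the result as a floor, not a guarantee.
function extrapolateMinutes(
  stagingMinutes: number,
  stagingRows: number,
  productionRows: number,
  safetyFactor = 1.5, // pad for production load
): number {
  const perRow = stagingMinutes / stagingRows;
  return Math.ceil(perRow * productionRows * safetyFactor);
}
```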
---
## Phase 1: Expand (Add New Column)
**Duration:** ~15 minutes | **Risk:** Low | **Downtime:** None
### Step 1.1: Add the new column
```sql
-- This is instant in PostgreSQL (no table rewrite for nullable column)
ALTER TABLE users ADD COLUMN handle VARCHAR(255);
```
**Time:** < 1 second (metadata-only change)
### Step 1.2: Create index concurrently
```sql
-- CONCURRENTLY: does not lock the table
CREATE UNIQUE INDEX CONCURRENTLY idx_users_handle ON users (handle)
WHERE handle IS NOT NULL;
```
**Time:** 5-10 minutes for 8M rows
### Step 1.3: Backfill data
```sql
-- Batch update to avoid long-running transactions
-- Process in chunks of 10,000 rows
DO $$
-- Requires PostgreSQL 11+ and must run outside an explicit transaction
-- block so that COMMIT between batches is allowed.
DECLARE
  batch_size INT := 10000;
  updated INT;
BEGIN
  LOOP
    UPDATE users
    SET handle = username
    WHERE handle IS NULL
      AND id IN (
        SELECT id FROM users WHERE handle IS NULL LIMIT batch_size
        FOR UPDATE SKIP LOCKED
      );
    GET DIAGNOSTICS updated = ROW_COUNT;
    RAISE NOTICE 'Updated % rows', updated;
    EXIT WHEN updated = 0;
    PERFORM pg_sleep(0.1); -- Brief pause to reduce replication lag
    COMMIT;
  END LOOP;
END $$;
```
**Time:** 15-30 minutes (800 batches × ~2s each with pauses)
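The same chunked backfill can also be driven from the application side, which makes it easier to pause when monitoring alerts fire. A minimal sketch, assuming a `runBatch` callback that issues the batched `UPDATE` from Step 1.3 and returns the number of rows it touched; the names are illustrative, not part of the plan:

```typescript
// Runs backfill batches until one updates zero rows.
// `pauseMs` mirrors the pg_sleep(0.1) pause between batches.
async function backfill(
  runBatch: () => Promise<number>,
  pauseMs = 100,
): Promise<number> {
  let total = 0;
  for (;;) {
    const updated = await runBatch();
    total += updated;
    if (updated === 0) break; // nothing left with handle IS NULL
    await new Promise((r) => setTimeout(r, pauseMs)); // ease replication lag
  }
  return total;
}
```

Driving the loop from the app also means progress survives in ordinary application logs and the loop can be stopped with a normal process signal.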
### Rollback 1: Drop the new column
```sql
ALTER TABLE users DROP COLUMN handle;
```
---
## Phase 2: Migrate Code (Dual-Write)
**Duration:** 1 deploy cycle | **Risk:** Medium | **Downtime:** None
### Step 2.1: Deploy dual-write code
Application writes to BOTH `username` and `handle` on every update:
```typescript
// users.service.ts — Phase 2 (dual-write)
async updateUsername(userId: string, newUsername: string) {
  return this.prisma.user.update({
    where: { id: userId },
    data: {
      username: newUsername,
      handle: newUsername, // dual-write
    },
  });
}

// Read from OLD column (username) — still the source of truth
async findByUsername(username: string) {
  return this.prisma.user.findUnique({
    where: { username },
  });
}
```
### Step 2.2: Verify dual-write is working
```sql
-- After 24 hours, check for mismatches.
-- IS DISTINCT FROM also counts rows where handle is still NULL,
-- which a plain != comparison would silently skip.
SELECT COUNT(*) FROM users WHERE username IS DISTINCT FROM handle;
-- Should be 0 (or very low — only rows updated between backfill and deploy)
```
### Step 2.3: Fix any mismatches
```sql
UPDATE users SET handle = username WHERE username IS DISTINCT FROM handle;
```
### Rollback 2: Remove dual-write code, deploy
Revert the code change. The `handle` column stays but is ignored.
---
## Phase 3: Contract (Switch Reads, Drop Old Column)
**Duration:** 2 deploy cycles | **Risk:** Medium-High | **Downtime:** None
### Step 3.1: Deploy code that reads from `handle`
```typescript
// users.service.ts — Phase 3 (read from new, write to both)
async findByHandle(handle: string) {
  return this.prisma.user.findFirst({
    where: { handle },
  });
}
```
### Step 3.2: Monitor for 24-48 hours
- Error rate should remain flat
- Query performance on `handle` index should match `username` index
- No application errors referencing the old column name
### Step 3.3: Deploy code that ONLY writes to `handle`
Remove all references to `username` in application code.
### Step 3.4: Drop the old column (after 1 week of stability)
```sql
-- Final cleanup
ALTER TABLE users DROP COLUMN username;
-- Rename index if desired
ALTER INDEX idx_users_handle RENAME TO idx_users_handle_unique;
```
### Rollback 3: If issues after switching reads
Deploy the Phase 2 code (reads from `username`, writes to both). This is safe as long as Step 3.3 has not shipped; once writes to `username` stop, the old column goes stale and must be re-synced from `handle` before reads can revert.
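If a full deploy cycle is too slow for Rollback 3, the read switch can sit behind a runtime flag so reads revert instantly. This is optional hardening the plan does not require; the flag source and names below are hypothetical:

```typescript
// Picks which column the lookup filters on at runtime. Flipping the
// flag reverts reads to `username` without a redeploy; writes stay dual.
type UserWhere = { handle: string } | { username: string };

function readTarget(value: string, readFromHandle: boolean): UserWhere {
  return readFromHandle ? { handle: value } : { username: value };
}
```

The returned object can be passed straight into a Prisma `where` clause, so the service method itself does not change between phases.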
---
## Monitoring During Migration
### Alerts to set:
| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| Query p99 latency | > 200ms | > 500ms | Pause backfill batches |
| Error rate (5xx) | > 0.1% | > 1% | Halt migration, investigate |
| Replication lag | > 1s | > 5s | Pause backfill, reduce batch size |
| Disk I/O wait | > 30% | > 60% | Pause backfill |
| Connection pool usage | > 80% | > 95% | Increase pool, pause migration |
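The thresholds above can drive the backfill loop automatically instead of relying on a human watching dashboards. A sketch encoding the table's pause/halt conditions; threshold values are copied from the table, and the metric field names are illustrative:

```typescript
// Decides whether the backfill should continue, pause, or halt,
// using the warning/critical thresholds from the alert table.
interface Metrics {
  p99LatencyMs: number;
  errorRatePct: number; // 5xx rate
  replicationLagS: number;
  diskIoWaitPct: number;
  poolUsagePct: number;
}

type Action = "continue" | "pause" | "halt";

function backfillAction(m: Metrics): Action {
  if (m.errorRatePct > 1) return "halt"; // critical: stop and investigate
  if (
    m.p99LatencyMs > 200 ||
    m.errorRatePct > 0.1 ||
    m.replicationLagS > 1 ||
    m.diskIoWaitPct > 30 ||
    m.poolUsagePct > 80
  ) {
    return "pause"; // any warning threshold crossed: pause batches
  }
  return "continue";
}
```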
### Real-time monitoring commands:
```sql
-- Check migration progress
SELECT COUNT(*) FILTER (WHERE handle IS NOT NULL) AS migrated,
       COUNT(*) AS total,
       ROUND(100.0 * COUNT(*) FILTER (WHERE handle IS NOT NULL) / COUNT(*), 1) AS pct
FROM users;
-- Check for long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '30 seconds';
-- Check replication lag
SELECT client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
```
---
## Post-Migration Cleanup (1 week after Phase 3)
- [ ] Drop old column and unused indexes
- [ ] Update Prisma schema and run `prisma migrate deploy`
- [ ] Remove dual-write code (if still present)
- [ ] Update API docs and OpenAPI spec
- [ ] Archive migration runbook
- [ ] Retrospective: document actual timings vs estimates