Write a complete technical RFC document from a rough problem statement
Prompt
You are a principal engineer who has authored 50+ RFCs at a major tech company. Write a technical RFC (Request for Comments) document based on the following inputs.
Problem statement: "[problem_statement]"
System context: [system_context]
Constraints: [constraints]
RFC structure:
1. Title and metadata (authors, status, date, reviewers)
2. Summary (one paragraph)
3. Motivation — why this matters now
4. Detailed design
5. Alternatives considered (at least 2, with pros/cons)
6. Migration strategy
7. Observability and monitoring
8. Security considerations
9. Open questions
10. Timeline and milestones
Be technically precise. Include diagrams described in text where helpful. Assume reviewers are senior engineers who will challenge hand-wavy claims.
Example output
# RFC-2024-031: Migrate Session Storage from PostgreSQL to Redis
**Authors:** Platform Team
**Status:** Draft
**Created:** June 28, 2024
**Reviewers:** Backend Lead, SRE Lead, Security Lead
**Decision deadline:** July 12, 2024
---
## Summary
This RFC proposes migrating user session storage from PostgreSQL to Redis to reduce p99 authentication latency from 340ms to under 50ms, eliminate session-table bloat (currently 47GB and growing 2GB/month), and unblock the real-time features roadmap that requires sub-10ms session lookups. The migration uses a dual-write strategy with a 2-week shadow period to validate correctness before full cutover.
---
## Motivation
### The Problem
Our session table in PostgreSQL has grown to 47GB with 180M rows. Despite indexing, session lookup queries now take 12ms at p50 and 340ms at p99 due to table bloat and vacuum contention. This directly impacts every authenticated API request because the auth middleware hits this table on every call.
### Why Now
1. **Real-time features (Q3 roadmap):** The upcoming collaborative editing feature requires session validation on every WebSocket frame. At current latency, this adds 340ms of overhead per message at p99 — unacceptable.
2. **Database load:** Session queries account for 38% of all PostgreSQL read traffic. Moving them frees significant headroom.
3. **Cost:** We're scaling up the primary DB instance specifically to handle session load. Redis is 4x cheaper for this access pattern.
### Success Criteria
- p99 session lookup < 50ms (currently 340ms)
- PostgreSQL read IOPS drops by 30%+
- Zero session-related incidents during migration
- No user-facing downtime
---
## Detailed Design
### Data Model
Current PostgreSQL schema:
```
sessions (
  id             UUID PRIMARY KEY,
  user_id        UUID NOT NULL,
  token_hash     VARCHAR(64) NOT NULL,
  device_info    JSONB,
  created_at     TIMESTAMPTZ,
  expires_at     TIMESTAMPTZ,
  last_active_at TIMESTAMPTZ
)
```
Proposed Redis structure:
```
Key: session:{token_hash}
Value: Hash {
  user_id, device_info, created_at, last_active_at
}
TTL: 7 days (matches current session expiry)
```
Secondary index for user session listing:
```
Key: user_sessions:{user_id}
Value: Set of token_hash values
TTL: 7 days
```
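To make the proposed structure concrete, here is a minimal sketch of session creation against this data model. The `FakeRedis` class and `create_session` helper are illustrative stand-ins, not part of the design; with redis-py, `FakeRedis()` would be `redis.Redis(...)` and the same `hset`/`expire`/`sadd` calls apply.

```python
import time

class FakeRedis:
    """Tiny in-memory stand-in for the three redis-py calls used below (no TTL enforcement)."""
    def __init__(self):
        self.hashes, self.sets, self.ttls = {}, {}, {}
    def hset(self, key, mapping):
        self.hashes.setdefault(key, {}).update(mapping)
    def expire(self, key, seconds):
        self.ttls[key] = seconds
    def sadd(self, key, member):
        self.sets.setdefault(key, set()).add(member)

SESSION_TTL = 7 * 24 * 3600  # 7 days, matching the current session expiry

def create_session(r, token_hash, user_id, device_info):
    # session:{token_hash} -> hash of session fields, expiring with the session
    key = f"session:{token_hash}"
    now = int(time.time())
    r.hset(key, mapping={
        "user_id": user_id,
        "device_info": device_info,
        "created_at": now,
        "last_active_at": now,
    })
    r.expire(key, SESSION_TTL)
    # secondary index: user_sessions:{user_id} -> set of token hashes
    idx = f"user_sessions:{user_id}"
    r.sadd(idx, token_hash)
    r.expire(idx, SESSION_TTL)
```

Note that `expires_at` disappears from the value: the key's TTL replaces it.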
### Architecture
```
[API Request] → [Auth Middleware]
                      │
              ┌───────┴───────┐
              │ Phase 1:      │
              │ Read Redis    │──miss──→ [Read Postgres] ──→ [Backfill Redis]
              │ first         │
              └───────┬───────┘
                      │ hit
                      ▼
            [Continue to handler]
```
### Write Path
During dual-write phase:
1. Write to PostgreSQL (source of truth during the shadow period); if this fails, fail the request
2. Write to Redis synchronously; on failure, log and alert rather than failing the request, since the read path's PostgreSQL fallback covers the gap
3. After shadow period: stop PostgreSQL writes entirely, at which point a Redis write failure fails the request
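The fail-the-request-on-primary, best-effort-on-secondary pattern above can be sketched store-agnostically (helper names are hypothetical; which store plays primary changes across migration phases):

```python
def dual_write(primary_write, secondary_write, session, on_secondary_error):
    """Write to the source-of-truth store first; a failure there fails the request."""
    primary_write(session)       # any exception propagates and fails the request
    try:
        secondary_write(session)  # best-effort: a miss here is caught later by
    except Exception as exc:      # shadow-read consistency monitoring
        on_secondary_error(exc)   # log + alert, never fail the request
```

The asymmetry is deliberate: only the store currently serving reads may veto a request.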
### Read Path
1. Check Redis first
2. On Redis miss: check PostgreSQL, backfill Redis if found
3. After shadow period: Redis-only reads, PostgreSQL becomes cold archive
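A minimal sketch of this read-through pattern (function names are illustrative, not the actual middleware API):

```python
def get_session(redis_get, pg_get, redis_backfill, token_hash):
    """Redis-first lookup with PostgreSQL fallback and cache backfill."""
    session = redis_get(token_hash)
    if session is not None:
        return session                       # hot path: Redis hit
    session = pg_get(token_hash)             # miss: fall back to PostgreSQL
    if session is not None:
        redis_backfill(token_hash, session)  # warm Redis so the next read hits
    return session                           # None means invalid/expired session
```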
### Session Invalidation
- Single session logout: delete `session:{hash}` + remove from `user_sessions:{user_id}`
- "Logout all devices": iterate `user_sessions:{user_id}` set, delete each session key, delete the set
- Password change: same as "logout all" (existing behavior preserved)
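The "logout all devices" flow can be sketched as follows (in-memory stand-in client; with redis-py the same `smembers`/`delete` calls apply, ideally batched in a pipeline, since SMEMBERS-then-DELETE is not atomic):

```python
class FakeRedis:
    """In-memory stand-in for the two redis-py calls used below."""
    def __init__(self):
        self.data = {}
    def smembers(self, key):
        return set(self.data.get(key, set()))
    def delete(self, key):
        self.data.pop(key, None)

def logout_all_devices(r, user_id):
    idx = f"user_sessions:{user_id}"
    for token_hash in r.smembers(idx):    # every live session for this user
        r.delete(f"session:{token_hash}")
    r.delete(idx)                         # drop the index set itself
```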
---
## Alternatives Considered
### Alternative A: Keep PostgreSQL, Add Read Replica + Caching Layer
**Pros:** No migration risk; simpler architecture; keeps ACID guarantees.
**Cons:** Doesn't solve table bloat (still need periodic cleanup); read replica adds 50-200ms replication lag which can cause stale session reads; caching layer adds complexity without removing the root cause.
**Verdict:** Rejected — treats symptoms, not the cause.
### Alternative B: Use DynamoDB Instead of Redis
**Pros:** Fully managed; built-in TTL; no cluster management; potentially lower ops burden.
**Cons:** p99 latency is 15-25ms (vs. Redis sub-5ms); introduces a new data store to the stack; team has no DynamoDB expertise; vendor lock-in; cost is 2x Redis for our access pattern.
**Verdict:** Rejected — Redis meets latency requirements better and team already operates Redis for caching.
---
## Migration Strategy
### Phase 1: Dual-Write (Week 1–2)
- Deploy dual-write code behind feature flag
- All new sessions written to both stores
- Reads still hit PostgreSQL (source of truth)
- Monitor: compare Redis vs Postgres read results for every request (shadow reads)
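The shadow-read comparison could look like this (hypothetical names; PostgreSQL stays authoritative in Phase 1 and the Redis result is only compared):

```python
def shadow_read(pg_get, redis_get, record_mismatch, token_hash):
    """Serve the PostgreSQL result; compare the Redis result off the hot path."""
    truth = pg_get(token_hash)       # source of truth during Phase 1
    shadow = redis_get(token_hash)
    if shadow != truth:
        record_mismatch(token_hash)  # feeds the "dual-write consistency errors" alert
    return truth
```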
### Phase 2: Redis-Primary Reads (Week 3)
- Flip read path to Redis-first with PostgreSQL fallback
- Backfill any misses into Redis
- Monitor: Redis hit rate should climb to 99%+ within 48 hours
### Phase 3: PostgreSQL Write Stop (Week 4)
- Stop writing new sessions to PostgreSQL
- Keep PostgreSQL data for 30 days (rollback window)
- Monitor: PostgreSQL session table should stop growing
### Phase 4: Cleanup (Week 5)
- Drop PostgreSQL session table
- Remove dual-write code and feature flags
- Update runbooks and on-call documentation
### Rollback Plan
Through Phase 2, flipping the feature flag reverts to PostgreSQL-only mode with no data loss, since PostgreSQL still holds the complete dataset. After Phase 3 stops PostgreSQL writes, rollback additionally requires replaying Redis-only sessions back into PostgreSQL (or accepting forced re-login for sessions created after the write stop).
---
## Observability and Monitoring
| Metric | Alert Threshold | Dashboard |
|---|---|---|
| Redis session lookup p99 | > 50ms for 5 min | Session Performance |
| Redis miss rate | > 5% after Phase 2 | Session Performance |
| Redis memory usage | > 80% of allocated | Infrastructure |
| Dual-write consistency errors | > 0 | Migration Health |
| Session creation/deletion rate | ±50% from baseline | Session Performance |
---
## Security Considerations
- Session tokens are already hashed before storage — no change needed
- Redis instance must be in the same VPC, no public access
- Enable Redis AUTH with a rotated password stored in secrets manager
- TLS in transit between application and Redis
- Redis RDB snapshots encrypted at rest
- Audit log: all "logout all devices" actions continue to be logged to the audit trail
---
## Open Questions
1. **Redis cluster vs. single instance?** The active working set (~2.3M sessions; the 180M PostgreSQL rows are mostly expired bloat) fits comfortably in a single 64GB instance, but cluster mode gives us HA failover. Leaning toward cluster. Need SRE input on operational overhead.
2. **Should we migrate historical sessions or only new ones?** Current active sessions (~2.3M) could be bulk-loaded in under 10 minutes. Leaning toward yes for cleaner cutover.
3. **Rate limiting interaction:** Our rate limiter also uses Redis. Should they share an instance or use separate clusters? Need to assess failure blast radius.
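If the answer to question 2 is yes, the bulk load is a simple batched loop. A sketch (helper names and the throughput figure are assumptions, not measurements): at roughly 5,000 pipelined writes/sec, ~2.3M sessions load in about 8 minutes.

```python
def backfill_active_sessions(fetch_batch, write_batch, batch_size=1000):
    """Stream active sessions out of PostgreSQL in batches and pipeline them into Redis."""
    total = 0
    while True:
        batch = fetch_batch(batch_size)  # e.g. keyset-paginated SELECT of unexpired rows
        if not batch:
            return total
        write_batch(batch)               # e.g. one Redis pipeline per batch
        total += len(batch)
```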
---
## Timeline
| Week | Milestone | Owner |
|---|---|---|
| W1 | Redis cluster provisioned, dual-write code merged | SRE + Backend |
| W2 | Shadow reads enabled, consistency monitoring live | Backend |
| W3 | Redis-primary reads flipped, backfill running | Backend |
| W4 | PostgreSQL writes stopped, 30-day retention begins | Backend + SRE |
| W5 | Cleanup: drop table, remove feature flags, update docs | Backend |