Database sharding decision framework with example schemas
Prompt
You are a database architect. Write a decision framework for when to shard a [database_engine] database for a [workload_type] workload, and sketch concrete shard keys for [example_tables].
Structure:
1. Signals that you actually need sharding (not just a read replica)
2. Choosing a shard key: hash vs range vs directory
3. Worked example for each table with the chosen shard key and why
4. Rebalancing strategy and common gotchas
Be opinionated and specific.
Example output
Sharding is expensive, irreversible in practice, and usually the wrong answer for the first three years of a product. Before reaching for it on a PostgreSQL database running a social/content workload across users, posts, and comments, confirm the real signals: a single primary is past 70% CPU for sustained periods, the largest table is above roughly 500GB with active write hotspots, replica lag regularly breaches acceptable thresholds under normal load, and you have already exhausted partitioning, connection pooling, aggressive indexing, and a read-replica tier. If those tools still leave headroom, do not shard. Vertical scale and Postgres native partitioning will buy you another year cheaper than any sharding project.
Once the signals are real, the shard key becomes the single most consequential decision in the system. Three patterns dominate. Hash sharding — compute hash(key) % N — spreads load evenly and prevents hotspots, but makes range queries expensive and makes resharding painful. Range sharding — rows 1-1M go to shard A, 1M-2M to shard B — makes range scans trivial but almost always creates write hotspots on the newest range. Directory sharding uses an explicit lookup table that maps entity ids to shards; it is the most flexible and the most operationally complex, because the directory itself becomes a critical path.
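The three patterns can be sketched as routing functions. This is an illustrative sketch, not a production router: the shard count, bounds, and the in-memory dict standing in for a real directory service are all assumptions.

```python
import hashlib

NUM_SHARDS = 8

def hash_shard(key: str) -> int:
    # Hash sharding: even spread, but range scans touch every shard
    # and changing NUM_SHARDS remaps most keys.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

# Range sharding: upper bounds per shard. Scans over an id range are
# cheap, but the newest range absorbs all the insert traffic.
RANGE_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]

def range_shard(row_id: int) -> int:
    for shard, upper in enumerate(RANGE_UPPER_BOUNDS):
        if row_id < upper:
            return shard
    return len(RANGE_UPPER_BOUNDS)  # newest rows land on the last shard

# Directory sharding: explicit entity -> shard map. Any row can be
# moved individually, but the directory itself is now a critical path.
DIRECTORY: dict[str, int] = {}

def directory_shard(entity_id: str) -> int:
    if entity_id not in DIRECTORY:
        DIRECTORY[entity_id] = hash_shard(entity_id)  # default placement
    return DIRECTORY[entity_id]
```

Note how the hash variant has no per-entity state at all, while the directory variant trades a lookup on every request for the freedom to relocate single entities later.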
For the three example tables: Users should be hash-sharded on user_id. Access patterns are random, the id space is well-distributed, and nothing about user access is naturally range-ordered. Posts should be directory-sharded by author_id with posts colocated on the same shard as their author, because the common query is "posts by user" and co-location avoids cross-shard joins. Comments should follow their parent post — same shard as the post_id — so a post detail page hits exactly one shard for post plus all comments.
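The co-location rules above reduce to one invariant: posts and comments are always routed by the author's id, never their own. A minimal sketch (function names are illustrative, and the directory indirection for posts is collapsed to a stable hash default):

```python
import hashlib

NUM_SHARDS = 8

def shard_of(key: str) -> int:
    # Stable hash so every service computes the same placement.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

def user_shard(user_id: int) -> int:
    return shard_of(f"user:{user_id}")

def post_shard(author_id: int) -> int:
    # Posts route by author_id, not post_id, so "posts by user"
    # is always a single-shard query.
    return user_shard(author_id)

def comment_shard(post_author_id: int) -> int:
    # Comments follow their parent post, so a post detail page
    # (post plus all comments) is a single-shard read.
    return post_shard(post_author_id)
```

The price of this invariant is that every comment write must know the post's author id, which usually means denormalising author_id onto the comments table.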
Rebalancing is where most teams get hurt. Plan for it on day one: use consistent hashing (or a directory lookup) so that adding a shard moves only the slice of data the new shard should own, roughly 1/N of the total, rather than reshuffling nearly everything the way naive hash % N placement does. Build a dual-write migration path and a shadow-read verification tool before you need them. Log every cross-shard query in staging and treat its existence as a design bug. The hardest gotcha in practice is transactions that were implicitly atomic across tables suddenly spanning shards — the application must learn to tolerate eventual consistency or explicitly route related writes to the same shard via the chosen key.
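To make the rebalancing math concrete, here is a small simulation (vnode count and key set are arbitrary assumptions, not a recommendation) comparing how many keys move when growing from 8 to 9 shards under a consistent-hash ring versus naive hash % N placement:

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring with virtual nodes per shard."""
    def __init__(self, shards, vnodes=64):
        self.points = sorted(
            (h(f"{s}#{v}"), s) for s in shards for v in range(vnodes)
        )
        self.hashes = [p for p, _ in self.points]

    def owner(self, key: str) -> str:
        # First ring point at or after the key's hash, wrapping around.
        i = bisect.bisect(self.hashes, h(key)) % len(self.points)
        return self.points[i][1]

keys = [f"user:{i}" for i in range(5_000)]

ring8 = Ring([f"s{i}" for i in range(8)])
ring9 = Ring([f"s{i}" for i in range(9)])
moved_ring = sum(ring8.owner(k) != ring9.owner(k) for k in keys) / len(keys)

# Naive modulo placement: a key stays put only if its residue happens
# to agree mod 8 and mod 9, which is rare.
moved_mod = sum(h(k) % 8 != h(k) % 9 for k in keys) / len(keys)

print(f"ring: {moved_ring:.0%} moved, mod: {moved_mod:.0%} moved")
# The ring moves roughly the 1/9 of keys the new shard should own;
# modulo placement moves close to 8/9 of everything.
```

The same asymmetry is why mid-project shard-count changes are survivable with a ring or directory and catastrophic with bare modulo arithmetic.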