TechLead
Lesson 5 of 30
7 min read
System Design

Database Sharding

Learn database sharding strategies including hash-based, range-based, and directory-based sharding, consistent hashing, and real-world examples from Instagram and Discord.

What Is Database Sharding?

Database sharding (also called horizontal partitioning) is the practice of splitting a large database into smaller, more manageable pieces called shards. Each shard is an independent database that holds a subset of the total data. Together, the shards make up the complete dataset.

Sharding is typically the last resort for database scaling — you should first optimize queries, add caching, vertically scale, and add read replicas. But when your data grows beyond what a single machine can handle (typically in the hundreds of gigabytes to terabytes range), or your write throughput exceeds what a single primary can support, sharding becomes necessary.

When to Consider Sharding

  • Data volume: Your data exceeds the storage capacity of a single database server.
  • Write throughput: A single primary database cannot handle the write volume.
  • Query performance: Indexes become too large to fit in memory, degrading query performance.
  • Geographic distribution: You need data to be physically close to users in different regions.

Sharding Strategies

1. Hash-Based Sharding

Apply a hash function to the shard key (e.g., user ID) and use the result to determine which shard stores the data. The most common approach is shard_number = hash(shard_key) % number_of_shards.

// Hash-based sharding
function getShardIndex(userId: string, totalShards: number): number {
  // Simple hash function
  let hash = 0;
  for (let i = 0; i < userId.length; i++) {
    const char = userId.charCodeAt(i);
    hash = ((hash << 5) - hash) + char;
    hash = hash & hash; // Convert to 32-bit integer
  }
  return Math.abs(hash) % totalShards;
}

// Usage
const shardIndex = getShardIndex('user_12345', 4);
// Routes to shard 0, 1, 2, or 3 based on the hash

class ShardedDatabase {
  private shards: DatabaseConnection[];

  constructor(shardConnections: DatabaseConnection[]) {
    this.shards = shardConnections;
  }

  async getUser(userId: string): Promise<User | null> {
    const shardIndex = getShardIndex(userId, this.shards.length);
    const shard = this.shards[shardIndex];
    // query() returns rows; take the first (or null if the user doesn't exist)
    const rows = await shard.query('SELECT * FROM users WHERE id = $1', [userId]);
    return rows[0] ?? null;
  }

  async createUser(user: User): Promise<void> {
    const shardIndex = getShardIndex(user.id, this.shards.length);
    const shard = this.shards[shardIndex];
    await shard.query('INSERT INTO users (id, name, email) VALUES ($1, $2, $3)',
      [user.id, user.name, user.email]);
  }
}
  • Pros: Even data distribution (if the hash function is good), simple to implement, no need for a lookup table.
  • Cons: Adding or removing shards requires rehashing all data (unless you use consistent hashing). Cross-shard queries are difficult. Range queries on the shard key are not possible.

2. Range-Based Sharding

Partition data based on ranges of the shard key. For example, users with IDs 1-1,000,000 go to shard 1, IDs 1,000,001-2,000,000 go to shard 2, and so on. Another common approach is to shard by date range — each month or year gets its own shard.

// Range-based sharding
interface ShardRange {
  min: number;
  max: number;
  connection: DatabaseConnection;
}

class RangeShardedDatabase {
  private ranges: ShardRange[];

  constructor(ranges: ShardRange[]) {
    // Ranges must be sorted and non-overlapping
    this.ranges = ranges.sort((a, b) => a.min - b.min);
  }

  getShardForKey(key: number): DatabaseConnection {
    for (const range of this.ranges) {
      if (key >= range.min && key <= range.max) {
        return range.connection;
      }
    }
    throw new Error(`No shard found for key: ${key}`);
  }

  // Range queries are efficient because they often hit a single shard
  async getUsersInRange(startId: number, endId: number): Promise<User[]> {
    const relevantShards = this.ranges.filter(
      range => range.max >= startId && range.min <= endId
    );

    const results = await Promise.all(
      relevantShards.map(shard =>
        shard.connection.query(
          'SELECT * FROM users WHERE id BETWEEN $1 AND $2',
          [Math.max(startId, shard.min), Math.min(endId, shard.max)]
        )
      )
    );

    return results.flat();
  }
}
  • Pros: Range queries are efficient (data in a range is likely on the same shard), easy to understand, new shards can be added for new ranges without resharding existing data.
  • Cons: Prone to hotspots (e.g., the shard with the newest data gets all the writes), uneven distribution if data is not uniformly distributed across the range.

3. Directory-Based Sharding

Maintain a lookup table (directory) that maps each shard key to its shard. This is the most flexible approach because you can move data between shards by simply updating the directory, but the directory itself becomes a critical component and potential bottleneck.

// Directory-based sharding
class DirectoryShardedDatabase {
  private directory: Redis; // Fast lookup store
  private shards: Map<string, DatabaseConnection>;

  async getShardForUser(userId: string): Promise<DatabaseConnection> {
    // Look up which shard this user belongs to
    const shardId = await this.directory.get(`user-shard:${userId}`);

    if (!shardId) {
      // New user - assign to the least loaded shard
      const targetShard = await this.getLeastLoadedShard();
      await this.directory.set(`user-shard:${userId}`, targetShard);
      return this.shards.get(targetShard)!;
    }

    return this.shards.get(shardId)!;
  }

  // Moving a user to a different shard is easy
  async migrateUser(userId: string, targetShardId: string): Promise<void> {
    const currentShardId = await this.directory.get(`user-shard:${userId}`);
    const currentShard = this.shards.get(currentShardId!)!;
    const targetShard = this.shards.get(targetShardId)!;

    // Copy data to the new shard (query() returns rows, so take the first)
    const [userData] = await currentShard.query('SELECT * FROM users WHERE id = $1', [userId]);
    await targetShard.query('INSERT INTO users (id, name, email) VALUES ($1, $2, $3)', [
      userData.id, userData.name, userData.email
    ]);

    // Update directory
    await this.directory.set(`user-shard:${userId}`, targetShardId);

    // Delete from old shard
    await currentShard.query('DELETE FROM users WHERE id = $1', [userId]);
  }
}

Sharding Strategy Comparison

Aspect          | Hash-Based         | Range-Based    | Directory-Based
----------------|--------------------|----------------|-----------------
Distribution    | Even               | Can be uneven  | Controlled
Range queries   | Difficult          | Efficient      | Depends
Adding shards   | Requires rehashing | Add new range  | Update directory
Hotspot risk    | Low                | High           | Low
Complexity      | Low                | Medium         | High

Consistent Hashing for Sharding

The problem with basic hash-based sharding (hash(key) % N) is that when you add or remove a shard, the modulus changes and almost all keys need to be remapped. Consistent hashing solves this by ensuring that only K/N keys need to be remapped when a shard is added or removed (where K is the total number of keys and N is the number of shards).

With consistent hashing, both servers and keys are placed on a hash ring. Each key is assigned to the nearest server clockwise on the ring. When a server is added or removed, only the keys between it and the previous server on the ring are affected.
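The ring can be sketched in a few lines of TypeScript. This is a minimal illustration, not a production implementation: the FNV-1a hash and the virtual-node count of 100 are illustrative choices (real systems use many virtual nodes per server to smooth the distribution).

```typescript
// Minimal consistent-hash ring with virtual nodes (illustrative sketch)
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // unsigned 32-bit position on the ring
}

class HashRing {
  private ring: { position: number; node: string }[] = [];

  constructor(nodes: string[], private vnodes = 100) {
    nodes.forEach(node => this.addNode(node));
  }

  addNode(node: string): void {
    // Place several virtual nodes per server to even out the distribution
    for (let v = 0; v < this.vnodes; v++) {
      this.ring.push({ position: fnv1a(`${node}#${v}`), node });
    }
    this.ring.sort((a, b) => a.position - b.position);
  }

  removeNode(node: string): void {
    this.ring = this.ring.filter(entry => entry.node !== node);
  }

  // The key's owner is the first ring entry clockwise from its position
  // (wrapping around to the start of the ring if necessary)
  getNode(key: string): string {
    const position = fnv1a(key);
    const entry = this.ring.find(e => e.position >= position) ?? this.ring[0];
    return entry.node;
  }
}
```

Note the key property: adding a node can only move keys *to* the new node; every other key keeps its existing owner.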

Cross-Shard Queries

One of the biggest challenges with sharding is handling queries that span multiple shards. If a query needs data from multiple shards, you must query each shard individually, combine the results, and sort/filter them in the application layer. This is known as a scatter-gather pattern.

Cross-Shard Challenges

  • Joins across shards: SQL JOINs between tables on different shards are not possible. You must denormalize data or perform joins in application code.
  • Transactions: ACID transactions across shards require distributed transaction protocols (like 2PC) which are slow and complex.
  • Aggregations: COUNT, SUM, AVG across all shards require querying every shard and combining results.
  • Unique constraints: Enforcing uniqueness across shards requires a global lookup or a central authority.
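The scatter-gather pattern for aggregations can be sketched as follows. This is a minimal sketch; `DatabaseConnection` is a hypothetical interface standing in for your actual driver. Note that AVG cannot simply be averaged across shards — you must gather per-shard SUM and COUNT and divide once at the end.

```typescript
// Hypothetical driver interface, for illustration only
interface DatabaseConnection {
  query(sql: string, params?: unknown[]): Promise<any[]>;
}

async function countAllUsers(shards: DatabaseConnection[]): Promise<number> {
  // Scatter: run the partial COUNT on every shard in parallel
  const partials = await Promise.all(
    shards.map(shard => shard.query('SELECT COUNT(*) AS n FROM users'))
  );
  // Gather: combine the partial counts in the application layer
  return partials.reduce((total, rows) => total + Number(rows[0].n), 0);
}

async function averageUserAge(shards: DatabaseConnection[]): Promise<number> {
  // Averaging per-shard AVGs would weight shards incorrectly;
  // gather SUM and COUNT instead and divide once
  const partials = await Promise.all(
    shards.map(s => s.query('SELECT SUM(age) AS s, COUNT(*) AS n FROM users'))
  );
  const sum = partials.reduce((acc, rows) => acc + Number(rows[0].s), 0);
  const n = partials.reduce((acc, rows) => acc + Number(rows[0].n), 0);
  return sum / n;
}
```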

Real-World Sharding Examples

Instagram's Sharding Strategy

Instagram shards their PostgreSQL database by user ID. Each logical shard maps to a PostgreSQL schema within a larger database, and multiple schemas live on each physical server. Their shard key is embedded in their globally unique photo IDs using a custom ID generation scheme:

  • 41 bits for timestamp (milliseconds since custom epoch)
  • 13 bits for logical shard ID
  • 10 bits for auto-incrementing sequence

This allows them to determine the shard for any photo just by looking at its ID, without a lookup table. They use thousands of logical shards distributed across a smaller number of physical servers, making it easy to rebalance by moving logical shards between physical machines.
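The bit layout above can be sketched with BigInt arithmetic. This is a sketch of the scheme as described, not Instagram's actual code; the epoch constant is illustrative.

```typescript
// Instagram-style 64-bit ID: 41 bits timestamp | 13 bits shard | 10 bits sequence
const CUSTOM_EPOCH = 1314220021721n; // illustrative custom epoch, in milliseconds

function makeId(timestampMs: bigint, shardId: bigint, sequence: bigint): bigint {
  return ((timestampMs - CUSTOM_EPOCH) << 23n) | (shardId << 10n) | sequence;
}

// The shard can be recovered from any ID without a lookup table:
// shift past the 10 sequence bits, then mask the 13 shard bits
function shardIdFromId(id: bigint): bigint {
  return (id >> 10n) & 0x1fffn;
}
```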

Discord's Message Storage

Discord shards messages by channel ID. All messages in a channel live on the same shard, which keeps channel-level queries fast. They started with MongoDB but migrated to Cassandra as they scaled. Their shard key is (channel_id, message_id), which ensures messages within a channel are stored together on disk for efficient retrieval.

However, they eventually moved from Cassandra to ScyllaDB (a C++ rewrite of Cassandra) because Cassandra's garbage collection pauses caused latency spikes. This highlights an important lesson: sharding strategy and database technology choice are interconnected decisions.

Sharding Best Practices

  • Choose the shard key carefully. It should distribute data evenly and align with your most common query patterns.
  • Avoid resharding if possible. Use consistent hashing and start with more logical shards than you need, mapped to fewer physical servers.
  • Denormalize data. Accept some data duplication to avoid cross-shard joins.
  • Test with production-like data. Sharding behavior can vary dramatically based on data distribution.
  • Have a migration plan. You will need to add shards eventually. Plan for how you will migrate data with minimal downtime.
