Tuesday, September 4, 2018

Designing Data-Intensive Applications - Part 3: Partitioning

Why?

Not all data fits on a single machine. So the data is split into partitions that live on different machines, and each piece of data belongs to exactly one partition. This is the main way of scaling an application's data beyond what one machine can hold.

We have to look at:

How to distribute the data among partitions
How indexes work with partitioned data
How to rebalance partitions when nodes are added or removed, or when a partition grows beyond its limit.
Partitioning is usually combined with replication, so that copies of each partition are stored on multiple nodes. Even though each piece of data belongs to exactly one partition, that partition is replicated to multiple nodes for fault tolerance.

Skewed partitions and hotspots

Ideally, we would want the data and the query load to be spread evenly across partitions. If the partitioning is unfair and some partitions end up with much more data or load than others, the partitioning is called skewed, and a partition with disproportionately high load becomes a hotspot.

Random partitioning

To make sure the data is spread evenly, we could assign records to partitions at random. The drawback is that when we need to read a particular key, we have no way of knowing which partition it is on, so we have to query all partitions, which is wasteful.
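
A minimal sketch of why random assignment forces scatter reads; the in-memory partition list and the read/write helpers are invented purely for illustration:

    import random

    NUM_PARTITIONS = 4
    partitions = [dict() for _ in range(NUM_PARTITIONS)]  # one store per partition

    def write(key, value):
        # a random partition is chosen, so nothing about the key tells us where it went
        partitions[random.randrange(NUM_PARTITIONS)][key] = value

    def read(key):
        # every partition has to be queried because the key gives no hint of its location
        for p in partitions:
            if key in p:
                return p[key]
        return None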

Key Range Partitioning

Another way of partitioning is to assign a continuous range of keys to each partition, like the volumes of an encyclopedia. The key ranges don't need to be evenly spaced; the partition boundaries should adapt to the data to avoid hotspots. The boundaries can be chosen manually by an administrator, or the software can choose them automatically.

If the keys are kept in sorted order within each partition, range scans become very easy. For example, if the data is sensor data and the timestamp is the key, it is easy to fetch all the temperature readings for a particular week. The disadvantage is that certain access patterns lead to hotspots: with a timestamp key, all writes for the current day go to the same partition while the other partitions sit idle.
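
A minimal sketch of key-range partitioning; the partition boundaries here are hand-picked for illustration, whereas a real database would adapt them to the data:

    import bisect

    # upper boundary key of each partition; keys >= the last boundary go to the final partition
    # partition 0: < "f", partition 1: ["f", "m"), partition 2: ["m", "t"), partition 3: >= "t"
    boundaries = ["f", "m", "t"]

    def partition_for(key):
        # bisect_right finds the first boundary greater than the key,
        # which is exactly the index of the partition that owns the key
        return bisect.bisect_right(boundaries, key)

    assert partition_for("apple") == 0
    assert partition_for("melon") == 2
    assert partition_for("zebra") == 3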

Hash Partitioning

To overcome the drawbacks of key range partitioning, the key can be passed through a hash function, and each partition is assigned a range of hash values. A good hash function maps even closely related keys to very different hash values, so the data is distributed evenly among partitions.

The partition boundaries can be evenly spaced or chosen pseudorandomly (the latter is sometimes called consistent hashing; it uses randomly chosen partition boundaries to avoid the need for central control or distributed consensus).

The disadvantage of hash partitioning is that range queries on the key are no longer efficient: adjacent keys are scattered across partitions, so a range query has to be sent to all of them.
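
A minimal sketch of hash partitioning with a fixed number of partitions; MD5 is used here only as an example of a stable hash, and the key values are made up:

    import hashlib

    NUM_PARTITIONS = 8
    HASH_SPACE = 2 ** 32  # hash values are mapped into [0, HASH_SPACE)

    def hash_key(key):
        # a stable hash (not Python's built-in hash()) so every node agrees on the value
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")

    def partition_for(key):
        # each partition owns one evenly spaced range of the hash space
        return hash_key(key) * NUM_PARTITIONS // HASH_SPACE

    # closely related keys land on unrelated partitions
    print(partition_for("2018-09-04T10:00:00"))
    print(partition_for("2018-09-04T10:00:01"))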

To get back some ability to do range queries, Cassandra supports a compound primary key consisting of several columns. Only the first part of the key is hashed to determine the partition; the remaining columns are used to keep the data sorted within the partition. For example, if the key is (user_id, update_timestamp), it is easy to fetch all the latest updates for a given user in timestamp order.
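
A rough sketch of the idea behind a compound key, using an in-memory model where each partition keeps its rows sorted by the second key component; the names and storage layout are illustrative, not Cassandra's actual implementation:

    import hashlib
    from collections import defaultdict

    NUM_PARTITIONS = 8

    def partition_for(user_id):
        # only the first component of the compound key is hashed
        h = int.from_bytes(hashlib.md5(user_id.encode()).digest()[:4], "big")
        return h * NUM_PARTITIONS // 2 ** 32

    # partition id -> {user_id: sorted list of (update_timestamp, update)}
    partitions = [defaultdict(list) for _ in range(NUM_PARTITIONS)]

    def write_update(user_id, update_timestamp, update):
        rows = partitions[partition_for(user_id)][user_id]
        rows.append((update_timestamp, update))
        rows.sort()  # the second key component keeps rows sorted within the partition

    def latest_updates(user_id, since):
        rows = partitions[partition_for(user_id)][user_id]
        # a range scan over update_timestamp touches only a single partition
        return [u for (ts, u) in rows if ts >= since]

    write_update("alice", "2018-09-01", "first post")
    write_update("alice", "2018-09-04", "second post")
    print(latest_updates("alice", "2018-09-02"))  # ['second post']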

Relieving hotspots:

In a social media application, a power user with a lot of followers can cause a hot key: anything they do generates a burst of activity keyed by their user id, which can overwhelm the partition that owns that key even when keys are hashed. To relieve this, the hot key can be split further by appending a random two-digit number to it, spreading the writes across 100 different keys and therefore across different partitions. Reads then have to fetch all 100 keys and combine the results, so the application has to keep track of which keys have been split, since most keys remain unsplit.
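
A minimal sketch of the two-digit split, assuming the application maintains its own set of known hot keys; the key format and helper names are invented for illustration:

    import random

    HOT_KEYS = {"user:12345"}  # keys the application has decided to split

    def storage_key(key):
        if key in HOT_KEYS:
            # spread writes for a hot key across 100 sub-keys, and hence partitions
            return f"{key}#{random.randrange(100):02d}"
        return key

    def keys_to_read(key):
        if key in HOT_KEYS:
            # reads must gather all 100 sub-keys and merge the results
            return [f"{key}#{i:02d}" for i in range(100)]
        return [key]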

Secondary indexes:

We have seen how data is partitioned by its primary key, but most databases also support secondary indexes. A secondary index lets us query the data by some other field or term.

There are two ways to partition the secondary index.

Each partition can maintain its own secondary index covering only the documents in that partition, called a local (document-partitioned) index. Writes are fast because a write only updates the indexes of one partition, but reads are expensive: the documents matching a given term may live on any partition, so the query has to be sent to all partitions in parallel (scatter) and the results combined (gather). See the sketch after this list.
Another option is for the secondary index to be kept separate from the primary data and partitioned by the indexed term itself, so that all document ids for a given term live in one place, called a global (term-partitioned) index. Reads and range scans over a term are efficient because only the partition owning that term has to be queried, but writes are slower and more complicated: a single document update may touch index partitions on several different nodes.
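
A minimal sketch of a local (document-partitioned) secondary index with a scatter/gather read; the document model and partition layout are invented for illustration:

    from collections import defaultdict

    NUM_PARTITIONS = 4
    # each partition holds its documents plus a local index: term -> doc ids in this partition only
    docs = [dict() for _ in range(NUM_PARTITIONS)]
    local_index = [defaultdict(set) for _ in range(NUM_PARTITIONS)]

    def write_doc(partition, doc_id, doc):
        docs[partition][doc_id] = doc
        for term in doc["tags"]:
            local_index[partition][term].add(doc_id)  # only this partition's index is updated

    def search(term):
        results = []
        for p in range(NUM_PARTITIONS):          # scatter: ask every partition
            for doc_id in local_index[p].get(term, ()):
                results.append(docs[p][doc_id])
        return results                           # gather: combine the partial results

    write_doc(0, 1, {"color": "red", "tags": ["red", "car"]})
    write_doc(3, 2, {"color": "red", "tags": ["red", "truck"]})
    print(search("red"))  # has to look at all partitions
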
Rebalancing Partitions:

Rebalancing is needed when new nodes are added to the cluster or when old or failed nodes are removed.

Whatever strategy is used, rebalancing should satisfy some minimum requirements:

After rebalancing, the load should be shared fairly between the nodes
While rebalancing is happening, the database should continue accepting reads and writes
No more data than necessary should be moved between nodes.
Rebalancing Strategies

Hash mod N is a bad strategy because if the number of nodes N changes, most of the keys have to be moved to a different node (the sketch after this list shows how much data moves).
Fixed number of partitions: create many more partitions than there are nodes, for example 1,000 partitions for 10 nodes, with 100 partitions assigned to each node. When a new node is added, it steals a few partitions from every existing node; removing a node works the same way in reverse. Only whole partitions move between nodes, and the assignment of keys to partitions never changes. The partitions also need not be spread evenly: bigger machines can take a larger share of partitions than the others.
Dynamic Partitioning: for databases that use key range partitioning, a fixed number of partitions with badly chosen boundaries could leave all the data in one partition. Such databases can instead create partitions dynamically: the data starts in a single partition, and whenever a partition grows beyond a configured threshold it is split in two; partitions can also be merged if a large number of keys are deleted. The disadvantage is that initially all requests are served by a single node, so the database can be pre-split into an initial set of partitions and grow from there. Either way, the number of partitions adapts to the total data volume.
Partitioning proportionally to nodes: with a fixed number of partitions, the size of each partition grows with the dataset, whereas with dynamic partitioning the number of partitions grows with the dataset. A third option, used by Cassandra, is to make the number of partitions proportional to the number of nodes: there is a fixed number of partitions per node, and when a new node joins the cluster it splits some existing partitions and takes over part of them, so the size of individual partitions shrinks as the cluster grows.
Rebalancing can be fully automatic or manual, but since it is an expensive operation and things can go wrong, it is better to have an operator in the loop to review the rebalancing before it is carried out.
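
A small sketch (referenced in the list above) contrasting hash mod N with a fixed number of partitions; the key set, partition counts, and the choice of which partitions the new node steals are all made up to illustrate how much data each approach moves when an 11th node is added to a 10-node cluster:

    import hashlib

    def stable_hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

    keys = [f"key-{i}" for i in range(10_000)]

    # hash mod N: going from 10 to 11 nodes changes the owner of most keys
    moved = sum(1 for k in keys if stable_hash(k) % 10 != stable_hash(k) % 11)
    print(f"hash mod N: {moved / len(keys):.0%} of keys move")  # typically around 90%

    # fixed number of partitions: keys stay in their partition forever;
    # the new node only steals whole partitions (about 1/11 of them) from the old nodes
    NUM_PARTITIONS = 1000
    partition_of = {k: stable_hash(k) * NUM_PARTITIONS // 2 ** 32 for k in keys}
    stolen_partitions = set(range(0, NUM_PARTITIONS, 11))  # roughly 1/11 of the partitions
    moved = sum(1 for k in keys if partition_of[k] in stolen_partitions)
    print(f"fixed partitions: {moved / len(keys):.0%} of keys move")  # roughly 9%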

Request Routing:

Since the data is now split across partitions on different nodes, how does a client know which node to connect to for a given key? This is an instance of the more general service discovery problem, and there are multiple tools for it.

Approaches to solve this are

Clients can connect to any node. If that node owns the partition for the requested key, it handles the request; otherwise it forwards the request to the right node, gets the result, and passes it back to the client.
Clients can send all requests to a routing tier, which determines the node responsible for each request and forwards it accordingly.
Clients can be aware of the partitioning and the assignment of partitions to nodes, so they can connect directly to the right node without any intermediary. A sketch of such a partition-aware client follows this list.
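
A minimal sketch of a partition-aware client that caches the assignment of partitions to nodes; the node addresses and helper names are invented for illustration, and in practice the mapping would come from a routing tier, a coordination service, or gossip:

    import hashlib

    class PartitionAwareClient:
        def __init__(self, num_partitions, assignment):
            # assignment: partition id -> "host:port" of the node currently owning it
            self.num_partitions = num_partitions
            self.assignment = assignment

        def node_for(self, key):
            h = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
            partition = h * self.num_partitions // 2 ** 32
            return self.assignment[partition]

        def on_rebalance(self, new_assignment):
            # called whenever the cluster moves partitions between nodes
            self.assignment = new_assignment

    client = PartitionAwareClient(4, {0: "10.0.0.1:9042", 1: "10.0.0.2:9042",
                                      2: "10.0.0.3:9042", 3: "10.0.0.4:9042"})
    print(client.node_for("user:12345"))  # connect directly, no intermediary hop
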
When partitions are rebalanced, the routing tier or the partition-aware client has to learn about the change. Keeping everyone in agreement about this mapping is fundamentally a consensus problem, and there are different solutions.

One solution is to use a coordination service like ZooKeeper to maintain the authoritative mapping of partitions to nodes. The routing tier or the clients can subscribe to changes in this mapping.

Some databases like Cassandra and Riak instead use a gossip protocol among the nodes to spread changes in cluster state; clients can send requests to any node, and the cluster forwards the request to the right node as required.

Source: https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/
