Implementation and understanding of consistent hashing

This entry is part of the cache system entries. We have
our LRU cache algorithm implemented, today we
are going to explain the way the data is splitted around the nodes (caches), this
process has the name of sharding.

A hashing algorithm is only a process where we get a position of the data based
on the hash. This is commonly used in hash tables (dictionaries and stuff) in
programming languages for example.

Before explaining the algorithm we selected, we will explain a regular sharding
based on an example.

Regular Sharding

We have 3 nodes (node 0, node 1, node 2). We will store some sort of data in these nodes.
The data keys (identifiers) are letters. So...
we want to split the data around the 3 nodes, the splitting should be uniform
around the 3 nodes. To do that we have a hashing function based on the received
key. This is, we take the key, get his number in the ascii table
and take the module of the number of nodes. For example for letter "a", this
will be 67 % 3 = 1. So data "a" will be stored in node 1. As always an image
is always a good idea.

This implementation sounds perfect, the data is distributed around the cluster,
and is uniform more or less... Besides, it has some problems. At the beggining of
the first entry we said the need adding and
removing nodes with the least data lose.

This method has a problem, when we add or remove a node to the cluster, the
number of nodes change, this means that the hashing function result for the same
data will be different, so imagine that we add a new node, and afterwards a client
asks for the letter "a" data, this means: 67 % 4 = 3 You see where the problem is right?
the data identified by letter "a" is in node 1 but now the hashing function says
that is in node 3.

regular sharding 4 nodes

As we see in the picture the data is different everywhere! What we do now? well...
we have to distribute again our data around the node, that's not consistent.
But hey, we know how to solve this, (more or less), lets see the amazing consistent hashing algorithm.

Consistent hashing

First of all, here is the paper
describing the algorithm, you should check it if you want to dive more in to the explanation of this algorithm ;)

Consistent hashing aim is to provide a uniform data sharding around the cluster
with the least data lose.

The first thing to understand this, is picturing the cluster as a ring. Each
node will have a key (we will use it in the hashing later on),
for example, a common key is the network address of the node (

I will help you picturing it ;)


Once we have pictured the real nodes in the ring, the next step is to add virtual
nodes to the ring. You will be asking... What is a virtual node? a Virtual node
is a replica of a node in terms of position in the ring, not in terms of real
replica as a data backup, in other words. Each node will be in different places of the ring
at the same time (later on we will explain what is a place in the ring and how to get it)
we add virtual nodes in terms of replica variable, this variable says how many
"nodes" will be per node, a good number is 3 (we should not put in a
ring more that 100 in total of all the nodes). A picture, will clear things up.

virtual ring

Now the magic. First our hash function, this will be CRC32(we could use any
function which give us a comparable unit as result, in this case is an integer).
First of all we need to place our "nodes" in the ring, to do that we hash the key
of each node. We said that the key of each node will be the location, so having 3
virtual nodes means that the 3 will have the same key... well no... each node will
have the replica number in the prefix of the key, for example for node 0 will be
the keys: 0-, 1-, 2- This

will give us as result the hashes: 2023508419, 3606370386, 4282150048. After
having all the hashes of all the nodes we will put the nodes in clockwise
direction order. This will be the resulting ring:

ring placed

Now is as simple as hashing the key of each data and set the resulting data in the
ring. so if a data key hash is 3900000000 this will be placed between 3861339617(node 1)
and 4282150048(node 0), the direction is clockwise so this data will be stored in
node 0.

ring placed data

As we see in the picture the data has been splitted only between 2 nodes, but this is
normal, as the data number grows, the distribution will be more uniform (we only
have distribute 12 items, that's nothing...).

Consistent hashing allows us to add and remove nodes without affecting all the data,
normally is a 25%-30% of the data. Imagine that we remove node 0 from our ring.
The other nodes data will remain intact, because now the hash depends on a key by host and
that's a constant. So in other words we lose 5 of 12 items, that's a 41%. "Hey, but you said between 25% and 35%" As I said
before, we have few items, this isn't the common case in the ring.

I think that is the perfect moment to start implementing the ring ;)

Consistent hashing implementation

As always we will start with the skeleton of the implementation.

We don't need more methods, lets describe the requirement variables to have
our ring logic,

First we have _hash this is the hashing function, this will be the function that we will call when
we want to get the hash for a key (its the same function for node keys and item keys).

Then we have _replicas this will be the number of replicas in the ring for each node.

The _keys will be a double linked list that will have the positions (hashes) of the nodes in the ring,
this list will be sorted.

And finally a _hash_map that has the keys identified by the hash so we can retrieve the original key with the hash,
here we will store the hashes of the node keys as keys and the original node keys as values, for example {3011211833: ""}

One of the most important methods (well we have 4 :P) is the add. First we
receive a list of nodes as parameters. We iterate over each one and start the
logic of adding a node to the ring. For each node we add N number of nodes
(replica/virtual node) where N is th numebr of replicas. For each replica we get
the hash of the key in the format: {N}{host}:{port}, we add this hash to the
linked list, and then add to the dict with the hash as key, and the node key as value,
this will allow us to get the key (and network address ;) with the hash. Finally
the ring is sorted by value

When we remove a node we do the opposite, very simple. We itearete for
each received node, for each node we iterate over the number of replicas, we get
the hash of each replica and finally we delete from the linked list and dict.
We don't have to sort because when we remove from the list the sorting isn't modified

Finally the method that give us where to place an item in the ring :D. get gives
us the node key (network address) of the node where the data should be placed
based on the key of the item/data (in the example of this entry a, b, c...),
for example an item for the key a could be: {"a": "This is the data to store"}.

The method doesn't receive the data, it only receives the key. Take into account that this class
only manages the ring, doesn't manage data storing, that was explained in the last
, so the data handling should be managed in a place where the LRU container and the consistent hashing ring
are managed as a whole (we will talk how to do this in the next entry).

get receives the key, if the ring is empty (no nodes) then it returns nothing, if
there are nodes in the ring then starts iterating over the linked list to see wich node hash
is greater than the key hash, when it finds this hash it returns the key(node address) of the hash (stored in the dict).
So the fancy thing of sorting the linked list when we add a node to the ring is for this,
(iterate from the lowest hash). If the iteration has finished and we haven't found a hash greater than the hash of the key
this means that we are at the beggining of the ring, so we return the node with the lowest hash (in a sorted list is the first element).

As you can see is very simple to implement when you understand the algorithm. Here
you have all the code at once for convenience:

This algorithm is used in many good projects like groupcache, its implementation
is in Go (My second favorite language after Python, yo should check it ;)

See you in the next entry where we will discuss how to use this ring and the LRU
container to get our node running [To be updated]