# Handling failures

As with any large system at scale, failures are not only inevitable but common. Handling failure scenarios is very important. In this section, we first introduce techniques to detect failures. Then, we go over common failure resolution strategies.&#x20;

### Failure detection&#x20;

In a distributed system, it is insufficient to believe that a server is down because another server says so. Usually, it requires at least two independent sources of information to mark a server down.&#x20;

As shown in the picture below, all-to-all multicasting is a straightforward solution. However, this is inefficient when many servers are in the system.

<figure><img src="/files/CYwmExZ7709U8Fnr84M4" alt=""><figcaption></figcaption></figure>

A better solution is to use decentralized failure detection methods like gossip protocol. Gossip protocol works as follows:&#x20;

* Each node maintains a node membership list, which contains member IDs and heartbeat counters.&#x20;
* Each node periodically increments its heartbeat counter.&#x20;
* Each node periodically sends heartbeats to a set of random nodes, which in turn propagate to another set of nodes.&#x20;
* Once nodes receive heartbeats, the membership list is updated to the latest info.&#x20;
* If the heartbeat has not increased for more than predefined periods, the member is considered offline.

<figure><img src="/files/0iu0tUvRv5IQzJpHzOe9" alt=""><figcaption></figcaption></figure>

* Node s0 maintains a node membership list shown on the left side.&#x20;
* Node s0 notices that node s2’s (member ID = 2) heartbeat counter has not increased for a long time.&#x20;
* Node s0 sends heartbeats that include s2’s info to a set of random nodes. Once other nodes confirm that s2’s heartbeat counter has not been updated for a long time, node s2 is marked down, and this information is propagated to other nodes.

### Handling temporary failures&#x20;

After failures have been detected through the gossip protocol, the system needs to deploy certain mechanisms to ensure availability. In the strict quorum approach, read and write operations could be blocked as illustrated in the quorum consensus section.&#x20;

A technique called “sloppy quorum” \[4] is used to improve availability. Instead of enforcing the quorum requirement, the system chooses the first W healthy servers for writes and the first R healthy servers for reads on the hash ring. Offline servers are ignored.&#x20;

If a server is unavailable due to network or server failures, another server will process requests temporarily. When the down server is up, changes will be pushed back to achieve data consistency. This process is called hinted handoff. Since s2 is unavailable in the picture below, reads and writes will be handled by s3 temporarily. When s2 comes back online, s3 will hand the data back to s2.

<figure><img src="/files/sE3hjA5Re4dbbAqa02zQ" alt=""><figcaption></figcaption></figure>

### Handling permanent failures

Hinted handoff is used to handle temporary failures. What if a replica is permanently unavailable? To handle such a situation, we implement an anti-entropy protocol to keep replicas in sync. Anti-entropy involves comparing each piece of data on replicas and updating each replica to the newest version. A Merkle tree is used for inconsistency detection and minimizing the amount of data transferred.&#x20;

Quoted from Wikipedia: “A hash tree or Merkle tree is a tree in which every non-leaf node is labeled with the hash of the labels or values (in case of leaves) of its child nodes. Hash trees allow efficient and secure verification of the contents of large data structures”.&#x20;

Assuming the key space is from 1 to 12, the following steps show how to build a Merkle tree. Highlighted boxes indicate inconsistency.&#x20;

**Step 1:** Divide key space into buckets (4 in our example) as shown in the picture below. A bucket is used as the root level node to maintain a limited depth of the tree.

<figure><img src="/files/YtvxwgZlMa7TIRscBJJC" alt=""><figcaption></figcaption></figure>

**Step 2:** Once the buckets are created, hash each key in a bucket using a uniform hashing method

<figure><img src="/files/DuthH0ENpsq4eJYOwmbd" alt=""><figcaption></figcaption></figure>

**Step 3:** Create a single hash node per bucket

<figure><img src="/files/P3dqKhESowR8paebuL9A" alt=""><figcaption></figcaption></figure>

**Step 4:** Build the tree upwards to root by calculating hashes of children

<figure><img src="/files/WUtB99rRTZ5ktnlwtoHm" alt=""><figcaption></figcaption></figure>

To compare two Merkle trees, start by comparing the root hashes. If root hashes match, both servers have the same data. If root hashes disagree, then the left child hashes are compared followed by the right child hashes. You can traverse the tree to find which buckets are not synchronized and synchronize those buckets only.&#x20;

Using Merkle trees, the amount of data needed to be synchronized is proportional to the differences between the two replicas, and not the amount of data they contain. In real-world systems, the bucket size is quite big. For instance, a possible configuration is one million buckets per one billion keys, so each bucket only contains 1000 keys.

### Handling data center outage&#x20;

Data center outage could happen due to power outage, network outage, natural disaster, etc. To build a system capable of handling data center outage, it is important to replicate data across multiple data centers. Even if a data center is completely offline, users can still access data through the other data centers.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://huy312100.gitbook.io/software-development/system-design/fundamental/design-key-value-store/system-components/handling-failures.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
