Memcached SPOF Mystery

Tuesday, 20 April 2010

At Twitter, we use memcached to speed up page loads and alleviate database load. We run many memcached hosts, and to make the system robust our memcached clients use consistent hashing and enable the “auto_eject_hosts” option. With this many hosts and this configuration, one would assume that losing a single memcached host wouldn’t even be noticeable, right? Unfortunately, we see elevated robots (the error pages we serve when requests fail with 500s) whenever a memcached host dies or is taken out of the pool, and the system does not recover on its own unless the host is brought back. Essentially, every memcached host is a single point of failure.
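For the curious, here is a rough sketch of that client setup using the Ruby memcached gem. The host names and tuning values below are made up for illustration and are not our production settings:

```ruby
require 'memcached'

# Hypothetical pool of cache hosts, for illustration only.
HOSTS = (1..50).map { |i| "cache#{i}.example.com:11211" }

CACHE = Memcached.new(HOSTS,
  :distribution         => :consistent_ketama, # consistent hashing
  :auto_eject_hosts     => true,               # temporarily drop hosts that look dead
  :server_failure_limit => 2,                  # consecutive failures before ejecting a host
  :retry_timeout        => 30                  # seconds before an ejected host is retried
)

CACHE.set("timeline:12345", "cached page fragment")
CACHE.get("timeline:12345")
```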

[Chart: error rates during a recent memcached host crash]

This is what we observed when a memcached host crashed recently: web 502/503s spiked and then recovered, while web and API 500 errors stayed at a sustained, elevated rate.

Why was this happening? At first, we thought the elevated robots were caused by remapping the keys that lived on the dead host. After all, reloading data from the databases can be expensive: if the remaining memcached hosts have to absorb more database reloads than they can handle, requests to them may start throwing exceptions. But that should only happen when some memcached hosts are near their capacity limit, whereas the elevated robots showed up even during off-peak hours. There had to be something else.
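To see why remapping by itself shouldn’t hurt this much, here is a toy consistent-hash ring in plain Ruby (an illustration, not our client’s actual implementation): removing one host out of twenty remaps only roughly one twentieth of the keys.

```ruby
require 'digest'

# Toy consistent-hash ring, for illustration only.
class Ring
  def initialize(hosts, replicas = 100)
    @points = {}
    hosts.each do |h|
      replicas.times { |i| @points[Digest::MD5.hexdigest("#{h}:#{i}").to_i(16)] = h }
    end
    @sorted = @points.keys.sort
  end

  def host_for(key)
    k = Digest::MD5.hexdigest(key).to_i(16)
    point = @sorted.find { |p| p >= k } || @sorted.first
    @points[point]
  end
end

hosts  = (1..20).map { |i| "cache#{i}" }
before = Ring.new(hosts)
after  = Ring.new(hosts - ["cache7"])   # one host dies

keys  = (1..10_000).map { |i| "key#{i}" }
moved = keys.count { |k| before.host_for(k) != after.host_for(k) }
puts "#{moved} of #{keys.size} keys remapped"   # roughly 1/20 of the keys
```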

A closer look at the source of those exceptions surprised us: they came not from requests sent to the other, healthy memcached hosts, but from requests sent to the dead host! Why did the clients keep sending requests to the dead host?

This is related to the “auto_eject_hosts” option. Its purpose is to let the client temporarily eject dead hosts from the pool: a host is marked as dead once the client sees a certain number of consecutive failures against it, and the ejected server is retried after a retry timeout. Because of generally unpredictable conditions, such as network blips, hardware failures, and timeouts caused by other jobs on the boxes, requests sent to perfectly healthy memcached hosts can fail sporadically. The retry timeout is therefore set to a very low value, to avoid unnecessarily remapping a large number of keys.
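Conceptually, the per-host bookkeeping works something like the following sketch. The real logic lives inside the client library, and the limits shown are hypothetical:

```ruby
# Simplified sketch of auto_eject_hosts bookkeeping, one instance per host.
class HostState
  FAILURE_LIMIT = 2    # consecutive failures before ejecting (hypothetical value)
  RETRY_TIMEOUT = 30   # seconds the host stays out of the ring (hypothetical value)

  def initialize
    @failures = 0
    @ejected_until = nil
  end

  def ejected?
    @ejected_until && Time.now < @ejected_until
  end

  def record_failure
    @failures += 1
    if @failures >= FAILURE_LIMIT
      @ejected_until = Time.now + RETRY_TIMEOUT   # keys remap to other hosts while ejected
      @failures = 0
    end
  end

  def record_success
    @failures = 0
    @ejected_until = nil
  end
end
```

With a low RETRY_TIMEOUT, a host that is genuinely dead keeps re-entering the ring, failing, and being ejected again, over and over.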

When a memcached host is really dead, however, this frequent retry is undesirable. The client has to establish a connection to the dead host again, and it usually gets one of the following exceptions: Memcached::ATimeoutOccurred, Memcached::UnknownReadFailure, Memcached::SystemError (“Connection refused”), or Memcached::ServerIsMarkedDead. Unfortunately, a client does not share the “this dead host is still dead” information with other clients, so every client keeps retrying the dead host and hitting those exceptions at a very high frequency.

The problem is not difficult to fix once we have a better understanding of it: simply retrying a memcached request once or twice on those exceptions usually works. (The memcached client documents the full list of runtime exceptions it can raise.) Ideally, a memcached client should have some nice built-in mechanisms (e.g. exponential backoff) for retrying some of these exceptions, and optionally log what happened. It shouldn’t transparently swallow all exceptions, which would leave users with no visibility into what’s going on.
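The fix amounts to a small retry wrapper around cache calls. Here is a hedged sketch of the idea (with_cache_retries, the backoff values, and the single-host client are illustrative, not the exact code we shipped):

```ruby
require 'memcached'

CACHE = Memcached.new(["cache1.example.com:11211"], :auto_eject_hosts => true)

# The exceptions a dead host typically produces.
RETRIABLE = [
  Memcached::ATimeoutOccurred,
  Memcached::UnknownReadFailure,
  Memcached::SystemError,
  Memcached::ServerIsMarkedDead
]

# Retry a cache operation a couple of times on those exceptions. After the dead
# host is ejected again, the retried request hashes to a healthy host.
def with_cache_retries(limit = 2)
  attempts = 0
  begin
    yield
  rescue *RETRIABLE
    attempts += 1
    raise if attempts > limit
    sleep(0.01 * (2 ** attempts))   # small, optional exponential backoff
    retry
  end
end

timeline = with_cache_retries { CACHE.get("timeline:12345") }
```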

After we deployed the fix, we no longer see elevated robots when a memcached host dies. Memcached SPOF mystery solved!

P.S. Two option parameters, “exception_retry_limit” and “exceptions_to_retry”, have been added to the memcached client.
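With those options the retry can live inside the client itself. A hypothetical configuration might look like this (host name and limit are illustrative):

```ruby
require 'memcached'

CACHE = Memcached.new(["cache1.example.com:11211"],
  :distribution          => :consistent_ketama,
  :auto_eject_hosts      => true,
  :exception_retry_limit => 2,   # retry a failed operation up to twice
  :exceptions_to_retry   => [    # only these exceptions trigger a retry
    Memcached::ATimeoutOccurred,
    Memcached::UnknownReadFailure,
    Memcached::SystemError,
    Memcached::ServerIsMarkedDead
  ]
)
```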

@wanliyang