How Twitter uses rasdaemon for hardware reliability

Friday, 6 January 2023

At Twitter, our on-premises data centers contain hundreds of thousands of servers and millions of server hardware components. Twitter’s hardware, like all hardware, is subject to a diverse array of failures. Some of these failures are transient or intermittent, which makes debugging and root-cause analysis difficult unless proper monitoring, detection, and handling of hardware faults is in place.

As service-level monitoring has improved, service owners have become more aware of poorly performing machines, but determining which component is at fault is still as complex as ever. As a result, machines with transient hardware faults repeatedly end up in a site operations repair loop with no root cause ever determined. A machine is often reinstalled and returned to service only to fail again shortly thereafter. Additionally, the sprawl of hardware fault detection plugins written by individual teams has complicated and confused matters for service owners seeking reliable and available hardware. For a service owner, a hardware engineer, or a site operations engineer, the number of places one needs to look to find out what went wrong at the hardware level is large and ever changing, which makes diagnosis expensive.

As for our existing hardware health detection utilities, we used mcelog and edac-utils to monitor correctable and uncorrectable memory errors as well as machine check exceptions on bare metal hosts. However, recent changes in the Linux kernel have made some of the metrics exported by these utilities for machine check and memory error handling and monitoring less reliable. mcelog is now deprecated, and edac-utils is, for the most part, not maintained.

Hence, we decided as a team to provide a comprehensive, clear, and centralized way to monitor and handle hardware failures, so that service owners can remove themselves from the hardware detection and repair path, and site operations teams can quickly and easily identify a failure and take effective serviceability actions. Ensuring reliability, availability, and serviceability is essential to providing a seamless experience for the various services that run on these servers.

For this, we leveraged rasdaemon, a standard open source Linux utility that provides vastly improved reliability, availability, and serviceability (RAS) capabilities for hardware. The long-term goal is for it to be the one-stop tool to collect, filter, and report all hardware error events reported by the Linux kernel. Adopting it also helps us remove blockers to our CentOS 8/CentOS 9 migration, increase the fidelity of hardware monitoring signals, and reduce the overhead across teams generated by inaccurate, non-actionable failure detection. Where appropriate, we also used page offlining instead of taking the entire server out of production, which saved us money.
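To make the page-offlining idea concrete, here is a minimal sketch of one possible threshold policy: count corrected memory errors per physical page over a rolling window, and flag the page for offlining once the count crosses a threshold. The class name, the threshold of 50 errors, the 24-hour window, and the 4 KiB page size are all illustrative assumptions. This is not Twitter's production configuration, and it is not how rasdaemon implements the feature internally.

```python
from collections import defaultdict, deque

PAGE_SHIFT = 12  # assumption: 4 KiB pages


class PageOfflinePolicy:
    """Hypothetical policy: flag a page once it sees too many corrected errors."""

    def __init__(self, threshold=50, window_seconds=24 * 3600):
        self.threshold = threshold            # assumed value, tune per fleet
        self.window_seconds = window_seconds  # assumed rolling window
        self._events = defaultdict(deque)     # page frame number -> timestamps
        self.offlined = set()                 # pages already flagged

    def record_corrected_error(self, phys_addr, timestamp):
        """Record one corrected error.

        Returns True exactly once per page, at the moment the page
        crosses the threshold and should be offlined.
        """
        pfn = phys_addr >> PAGE_SHIFT
        if pfn in self.offlined:
            return False

        events = self._events[pfn]
        events.append(timestamp)

        # Drop errors that have fallen out of the rolling window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()

        if len(events) >= self.threshold:
            self.offlined.add(pfn)
            del self._events[pfn]
            return True
        return False
```

A caller would feed this with the physical addresses from corrected memory error events and act on a `True` result. On Linux, one way a privileged process can carry out the action is the kernel's soft-offline sysfs interface (`/sys/devices/system/memory/soft_offline_page`). The point of the design is that a single bad page can be retired while the rest of the server keeps serving traffic.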

Events now covered by rasdaemon include:

- MC (Memory Controller) events
  - Corrected, uncorrected, and fatal errors are counted and exposed in detail.
- MCE (Machine Check Exception) events across a variety of platform types
  - This replaces mcelog for collecting and exposing hardware failures detected and reported by the CPU.
- Disk errors
  - Block errors reported:
    - EOPNOTSUPP, "operation not supported error"
    - ETIMEDOUT, "timeout error"
    - ENOSPC, "critical space allocation error"
    - ENOLINK, "recoverable transport error"
    - EREMOTEIO, "critical target error"
    - EBADE, "critical nexus error"
    - ENODATA, "critical medium error"
    - EILSEQ, "protection error"
    - ENOMEM, "kernel resource error"
    - EBUSY, "device resource error"
    - EAGAIN, "nonblocking retry error"
    - EREMCHG, "dm internal retry error"
    - EIO, "I/O error"
- Devlink errors
- PCIe AER (Advanced Error Reporting) events
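As an illustration of how the block error list above might be consumed downstream, the following sketch tallies block errors per device and separates out the ones whose description is marked "critical". The errno names and descriptions are copied from the list. The input shape (an iterable of `(device, errno_name)` pairs), the function names, and the "critical" prefix heuristic are assumptions made for the example; they are not rasdaemon's output format or API.

```python
from collections import Counter

# errno name -> description, copied from the block error list in this post.
BLOCK_ERRORS = {
    "EOPNOTSUPP": "operation not supported error",
    "ETIMEDOUT": "timeout error",
    "ENOSPC": "critical space allocation error",
    "ENOLINK": "recoverable transport error",
    "EREMOTEIO": "critical target error",
    "EBADE": "critical nexus error",
    "ENODATA": "critical medium error",
    "EILSEQ": "protection error",
    "ENOMEM": "kernel resource error",
    "EBUSY": "device resource error",
    "EAGAIN": "nonblocking retry error",
    "EREMCHG": "dm internal retry error",
    "EIO": "I/O error",
}


def is_critical(errno_name):
    """Heuristic (assumption): treat descriptions starting with 'critical' as critical."""
    return BLOCK_ERRORS.get(errno_name, "").startswith("critical")


def summarize_block_errors(events):
    """Count errors per (device, errno name).

    `events` is a hypothetical iterable of (device, errno_name) pairs.
    Unrecognized errno names are grouped under "UNKNOWN".
    """
    counts = Counter()
    for device, errno_name in events:
        key = errno_name if errno_name in BLOCK_ERRORS else "UNKNOWN"
        counts[(device, key)] += 1
    return counts


def devices_with_critical_errors(events):
    """Return the sorted list of devices that reported at least one critical error."""
    return sorted({dev for dev, name in events if is_critical(name)})
```

For example, `devices_with_critical_errors([("sda", "EIO"), ("sdb", "ENODATA")])` returns `["sdb"]`, since only the medium error carries the "critical" label in the list above.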

Some of these, such as disk errors, supplement other metrics, such as S.M.A.R.T. data, which also provides useful signals for disk wear and failure. Others, like MCE and MC events, replace our current plugins, because the underlying methods of exposing these errors have become unreliable due to changes in the kernel.

During the rollout, we overcame several challenges. Extensive company-wide communication was essential to ensure the safety of the migration. We followed the “make before break” model to ensure feature parity before disabling other plugins and migrating to rasdaemon. We searched our codebase extensively for every mention of edac-utils and mcelog to make sure the migration accounted for all of them. We reviewed every dashboard of every service to ensure we were not breaking anything, since observability is essential to hardware reliability. We also canaried extensively and performed a slow rollout across our fleet. As a result, rasdaemon has reduced mean time to detect (MTTD) and mean time to repair (MTTR) for our servers. Our experience with rasdaemon has been very good, and we recommend it to the wider industry.

We’d like to thank Anchal Agarwal, Aras Saulys, Carlos Rios, David Johansen, Morgan Horst, Thomas David Mackey, and many others who contributed to this work in different aspects.
