We are open-sourcing Rezolus, our high-resolution systems performance telemetry agent. Rezolus began as an effort to uncover performance anomalies and utilization spikes that were too brief to be captured by our normal observability and metrics systems. It has proven useful for quantifying workload characteristics, providing data to drive optimization efforts, and diagnosing runtime performance issues. We have been running Rezolus in production for over a year and are excited to release it as open-source software.
Rezolus provides a collection of signals to help us make sense of fine-grained runtime behavior. We’ve found it particularly helpful in understanding and optimizing performance. With a single agent, we’re able to get telemetry from a wide range of sources. To our knowledge, no other open-source project offers such comprehensive insight in a single package.
Rezolus was born out of a need to understand systems performance on fine-grained timescales. While running very high-throughput synthetic benchmarks, we found very brief but sometimes significant performance anomalies. Our existing telemetry, which samples minutely, was failing to reflect them: the anomalies were only about 10 seconds long, so they were masked by a sample rate that was too low relative to their duration. This made it difficult to understand what was happening and to tune the system for higher performance.
We have since used Rezolus to get insight into many performance incidents. One time, several services were experiencing repeatedly degraded success rates for a few minutes at a time. These services each found they were being throttled by a backend service. The team responsible for that service didn’t see anything on the existing telemetry that they could use to figure out what was happening during the minutes where throttling was occurring. But knowing that throttling decisions are made on a finer timescale than the default telemetry collection, they began to suspect sub-minutely bursts. By deploying Rezolus to the backend service, we were able to see that even though average request rates weren’t significantly elevated, there were bursts of over 5x the baseline traffic during which CPU utilization was bursting to 100%. We were also able to identify exactly when they happened. With the additional telemetry from Rezolus, we were able to correlate with the backend service logs and determine the source of the spikes.
Another time, we observed elevated tail latency on our memcached servers. With Rezolus, we saw that the sub-minutely CPU utilization was over 2x the baseline CPU utilization and that these CPU usage spikes were always happening at the same time, 55 seconds after the minute. This information made it easy to correlate the problem with a background task that was running 55 seconds after the minute. Eliminating that background task solved the issue.
Why did our normal metrics fail to capture the bursts in these cases? According to the Nyquist-Shannon sampling theorem, we need to sample at least twice within the duration of the shortest burst we want to capture in order to accurately reflect its intensity. Most telemetry produces a minutely time series, which is far too coarse for most spikes experienced in software systems. Rezolus allows a configurable sampling rate, so we can match resolution to spike length without excessive resource consumption. At 10 Hz sampling, samples are 100 milliseconds apart, so we are able to reflect consecutive bursts running 200 milliseconds or more, a resolution good enough for most services at Twitter. The resource footprint of running Rezolus at this setting is small, typically under 15% CPU and 60MB memory.
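To make the arithmetic concrete, here is a minimal Python sketch of the "two samples per burst" rule of thumb. The function name is ours, chosen for illustration, and is not part of Rezolus.

```python
def shortest_resolvable_burst_ms(sample_rate_hz: float) -> float:
    """Shortest burst, in milliseconds, that spans at least two sample intervals."""
    if sample_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    interval_ms = 1000.0 / sample_rate_hz  # time between consecutive samples
    return 2 * interval_ms


if __name__ == "__main__":
    # Compare minutely, secondly, and 10 Hz sampling.
    for label, rate_hz in [("minutely", 1 / 60), ("secondly", 1.0), ("10 Hz", 10.0)]:
        burst_ms = shortest_resolvable_burst_ms(rate_hz)
        print(f"{label}: bursts of about {burst_ms:.0f} ms or longer")
```

At 10 Hz this gives 200 milliseconds, while minutely sampling only resolves bursts that last about two minutes or longer.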
Due to the size of our infrastructure, we collect and store telemetry on a minutely basis. This helps us to keep the cost of telemetry down. But because Rezolus preprocesses samples into histograms, we’re able to set it to sample at a much higher frequency. This allows us to both capture and measure brief bursts and anomalies. Rezolus then exports percentiles from those histograms, which can be collected and aggregated on a minutely basis. Signs of brief anomalies, which might otherwise be masked in a minutely average, are preserved in the tail portion of the percentiles.
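The effect is easy to see with made-up numbers. The Python sketch below is an illustration, not Rezolus code: it assumes one minute of CPU utilization sampled at 10 Hz, a 20% baseline, and a single hypothetical 2-second burst to 100%. The one-bucket-per-percent histogram and the helper names are our own simplifications.

```python
import math


def build_histogram(samples, buckets=101):
    """Count samples into one bucket per whole percent (0..100)."""
    counts = [0] * buckets
    for value in samples:
        index = min(max(int(round(value)), 0), buckets - 1)
        counts[index] += 1
    return counts


def percentile(counts, p):
    """Nearest-rank percentile read back from the bucket counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    target = math.ceil(p * total / 100)
    seen = 0
    for bucket, count in enumerate(counts):
        seen += count
        if seen >= target:
            return bucket
    return len(counts) - 1


# Hypothetical minute: 600 samples at 10 Hz, 20% baseline, 2 s burst to 100%.
samples = [20.0] * 600
samples[300:320] = [100.0] * 20

histogram = build_histogram(samples)
mean = sum(samples) / len(samples)

if __name__ == "__main__":
    print(f"mean={mean:.1f}% p50={percentile(histogram, 50)}% "
          f"p99={percentile(histogram, 99)}%")
```

In this example the minutely mean moves only from 20% to roughly 23%, which is easy to overlook, while the 99th percentile reports the full 100% reached during the burst.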
Rezolus uses plug-in samplers to collect telemetry from a variety of sources. Different samplers can be turned on and off, or configured differently depending on need. One set of system performance samplers reads counters and gauges from Linux kernel sources to get telemetry on CPU usage, network utilization, and disk utilization. Rezolus can also collect data from hardware and software performance counters, which give us more insight into how the CPU is being utilized, for example by measuring cycles per instruction, cache hit rates, and branch predictor performance. In addition, Rezolus supports eBPF (Extended Berkeley Packet Filter) for kernel instrumentation using kprobes and tracepoints, and from this we can get telemetry on scheduler latency, block I/O size distribution, filesystem latency, and more.
Another use case for Rezolus is as a proxy between an application and a traditional metrics collector. In this mode, Rezolus samples application metrics at a high frequency and preprocesses them using histograms. The included Memcache-compatible sampler has proven useful in helping us capture bursts in traffic to Twemcache servers. By pointing our normal metrics collector at the Rezolus endpoint, we were able to start getting additional insight into the traffic patterns to Twemcache without making any code changes to the server. This has exciting potential to be used more broadly with other applications, providing increased insight into service performance without making code changes or needing to collect and aggregate metrics at higher resolution.
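As a rough illustration of what reading a kernel counter involves, the Python sketch below derives CPU utilization from two readings of the aggregate `cpu` line in `/proc/stat` on Linux. It is not how Rezolus implements its samplers; the function names are hypothetical, and only the `/proc/stat` field order is taken from the kernel's documented format.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into (busy, total) ticks."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    ticks = [int(x) for x in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
    # guest and guest_nice are already counted in user and nice, so sum
    # only the first eight fields to avoid double counting.
    total = sum(ticks[:8])
    return total - idle, total


def utilization(prev_line, curr_line):
    """Fraction of CPU time spent busy between two readings (0.0 to 1.0)."""
    prev_busy, prev_total = parse_cpu_line(prev_line)
    curr_busy, curr_total = parse_cpu_line(curr_line)
    elapsed = curr_total - prev_total
    if elapsed <= 0:
        return 0.0
    return (curr_busy - prev_busy) / elapsed


def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat, which holds the aggregate counters."""
    with open(path) as stat_file:
        return stat_file.readline()


def sample_cpu(interval_s=0.1, count=10):
    """Yield utilization once per interval; 0.1 s corresponds to 10 Hz."""
    previous = read_cpu_line()
    for _ in range(count):
        time.sleep(interval_s)
        current = read_cpu_line()
        yield utilization(previous, current)
        previous = current
```

On a Linux host, `list(sample_cpu())` would return ten utilization readings taken roughly 100 milliseconds apart. Because the counters are cumulative, each reading reflects only the interval since the previous one, which is what lets a faster sampling loop expose short bursts.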
For more information on systems performance telemetry, see Julia Evans’s zine about tracing with perf and a blog post about using Rust to write eBPF tracing tools; Brendan Gregg’s writeup on eBPF tracing tools; and the bcc repository, which has additional examples on how you can use eBPF for observability.
We’ve found the combination of the high-resolution sampling and rich telemetry sources in Rezolus to be very powerful. It helps us to better quantify workload characteristics, diagnose runtime performance issues, and provide insights into achieving better infrastructure efficiency.
Open-sourcing Rezolus marks an important milestone for the project. From this point on, we will continue our development of Rezolus on the public GitHub repository. We invite you to check out the project on GitHub and welcome all forms of collaboration, such as issues and pull requests. We hope that Rezolus will be useful to others outside of Twitter, and look forward to building a community around it.