Faster Ruby: Kiji Update

Thursday, 19 May 2011

In March 2011, we shared Kiji, an improved Ruby runtime. The initial performance gains were relatively modest, but laid the foundation for future improvements. We continued the work and now have some excellent results.

FASTER REMEMBERED SET CALCULATIONS

In Kiji 0.10, every change to the longlife heap required full recalculation of the “remembered set,” the boundary objects referenced from the longlife to the eden heap. For Kiji 0.11, we changed the calculation to an incremental model that only includes newly-allocated objects.

We made this easier by disabling garbage collection during source code parsing, which has a tendency to mutate references in place. Now, if the parser needs more memory, it merely allocates a new heap chunk. This lets us allocate all AST nodes, including those created in instance_eval, on the longlife heap. The result is a big performance boost for applications like template engines that use lots of instance_eval.

MORE OBJECTS IN LONGLIFE

For Kiji 0.11, we now allocate non-transient strings in the longlife heap, along with the AST nodes. This includes strings allocated during parsing, assigned to constants (or members of a constant hash or array), and those that are members of frozen objects. With Ruby’s Kernel.freeze method, big parts of frozen objects are now evicted from the ordinary heap and moved to the longlife heap.

This change is significant. When the twitter.com web application ran Kiji 0.10, it had 450,000 live objects after garbage collection in its ordinary heap. Kiji 0.11 places over 300,000 string objects in the longlife heap, reducing the number of live objects in the ordinary heap to under 150,000. The nearly 66 percent reduction allows the heap to collect much less frequently.

SIMPLIFIED HEAP GROWTH STRATEGY

Ruby Enterprise Edition has a set of environment variables that govern when to run the garbage collector and how to grow and shrink the heaps. After evaluating Ruby’s heap growth strategy, we replaced it with one that is much simpler to configure and works better for server workloads.

As a first step, we eliminated GC_MALLOC_LIMIT. This environment variable prescribes when to force a garbage collection, following a set of C-level malloc() calls. We found this setting to be capricious; it performed best when it was set so high as to be effectively off. By eliminating the malloc limit entirely, the Kiji 0.11 garbage collector runs only when heaps are full, or when no more memory can be allocated from the operating system. This also means that under UNIX-like systems, you can more effectively size the process with ulimit -u.

0.11 now has only these three GC-tuning environment variables:

The first parameter is RUBY_GC_HEAP_SIZE. This parameter determines the number of objects in a heap slab. The value is specified in numbers of objects. Its default value is 32768.
The next parameter is RUBY_GC_EDEN_HEAPS. This parameter specifies the target number of heap slabs for the ordinary heap. Its default value is 24.

The runtime starts out with a single heap slab, and when it fills up, it collects the garbage and allocates a new slab until it reaches the target number. This gradual strategy keeps fragmentation in the heaps low, as it tends to concentrate longer-lived objects in the earlier heap slabs. If the heap is forced to grow beyond the target number of slabs, the runtime releases vacated slabs after each garbage collection in order to restore the target size. Once the application reaches the target size of ordinary heap, it does not go below it.

Since performance is tightly bound to the rate of eden collections (a classic memory for speed tradeoff), this makes the behavior of a long-lived process very predictable. We have had very good results with settings as high as 64.
The final parameter is RUBY_GC_LONGLIFE_LAZINESS, a decimal between 0 and 1, with a default of 0.05. This parameter governs a different heap growth strategy for longlife heap slabs. The runtime releases vacant longlife heap slabs when the ratio of free longlife heap slots to all longlife heap slots after the collection is higher than this parameter. Also, if the ratio is lower after collection, a new heap slab is allocated.
The default value is well-tuned for our typical workload and prevents memory bloat.

We also reversed the order of adding the freed slots onto the free list. Now, new allocations are fulfilled with free slots from older (presumably, more densely-populated) heap slabs first, allowing recently allocated heap slabs to become completely vacant in a subsequent GC run. This may slightly impact locality of reference, but works well for us.

ADDITIONAL CHANGES

We replaced the old profiling methods that no longer applied with our improved memory debugging.

We also removed the “fastmarktable” mode, where the collector used a mark bit in the object slots. Kiji 0.11 uses only the copy-on-write friendly mark table. This lets us reset the mark bits after collection by zeroing out the entire mark table, instead of flipping a bit in every live object.

IT’S IN THE NUMBERS

We updated the performance chart from the first blog post about Kiji with the 0.11 data. As you can see, the new data shows a dramatic improvement for our example intensive workload. While Kiji 0.9 responded to all requests until 90 requests/sec and peaked at 95 responses out of 100 requests/sec, Kiji 0.11 responds to all requests until 120 requests/sec. This is a 30% improvement in throughput across the board, and 2.7x the speed of standard Ruby 1.8.

FULL ALLOCATION TRACING

We found that in order to effectively develop Kiji 0.11, we needed to add more sophisticated memory instrumentation than is currently available for Ruby. As a result, we ended up with some really useful debugging additions that you can turn on as well.

The first tool is a summary of memory stats after GC. It lets you cheaply measure the impact of memory-related changes:

The second tool is an allocation tracer (a replacement for BleakHouse and similar tools). After each GC, the runtime writes files containing full stack traces for the allocation points of all freed and surviving objects. You can easily parse this with AWK to list common object types, allocation sites, and number of objects allocated. This makes it easy to identify allocation hotspots, memory leaks, or objects that persist on the eden and should be manually moved to the longlife.

A sample output for allocation tracing, obtained by running RubySpec under Kiji:

For more information, refer to the README-kiji file in the distribution.

FUTURE DIRECTIONS

0.11 is a much more performant and operable runtime than Kiji 0.10. However, through this work we identified a practical strategy for making an even better, fully-generational version that would apply well to Ruby 1.9. Time will tell if we get to implement it.

We also would like to investigate the relative performance of JRuby.

TRY IT!

We have released the Kiji REE branch on GitHub.

ACKNOWLEDGEMENTS

The following engineers at Twitter contributed to the REE improvements: Rob Benson, Brandon Mitchell, Attila Szegedi, and Evan Weaver.

If you want to work on projects like this, join the flock!

— Attila (@asz)