Hadoop at Twitter

Thursday, 8 April 2010

My name is Kevin Weil and I’m a member of the analytics team at Twitter. We’re collectively responsible for Twitter’s data warehousing, for building out an analysis platform that lets us easily and efficiently run large calculations over the Twitter dataset, and ultimately for turning that data into something actionable that helps the business. For that last part, we’re fortunate to work with great people from teams across the organization. The first two are largely on our plate, though, and as a result we use Hadoop, Pig, and HBase heavily. Today we’re excited to open source some of the core code we rely on for our data analysis.

TL;DR

We’re releasing a whole bunch of code that we use with Hadoop and Pig specifically around LZO and Protocol Buffers. Use it, fork it, improve upon it. http://github.com/kevinweil/elephant-bird

What’s Hadoop? Pig, HBase?

Hadoop is a distributed computing framework with two main components: a distributed file system and a map-reduce implementation. It is a top-level Apache project, and as such it is fully open source and has a vibrant community behind it.

Imagine you have a cluster of 100 computers. Hadoop’s distributed file system makes it so you can put data “into Hadoop” and pretend that all the hard drives on your machines have coalesced into one gigantic drive. Under the hood, it breaks each file you give it into 64- or 128-MB chunks called blocks and sends them to different machines in the cluster, replicating each block three times along the way. Replication ensures that one or even two of your 100 computers can fail simultaneously, and you’ll never lose data. In fact, Hadoop will even realize that two machines have failed and will begin to re-replicate the data, so your application code never has to care about it!
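To make the block-and-replica idea concrete, here is a minimal sketch in Python. It is a toy model, not Hadoop code: the function names, the `node000`-style machine names, and the purely random placement are all assumptions made for illustration (real HDFS placement is rack-aware and more involved).

```python
import random

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, one of the block sizes mentioned above
REPLICATION = 3                 # each block is stored on three machines


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(num_blocks, machines, replication=REPLICATION, seed=0):
    """Assign each block to `replication` distinct machines, chosen at random.

    Hypothetical placement policy, used only to illustrate replication.
    """
    rng = random.Random(seed)
    return {b: rng.sample(machines, replication) for b in range(num_blocks)}


def surviving_blocks(placement, failed):
    """Blocks that still have at least one replica on a healthy machine."""
    failed = set(failed)
    return [b for b, hosts in placement.items() if set(hosts) - failed]


machines = [f"node{i:03d}" for i in range(100)]
blocks = split_into_blocks(1_000_000_000)            # a 1 GB file
placement = place_blocks(len(blocks), machines)
alive = surviving_blocks(placement, failed=["node007", "node042"])
print(len(blocks), len(alive))
```

Because every block lives on three distinct machines, no choice of two failed machines can remove all copies of a block, which is why both printed numbers are the same.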

The second main component of Hadoop is its map-reduce framework, which provides a simple way to break analyses over large sets of data into small chunks which can be done in parallel across your 100 machines. You can read more about it here; it’s quite generic, capable of handling everything from basic analytics through map-tile generation for Google Maps! Hadoop itself is modeled after Google’s proprietary MapReduce system, and it is used at many large companies including Yahoo!, Facebook, and Twitter. We’re happy users of Cloudera’s free Hadoop distribution.
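The shape of a map-reduce job can be sketched in a few lines of single-process Python. This is only an illustration of the map, shuffle, and reduce steps using a word count; the function names are made up for the example, and a real Hadoop job would run the mappers and reducers in parallel on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in order over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = run_job(["hadoop pig hbase", "pig hadoop", "hadoop"])
print(counts)
```

Each call to `map_phase` depends only on its own line, and each call to `reduce_phase` depends only on its own key, which is what lets the framework spread the work across many machines.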

Pig is a dataflow language built on top of Hadoop to simplify and speed up common analysis tasks. Instead of writing map-reduce jobs in Java, you write in a higher-level language called Pig Latin, and a query compiler turns your statements into an ordered sequence of map-reduce jobs. It enables complex map-reduce job flows to be written in a few easy steps.

HBase is a distributed, column-oriented data store built on top of Hadoop and modeled after Google’s Bigtable. It allows for structured data storage combined with low-latency data serving.

How Does Twitter Use Hadoop?

Twitter has large data storage and processing requirements, and thus we have worked to implement a set of optimized data storage and workflow solutions within Hadoop. In particular, we store all of our data LZO compressed, because LZO compression turns out to strike a very good balance between compression ratio and speed for use in Hadoop. Hadoop jobs are generally IO-bound, and typical compression algorithms like gzip or bzip2 are so computationally intensive that jobs quickly become CPU-bound. LZO, in contrast, was built for speed, so you get a 4-5x compression ratio while leaving the CPU available to do real work. For more discussion of LZO, complete with performance comparisons, see this Cloudera blog post we did a while back.

We also make heavy use of Google’s protocol buffers for efficient, extensible, backward-compatible data storage. Hadoop does not mandate any particular format on disk, and common formats like CSV are:

space-inefficient: an integer like 2930533523 takes 10 bytes in ASCII.
untyped: is 2930533523 an int, a long, or a string?
not robust to versioning changes: adding a new field, or removing an old one, requires you to change your code
not hierarchical: you cannot store any nested structure
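The first two points in the list above are easy to check by hand. The short Python sketch below compares the ASCII form of the example integer with two fixed-width binary encodings; the choice of little-endian `struct` formats is just an assumption for the demonstration, not something the code we are releasing uses.

```python
import struct

value = 2930533523

as_text = str(value).encode("ascii")      # how a CSV file stores it
as_uint32 = struct.pack("<I", value)      # fixed-width unsigned 32-bit
as_int64 = struct.pack("<q", value)       # fixed-width signed 64-bit

print(len(as_text), len(as_uint32), len(as_int64))

# The text form does not say which type was intended. This value is larger
# than the signed 32-bit maximum, so a reader that guesses "int" in a
# language with 32-bit signed ints would overflow.
fits_signed_32 = -2**31 <= value <= 2**31 - 1
print(fits_signed_32)
```

The text form needs 10 bytes where an unsigned 32-bit field needs 4, and nothing in the CSV tells the reader that a signed 32-bit type is too small.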

Other solutions like JSON fail fewer of these tests, but protocol buffers retain one key advantage: code generation. You write a quick description of your data structure, and the protobuf library will generate code for working with that data structure in the language of your choice. Because Google designed protobufs for data storage, the serialized format is efficient; integers, for example, are variable-length or zigzag encoded.
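For readers curious about what variable-length and zigzag encoding look like, here is a small pure-Python sketch of both. It follows the publicly documented protobuf wire format but is not the protobuf library itself, and the function names are made up for this example.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint (low group first)."""
    if n < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)          # high bit clear: last byte
            return bytes(out)


def decode_varint(data):
    """Decode a single varint from the start of a bytes object."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")


def zigzag_encode(n):
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)


print(encode_varint(300).hex())
print(len(encode_varint(1)), len(encode_varint(2930533523)))
print([zigzag_encode(n) for n in (0, -1, 1, -2)])
```

Small values cost a single byte, and zigzag encoding keeps small negative numbers small too, rather than letting them expand to the maximum varint length.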

So…

The code we are releasing makes it easy to work with LZO-compressed data of all sorts (JSON, CSV, TSV, line-oriented, and especially protocol buffer-encoded) in Hadoop, Pig, and HBase. It also includes a framework for automatically generating all of this code for protobufs given the protobuf IDL. That is, not only will protoc generate the standard protobuf code for you, it will now generate Hadoop and Pig layers on top of that which you can immediately plug in to start analyzing data written with these formats. Having code automatically generated from a simple data structure definition has helped us move very quickly and make fewer mistakes in our analysis infrastructure at Twitter. You can even hook in and add your own code generators from within the framework. Please do, and submit back! Much more documentation is available at the github page, http://www.github.com/kevinweil/elephant-bird.

Thanks to Dmitriy Ryaboy, Chuang Liu, and Florian Leibert for help developing this framework. We welcome contributions, forks, and pull requests. If working on stuff like this every day sounds interesting, we’re hiring!

—@kevinweil