Twitter at the Hadoop Summit

Wednesday, 13 June 2012

Apache Hadoop is a fundamental part of Twitter’s infrastructure. The massive computational and storage capacity it provides is invaluable for analyzing our data sets, continuously improving the user experience, and powering features such as “who to follow” recommendations, tailored follow suggestions for new users, and “best of Twitter” emails. We have developed and open-sourced a number of technologies, including the recent Elephant Twin project, that help our engineers be productive with Hadoop. We will be talking about some of them at the Hadoop Summit this week:

Real-time analytics with Storm and Hadoop (@nathanmarz)
Storm is a distributed and fault-tolerant real-time computation system, doing for real-time computation what Hadoop did for batch computation. Storm can be used together with Hadoop to make a potent real-time analytics stack; Nathan will discuss how we’ve combined the two technologies at Twitter to do complex analytics in real time.

Training a Smarter Pig: Large-Scale Machine Learning at Twitter (@lintool)
We’ll present a case study of Twitter’s integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform. In our deployed solution, common machine learning tasks such as data sampling, feature generation, training, and testing can be accomplished directly in Pig, via carefully crafted loaders, storage functions, and user-defined functions. This means that machine learning is just another Pig script, which allows seamless integration with existing infrastructure for data management, scheduling, and monitoring in a production environment. This talk is based on a paper we presented at SIGMOD 2012.

Scalding: Twitter’s new DSL for Hadoop (@posco)
Hadoop uses a functional programming model to represent large-scale distributed computation, which makes Scala a very natural match for it. We will present Scalding, which is built on top of Cascading. Scalding offers an API very similar to Scala’s collection API, so users can write jobs much as they would write local code and then run those jobs at scale. This talk will present the Scalding DSL and show some example jobs for common use cases.
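
To make the "functional programming model" concrete, here is a minimal local sketch of a word count written as map, group, and reduce steps. It is plain Python and does not use Scalding, Cascading, or Hadoop; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. Scalding itself is a Scala library, so treat this as a picture of the shape of such a job rather than of its actual API.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    return [(word, 1) for line in lines for word in line.lower().split()]


def shuffle_phase(pairs):
    """Group the emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Fold each key's list of values into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}


def word_count(lines):
    """Compose the three phases into one local 'job'."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    for word, count in sorted(word_count(["hello hadoop", "hello scalding"]).items()):
        print(word, count)
```

Because every step is a pure function over a collection, the same pipeline can in principle be split across many machines, which is the property a collection-style DSL takes advantage of.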

Hadoop and Vertica: The Data Analytics Platform at Twitter (@billgraham)
Our data analytics platform uses a number of technologies, including Hadoop, Pig, Vertica, MySQL and ZooKeeper, to process hundreds of terabytes of data per day. Hadoop and Vertica are key components of the platform. The two systems are complementary, but their inherent differences create integration challenges. This talk is an overview of the overall system architecture focusing on integration details, job coordination and resource management.

Flexible In-Situ Indexing for Hadoop via Elephant Twin (@squarecog)
Hadoop workloads can be broadly divided into two types: large aggregation queries that involve scans through massive amounts of data, and selective “needle in a haystack” queries that significantly restrict the number of records under consideration. Secondary indexes can greatly increase processing speed for queries of the second type. We will present Elephant Twin, Twitter’s generic, extensible in-situ indexing framework, which was just open-sourced. Unlike “trojan layouts,” it requires no data copying, and unlike Hive, its integration at the Hadoop API level means that all layers in the stack above can benefit from indexes.
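
The difference between the two query types can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Elephant Twin's API: the data is a list of in-memory "blocks", and `build_index`, `full_scan`, and `indexed_lookup` are hypothetical names. The index records which blocks contain each value of a field, so a selective query reads only those blocks while the data itself stays where it is.

```python
from collections import defaultdict


def build_index(blocks, field):
    """Map each value of `field` to the ids of the blocks that contain it."""
    index = defaultdict(set)
    for block_id, records in enumerate(blocks):
        for record in records:
            index[record[field]].add(block_id)
    return index


def full_scan(blocks, field, value):
    """Read every block; return the matches and the number of blocks read."""
    hits = [r for records in blocks for r in records if r[field] == value]
    return hits, len(blocks)


def indexed_lookup(blocks, index, field, value):
    """Read only the blocks the index points at; return matches and blocks read."""
    block_ids = sorted(index.get(value, ()))
    hits = [r for block_id in block_ids for r in blocks[block_id] if r[field] == value]
    return hits, len(block_ids)


if __name__ == "__main__":
    blocks = [
        [{"user": "alice", "n": 1}, {"user": "bob", "n": 2}],
        [{"user": "carol", "n": 3}],
        [{"user": "bob", "n": 4}, {"user": "dave", "n": 5}],
    ]
    index = build_index(blocks, "user")
    print(full_scan(blocks, "user", "carol"))
    print(indexed_lookup(blocks, index, "user", "carol"))
```

Both functions return the same records; the indexed version simply touches fewer blocks, which is where the speedup for selective queries comes from.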

As you can tell, our uses of Hadoop are wide and varied. We are looking forward to exchanging notes with other practitioners and learning about upcoming developments in the Hadoop ecosystem. We hope to see you there, and if this sort of thing gets you excited, reach out to us: we are hiring!

- Dmitriy Ryaboy, Engineering Manager, Analytics (@squarecog)