Scalding

Wednesday, 11 January 2012

Today, we’re excited to open source Scalding, a Scala API for Cascading. Cascading is a thin Java library and API that sits on top of Apache Hadoop’s MapReduce layer. Scalding has two main components:

  • A DSL that makes MapReduce computations look very similar to Scala’s collection API
  • A wrapper for Cascading that makes it simpler to define jobs, write tests, and describe data sources on a Hadoop Distributed File System (HDFS) or local disk

Hadoop, of course, is a distributed system for dealing with large data sets. Within Twitter, we use Scalding to query large data sets like Tweets on HDFS. For example, if we wanted to find out the number of times that each URL is tweeted in a given day, we’d create a Scalding query:

  StatusSource()
    .flatMapTo('created_date, 'url) { s =>
      for (url <- urls(s.getText))
        yield (RichDate(s.getCreatedAt).toString(DATE_WITH_DASH), url)
    }
    .groupBy('created_date, 'url) {
      _.size('urlCnt) // Count the number of appearances of the URL
    }
    .write(Tsv(args("output")))
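
To illustrate how closely the query above mirrors Scala’s collection API, here is a minimal sketch of the same computation over an in-memory sequence, using only the standard library. The Status case class, the regex-based urls helper, and the UrlCounts object are hypothetical stand-ins introduced for this example; they are not Scalding or Twitter types.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hypothetical stand-in for a Tweet record; not a Scalding or Twitter type.
final case class Status(text: String, createdAt: Instant)

object UrlCounts {
  // Formats a timestamp as yyyy-MM-dd in UTC.
  private val dayFormat =
    DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC)

  // Naive URL matcher, assumed for illustration only.
  private val urlPattern = """https?://\S+""".r

  def urls(text: String): List[String] =
    urlPattern.findAllIn(text).toList

  // Same shape as the Scalding query: flatMap to (date, url) pairs,
  // group identical pairs together, then count each group.
  def countByDayAndUrl(statuses: Seq[Status]): Map[(String, String), Int] =
    statuses
      .flatMap { s =>
        for (url <- urls(s.text))
          yield (dayFormat.format(s.createdAt), url)
      }
      .groupBy(identity)
      .map { case (key, hits) => key -> hits.size }
}
```

Calling UrlCounts.countByDayAndUrl on a Seq[Status] returns a map from (date, URL) pairs to counts. The Scalding version keeps the same flatMap, group, and count steps but runs them as MapReduce jobs over data on HDFS rather than in memory.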

Scalding can also express more complex algorithms, such as the toy PageRank implementation included in the examples. Unlike languages such as Apache Pig, which separate the query language from user-defined functionality, Scalding integrates everything into one language. In most cases, one file will describe your job.

If you have any questions, feel free to file an issue on GitHub or follow the project on Twitter via @scalding.