It’s been just over two years since we open sourced Scalding and today we are very excited to release the 0.9 version. Scalding at Twitter powers everything from internal and external facing dashboards, to custom relevance and ad targeting algorithms, including many graph algorithms such as PageRank, approximate user cosine similarity and many more.
There have been a wide breadth of new features added to Scalding since the last release:
Columnar storage is a popular technique to optimize analytical workloads in parallel RDBMs. The performance and compression benefits for storing and processing large amounts of data are well documented in academic literature as well as several commercialanalyticaldatabases.
MapReduce is a programming model for processing large data sets, typically used to do distributed computing on clusters of commodity computers. With large amount of processing power at hand, it’s very tempting to solve problems by brute force. However, we often combine clever sampling techniques with the power of MapReduce to extend its utility.
We are a heavy adopter of Apache Hadoop with a large set of data that resides in its clusters, so it’s important for us to understand how these resources are utilized. At our July Hack Week, we experimented with developing HDFS-DU to provide us an interactive visualization of the underlying Hadoop Distributed File System (HDFS).