Posts from all blogs: hadoop

Discovery and Consumption of Analytics Data at Twitter


Details about how data engineers and scientists discover and consume analytics data at Twitter.

Read more...

Hadoop filesystem at Twitter

Details about our HDFS deployments: HA, Federation, and ViewFs.

Read more...

Graduating Apache Parquet

Apache Parquet, a columnar storage format for Hadoop, is graduating from the Apache incubator.

Read more...

Scalding 0.9: Get it while it’s hot!

It’s been just over two years since we open sourced Scalding, and today we are very excited to release version 0.9. Scalding at Twitter powers everything from internal- and external-facing dashboards to custom relevance and ad-targeting algorithms, including many graph algorithms such as PageRank and approximate user cosine similarity.
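To make the graph-algorithm mention concrete, here is a minimal plain-Python power-iteration sketch of PageRank over a hypothetical three-node graph (this is an illustration, not Scalding code):

```python
# Hypothetical tiny link graph: node -> list of outbound links.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
d = 0.85  # damping factor

# Start with a uniform rank distribution.
ranks = {n: 1.0 / len(graph) for n in graph}

# Power iteration: each node splits its rank evenly among its out-links.
for _ in range(50):
    new = {n: (1 - d) / len(graph) for n in graph}
    for node, outs in graph.items():
        share = ranks[node] / len(outs)
        for dest in outs:
            new[dest] += d * share
    ranks = new

# Ranks sum to 1.0; "c" receives the most links, so it ranks highest.
```

In Scalding, the same iteration would be expressed as joins and group-bys over edge and rank pipes, distributed across a Hadoop cluster rather than looping in memory.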

A broad range of new features has been added to Scalding since the last release:

Read more...

Dremel made simple with Parquet

Columnar storage is a popular technique to optimize analytical workloads in parallel RDBMSs. The performance and compression benefits of storing and processing large amounts of data this way are well documented in academic literature as well as in several commercial analytical databases.
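To make the columnar idea concrete, here is a minimal plain-Python sketch (not Parquet’s actual encoding) showing how a column-oriented layout groups similar values together, so that a simple scheme like run-length encoding compresses them well:

```python
# Hypothetical rows of a small analytics table.
rows = [
    {"country": "US", "clicks": 3},
    {"country": "US", "clicks": 7},
    {"country": "FR", "clicks": 1},
]

# Row-oriented layout interleaves values of different columns.
row_layout = [v for r in rows for v in r.values()]

# Column-oriented layout stores each column contiguously.
columns = {name: [r[name] for r in rows] for name in rows[0]}

def rle(values):
    """Run-length encode a list into [(value, run_length), ...]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

# Repeated values in a column collapse into short runs:
# rle(columns["country"]) == [("US", 2), ("FR", 1)]
```

Real Parquet combines encodings like this with Dremel-style repetition and definition levels to handle nested schemas; the sketch only shows the flat case.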

Read more...

Streaming MapReduce with Summingbird

Today we are open sourcing Summingbird on GitHub under the ALv2.

Read more...

Announcing Parquet 1.0: Columnar Storage for Hadoop

In March we announced the Parquet project, the result of a collaboration between Twitter and Cloudera intended to create an open-source columnar storage format library for Apache Hadoop.

Read more...

hRaven and the @HadoopSummit

Today marks the start of the Hadoop Summit, and we are thrilled to be a part of it. A few of our engineers will be participating in talks about our Hadoop usage at the summit:

Read more...

Dimension Independent Similarity Computation (DISCO)

MapReduce is a programming model for processing large data sets, typically used for distributed computing on clusters of commodity computers. With a large amount of processing power at hand, it’s very tempting to solve problems by brute force. However, we often combine clever sampling techniques with the power of MapReduce to extend its utility.
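As a rough illustration of the model, here is a toy in-memory sketch (the helper names are ours, and this is not Hadoop itself): mappers emit (key, value) pairs, a shuffle groups them by key, and reducers aggregate each group. Word count is the canonical example:

```python
from collections import defaultdict

def map_phase(records, mapper):
    # Apply the mapper to each record, emitting (key, value) pairs.
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    # Group all values by key, as the framework's shuffle phase would.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped, reducer):
    # Aggregate each key's values with the reducer.
    return {key: reducer(key, values) for key, values in grouped.items()}

# Classic word count over a tiny data set.
records = ["a b a", "b c"]
counts = reduce_phase(
    shuffle(map_phase(records, lambda line: [(w, 1) for w in line.split()])),
    lambda key, values: sum(values),
)
# counts == {"a": 2, "b": 2, "c": 1}
```

Sampling-based approaches like DISCO fit into this picture by having mappers probabilistically choose which pairs to emit, bounding the shuffle size instead of brute-forcing every combination.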

Read more...

Visualizing Hadoop with HDFS-DU

We are heavy adopters of Apache Hadoop, with large data sets residing in its clusters, so it’s important for us to understand how those resources are utilized. At our July Hack Week, we experimented with developing HDFS-DU to give us an interactive visualization of the underlying Hadoop Distributed File System (HDFS).

Read more...