In March we announced the Parquet project, the result of a collaboration between Twitter and Cloudera intended to create an open-source columnar storage format library for Apache Hadoop.
We’re happy to release Parquet 1.0.0, more at: https://t.co/xKilQU22a5 90+ merged pull requests since announcement: https://t.co/lrKdrNiUQA— Parquet Format ( @ParquetFormat) July 30, 2013
Today, we’re happy to tell you about a significant Parquet milestone: a 1.0 release, which includes major features and improvements made since the initial announcement. But first, we’ll revisit why columnar storage is so important for the Hadoop ecosystem.
Parquet is an open-source columnar storage format for Hadoop. Its goal is to provide a state of the art columnar storage layer that can be taken advantage of by existing Hadoop frameworks, and can enable a new generation of Hadoop data processing architectures such as Impala, Drill, and parts of the Hive ‘Stinger’ initiative. Parquet does not tie its users to any existing processing framework or serialization library.
The idea behind columnar storage is simple: instead of storing millions of records row by row (employee name, employee age, employee address, employee salary…) store the records column by column (all the names, all the ages, all the addresses, all the salaries). This reorganization provides significant benefits for analytical processing:
These effects combine to make columnar storage a very attractive option for analytical processing.
Implementing a columnar storage format that can be used by the many various Hadoop-based processing engines is tricky. Not all data people store in Hadoop is a simple table — complex nested structures abound. For example, one of Twitter’s common internal datasets has a schema nested seven levels deep, with over 80 leaf nodes. Slicing such objects into a columnar structure is non-trivial, and we chose to use the approach described by Google engineers in their paper Dremel: Interactive Analysis of Web-Scale Datasets. Another complexity derives from the fact that we want it to be relatively easy to plug new processing frameworks into Parquet.
Despite these challenges, we are pleased with the results so far: our approach has resulted in integration with Hive, Pig, Cascading, Impala, and an in-progress implementation with Drill, and is currently in production at Twitter.
Parquet 1.0 is available for download on GitHub and via Maven Central. It provides the following features:
Eighteen contributors from multiple organizations (Twitter, Cloudera, Criteo, UC Berkeley AMPLab, Stripe and others) contributed to this release.
When we announced Parquet, we encouraged the greater Hadoop community to contribute to the design and implementation of the format. Parquet 1.0 features many contributions from the community as well as the initial core team of committers; we will highlight two of these improvements below.
The ability to efficiently encode columns in which the number of unique values is fairly small (10s of thousands) can lead to a significant compression and processing speed boost. Nong Li and Marcel Kornacker of Cloudera teamed up with Julien Le Dem of Twitter to define a dictionary encoding specification, and implemented it both in Java and C++ (for Impala). Parquet’s dictionary encoding is automatic, so users do not have to specify it — Parquet will dynamically turn it on and off as applicable, given the data it is compressing.
Hybrid bit packing and RLE encoding
Columns of numerical values can often be efficiently stored using two approaches: bit packing and run-length encoding (RLE):
Our hybrid implementation of bit-packing and RLE monitors the data stream, and dynamically switches between the two types of encoding, depending on what gives us the best compression. This is extremely effective for certain kinds of integer data and combines particularly well with dictionary encoding.
One of the major goals of Parquet is to provide a columnar storage format for Hadoop that can be used by many projects and companies, rather than being tied to a specific tool. We are heartened to find that so far, the bet on open-source collaboration and tool independence has paid off. This release includes quality contributions from 18 developers, who are affiliated with a number of different companies and institutions.
We are far from done, of course; there are many improvements we would like to make and features we would like to add. To learn more about these items, you can see our roadmap on the README or look for the “pick me up!” label on GitHub. You can also join our mailing list at email@example.com and tweet at us @ParquetFormat to join the discussion.