Graduating Apache Parquet

Tweet

ASF, the Apache Software Foundation, recently announced the graduation of Apache Parquet, a columnar storage format for the Apache Hadoop ecosystem. At Twitter, we’re excited to be a founding member of the project.

Apache Parquet is built to work across programming languages, processing frameworks, data models and query engines including Apache Hive, Apache Drill, Impala and Presto.

At Twitter, Parquet has helped us scale by reducing storage requirements by at least one-third on large datasets, as well as improving scan and deserialization time. This has translated into hardware savings and reduced latency for accessing data. Furthermore, Parquet’s integration with so many tools creates opportunities and flexibility for query engines to help optimize performance.

Since we announced Parquet, these open source communities have integrated the project: Apache Crunch, Apache Drill, Apache Hive, Apache Pig, Apache Spark, Apache Tajo, Kite, Impala, Presto and Scalding.

What’s new?

The Parquet community just released version 1.7.0 with several new features and bug fixes. This update includes:

  • A new filter API for Java and DSL for Scala that uses statistics metadata to filter large batches of records without reading them
  • A memory manager that will scale down memory consumption to help avoid crashes
  • Improved MR and Spark job startup time
  • Better support for evolving schemas with type promotion when reading
  • More logical types for storing dates, times, and more
  • Improved compatibility between Hive, Avro and other object models

As usual, this release also includes many other bug fixes. We’d like to thank the community for reporting these and contributing fixes. Parquet 1.7.0 is now available for download.

Future work

Although Parquet has graduated, there’s still plenty to do, and the Parquet community is planning some major changes to enable even more improvements.

First is updating the internals to work with the zero-copy read path in Hadoop, making reads even faster by not copying data into Parquet’s memory space. This will also enable Parquet to take advantage of HDFS read caching and should pave the way for significant performance improvements.

After moving to zero-copy reads, we plan to add a vectorized read API that will enable processing engines like Drill, Presto and Hive to save time by processing column data in batches before reconstructing records in memory, if at all.

We also plan to add more advanced statistics based record filtering to Parquet. Statistics based record filtering allows us to drop entire batches of data with only reading a small amount of metadata). For example, we’ll take advantage of dictionary encoded columns and apply filters to batches of data by examining a column’s dictionary, and in cases where no dictionary is available, we plan to store a bloom filter in the metadata.

Aside from performance, we’re working on adding POJO support in the Parquet Avro object model that works the same way Avro handles POJOs in avro-reflect. This will make it easier to use existing Java classes that aren’t based on one of the already-supported object models and enable applications that rely on avro-reflect to use Parquet as their data format.

Getting involved

Parquet is an independent open source project at the ASF. To get involved, join the community mailing lists and any of the community hangouts the project holds. We welcome everyone to participate to make Parquet better and look forward to working with you in the open.

Acknowledgements

We would like to thank Ryan Blue from Cloudera for helping craft parts of this post and the wider Parquet community for contributing to the project. Specifically, contributors from a number of organizations (Twitter, Netflix, Criteo, MaPR, Stripe, Cloudera, AmpLab) contributed to this release. We’d also like to thank these people: Daniel Weeks, Zhenxiao Luo, Nezih Yigitbasi, Tongjie Chen, Mickael Lacour, Jacques Nadeau, Jason Altekruse, Parth Chandra, Colin Marc (@colinmarc), Avi Bryant (@avibryant), Ryan Blue (@6d352b5d3028e4b), Marcel Kornacker, Nong Li (@nongli), Tom White (@tom_e_white), Sergio Pena, Matt Massie (@matt_massie), Tianshuo Deng, Julien Le Dem, Alex Levenson, Chris Aniszczyk and Lukas Nalezenec.