ORC vs Parquet - what should I use?

ORC vs Parquet

Parquet is a good choice in general. ORC can be better if you are only working with Apache Hive.

ORC and Parquet are both columnar data formats suited to analytical workloads. ORC is closely tied to Apache Hive, but supports ACID operations like INSERT, UPDATE and DELETE. Parquet works with many kinds of data systems, but requires an extension like Delta Lake to perform ACID transactions.

ORC file format

ORC (Optimized Row Columnar) file format was built for Apache Hive as a successor to the RCFile format.

ORC files store rows of data in a columnar layout. Rows are organized into stripes, typically around 250MB each. Each stripe contains index data, row data, and a stripe footer.
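As a quick way to see this structure, here is a minimal sketch using pyarrow's ORC reader (assuming pyarrow is installed; the file name is illustrative):

    import pyarrow.orc as orc

    reader = orc.ORCFile("example.orc")   # illustrative path
    print(reader.nrows)                   # total number of rows in the file
    print(reader.nstripes)                # number of stripes
    stripe = reader.read_stripe(0)        # read the first stripe as a RecordBatch
    print(stripe.schema)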

[Figure: ORC file layout, image courtesy of Apache ORC.]

Stripes are compressed individually using Snappy, Zlib, or no compression at all. Because each stripe is compressed independently, a reader can skip decompressing stripes that the index data shows are not needed for a query.
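For example, here is a hedged sketch of writing an ORC file with a specific codec using pyarrow (the data and file name are illustrative, and the compression argument is assumed to accept the codec names above):

    import pyarrow as pa
    import pyarrow.orc as orc

    table = pa.table({"id": [1, 2, 3], "city": ["Oslo", "Lima", "Pune"]})
    orc.write_table(table, "example.orc", compression="zlib")  # or "snappy", "uncompressed"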

ORC files support ACID transactions when used with Apache Hive. This means you can perform operations like INSERT, UPDATE and DELETE.
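As a rough sketch, assuming a running HiveServer2 and a table created with TBLPROPERTIES ('transactional'='true'), these operations could be issued from Python with PyHive (the host and table names here are hypothetical):

    from pyhive import hive

    conn = hive.connect(host="hive-host")  # hypothetical HiveServer2 host
    cursor = conn.cursor()
    # UPDATE and DELETE only work on transactional (ACID) Hive tables
    cursor.execute("UPDATE events SET status = 'done' WHERE id = 42")
    cursor.execute("DELETE FROM events WHERE status = 'stale'")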

Parquet file format

Parquet is a columnar file format that is partly based on Google's Dremel paper. Dremel is the query engine that BigQuery was based on.

Parquet specifically uses the record shredding and assembly algorithm described in the Dremel paper.
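You can see the shredding in a Parquet file's schema: nested fields are flattened into separate leaf columns. A minimal sketch with pyarrow (the file name is illustrative):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # A column of nested records (structs)
    table = pa.table({"user": [{"name": "ada", "age": 36},
                               {"name": "grace", "age": 45}]})
    pq.write_table(table, "users.parquet")

    # The Parquet schema shows the struct shredded into leaf columns
    print(pq.ParquetFile("users.parquet").schema)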

Parquet files can be read and written by many systems including Apache Hive, ClickHouse, Pandas, Apache Spark and Online Parquet Viewers ;).
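For instance, a Parquet round trip in Pandas is a one-liner each way (assuming pandas with a Parquet engine such as pyarrow installed):

    import pandas as pd

    df = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
    df.to_parquet("example.parquet")
    print(pd.read_parquet("example.parquet"))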

Parquet files start and end with magic bytes PAR1 so that they can be identified as Parquet files. The file is split into row groups, and each row group stores a chunk of each column. Metadata is stored in a footer at the end of the file.
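You can verify the magic bytes yourself with a few lines of Python (the path is illustrative):

    with open("example.parquet", "rb") as f:
        header = f.read(4)       # first 4 bytes of the file
        f.seek(-4, 2)            # jump to 4 bytes before the end
        footer = f.read(4)       # last 4 bytes of the file

    print(header == b"PAR1" and footer == b"PAR1")  # True for a Parquet file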

[Figure: Parquet file layout, image courtesy of Apache.]

Parquet files do not support ACID operations like INSERT, UPDATE or DELETE on their own. To get ACID transactions you will need to use an extension like Delta Lake.
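As a hedged sketch, the deltalake (delta-rs) Python package layers transactional commits on top of Parquet files (assumes pip install deltalake; the path is illustrative):

    import pandas as pd
    from deltalake import DeltaTable, write_deltalake

    df = pd.DataFrame({"id": [1, 2], "status": ["new", "new"]})
    write_deltalake("./events", df)   # a transactional create/append

    dt = DeltaTable("./events")
    print(dt.version())               # each commit increments the table version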

Parquet files can be compressed with the Snappy (default), gzip, or Brotli codecs, or left uncompressed.
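A minimal sketch of picking a codec with pyarrow (the file name is illustrative; the codec names match the options above):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": [1, 2, 3]})
    pq.write_table(table, "example.parquet", compression="gzip")  # default is "snappy"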