ORC vs Parquet - what should I use?
Parquet is a good choice in general. ORC can be better if you are only working with Apache Hive.
ORC and Parquet are both columnar data formats suited to analytical workloads. ORC works only with Apache Hive, but
supports ACID transactions with operations like UPDATE and
DELETE. Parquet works with many kinds of data systems, but
requires an extension like Delta Lake to perform ACID transactions.
ORC file format
ORC files store row data in a columnar layout. Rows are organized into
stripes of 250MB. Each stripe contains index metadata, the row data,
and a stripe footer.
ORC file layout image courtesy of Apache ORC.
Stripes are compressed individually using a codec such as zlib or
Snappy, or none (no compression). Because each stripe is compressed
independently, decompression can be skipped for stripes that are not
needed during analysis.
ORC files support ACID transactions when used with Apache Hive. This means you can perform operations like UPDATE and DELETE directly on ORC-backed tables.
Parquet file format
Parquet specifically uses the record shredding and assembly algorithm from the Dremel paper.
Parquet file format image courtesy of Apache.
Parquet files do not support ACID transactions like
DELETE. To get ACID transactions you will
need to use an extension like Delta Lake.
Parquet files can be compressed with codecs such as Snappy, gzip, or
Brotli, or left uncompressed.