ORC vs Parquet - what should I use?
Parquet is a good choice in general. ORC can be better if you are only working with Apache Hive.
ORC and Parquet are both columnar data formats suited to analytical workloads. ORC is closely tied to Apache Hive, and supports ACID transactions (INSERT, UPDATE, and DELETE) when the files are managed by Hive. Parquet works with many kinds of data systems, but requires an extension like Delta Lake to perform ACID transactions.
ORC file format
The ORC (Optimized Row Columnar) file format was built for Apache Hive as a successor to the RCFile format.
ORC files store rows of data in a columnar layout. Rows are organized into stripes of 250MB by default, and each stripe contains index data, the row data itself, and a stripe footer.

ORC file layout image courtesy of Apache ORC.
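If you want to poke at this layout yourself, pyarrow ships an ORC reader that exposes stripes directly. A minimal sketch, assuming pyarrow is installed and using a made-up filename example.orc:

import pyarrow as pa
import pyarrow.orc as orc

# Write a tiny table to ORC so the example is self-contained.
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
orc.write_table(table, "example.orc")

# Open the file and inspect its stripe-level structure.
f = orc.ORCFile("example.orc")
print(f.nstripes)        # number of stripes in the file
print(f.read_stripe(0))  # read one stripe back into memory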
Stripes are compressed individually using Snappy, Zlib, or no compression at all. Because each stripe is compressed independently, a reader can skip decompressing the stripes a query does not touch.
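As a rough sketch of choosing a codec at write time, pyarrow's ORC writer takes a compression parameter (the filename below is invented):

import pyarrow as pa
import pyarrow.orc as orc

table = pa.table({"user_id": list(range(1000)), "score": [i * 0.5 for i in range(1000)]})

# Each stripe is compressed independently with the chosen codec,
# which is what lets readers skip stripes they do not need.
orc.write_table(table, "scores.orc", compression="zlib")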
ORC files support ACID transactions when used with Apache Hive. This means you can perform operations like INSERT, UPDATE, and DELETE.
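As a hedged illustration, those statements can be issued from Python with the PyHive client; the host and table names below are assumptions, and the table must have been created as a transactional Hive table:

from pyhive import hive

# Connect to a HiveServer2 instance (host and table are hypothetical).
conn = hive.connect(host="hive.example.com", port=10000)
cursor = conn.cursor()

# These statements only succeed against a transactional (ACID) ORC table.
cursor.execute("UPDATE events SET status = 'done' WHERE id = 42")
cursor.execute("DELETE FROM events WHERE status = 'stale'")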
Parquet file format
Parquet is a columnar file format that is partly based on Google's Dremel paper. Dremel is the query engine that BigQuery is built on.
Parquet specifically uses the record shredding and assembly algorithm described in the Dremel paper.
Parquet files can be read and written by many systems, including Apache Hive, ClickHouse, pandas, Apache Spark, and Online Parquet Viewers ;).
Parquet files start and end with the magic bytes PAR1 so that they can be identified as Parquet files. The data is split into row groups, and each row group stores a chunk of every column. Metadata is stored in a footer at the end of the file.

Parquet file format image courtesy of Apache.
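You can see this layout with pyarrow's Parquet reader; a minimal sketch with a made-up filename:

import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny table so the example is self-contained.
table = pa.table({"id": [1, 2, 3], "city": ["Oslo", "Lima", "Pune"]})
pq.write_table(table, "example.parquet")

# The first four bytes are the PAR1 magic bytes.
with open("example.parquet", "rb") as fh:
    print(fh.read(4))  # b'PAR1'

# The footer metadata describes row groups and their column chunks.
pf = pq.ParquetFile("example.parquet")
print(pf.metadata.num_row_groups)          # row groups in the file
print(pf.metadata.row_group(0).column(0))  # one column chunk's metadata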
Parquet files on their own do not support ACID transactions, so there is no built-in INSERT, UPDATE, or DELETE. To get ACID transactions you will need to use an extension like Delta Lake.
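As one sketch of that, the deltalake package (the Python bindings for delta-rs) keeps a transaction log next to ordinary Parquet data files; the directory name below is invented:

import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

table = pa.table({"id": [1, 2], "status": ["new", "new"]})

# Each write becomes an atomic, versioned commit in the Delta log;
# the data itself is still stored as plain Parquet files.
write_deltalake("./events_delta", table)
write_deltalake("./events_delta", table, mode="append")

dt = DeltaTable("./events_delta")
print(dt.version())  # current table version
print(dt.files())    # Parquet files backing that version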
Parquet files can be compressed with the snappy (default), gzip, or brotli codecs, or written with no compression at all.
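A quick sketch of picking a codec with pyarrow (the filenames are invented):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(10000)), "payload": ["x" * 50] * 10000})

# Same data, three codecs; compression is set per file (or per column).
pq.write_table(table, "data_snappy.parquet")                    # snappy is the default
pq.write_table(table, "data_gzip.parquet", compression="gzip")
pq.write_table(table, "data_none.parquet", compression="none")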