Delta vs Parquet - what are the differences?
Use Delta if you need to run operations such as CREATE, UPDATE, and DELETE as ACID transactions on your data lake. Use Parquet if you want wider compatibility with data tools.
The main advantage of Delta Lake over Parquet is that you can run operations such as CREATE, UPDATE, and DELETE as ACID transactions. The main disadvantage is that fewer data libraries and tools support Delta Lake than Parquet.
What is Parquet
Parquet is a columnar file format that is useful for data analytics. The Parquet format is partly based on Google's Dremel paper. Dremel is the query engine that BigQuery was based on.
Parquet files can be read by many kinds of data systems, including ClickHouse, pandas, and Apache Spark.
You can open Parquet files without writing code using this Parquet Viewer.
Parquet files are often used in data lakes because they compress well and support predicate pushdown, which lets a query engine skip data that cannot match a filter, so they can be analyzed quickly.
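To make predicate pushdown more concrete, here is a minimal sketch in plain Python. It is not the Parquet reader API: `RowGroup` and `scan_greater_than` are hypothetical names invented for this illustration. It only models the idea that Parquet keeps min/max statistics for chunks of a column, so a reader can skip chunks that cannot contain matching rows.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group holding a single integer column."""
    values: list
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self):
        # Statistics are computed once at write time and stored with the data.
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_greater_than(row_groups, threshold):
    """Return values > threshold and the number of row groups actually read."""
    matches = []
    groups_read = 0
    for group in row_groups:
        if group.max_value <= threshold:
            # The statistics prove nothing here can match, so skip the read.
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


row_groups = [
    RowGroup([1, 5, 9]),
    RowGroup([12, 18, 20]),
    RowGroup([25, 31, 40]),
]
matches, groups_read = scan_greater_than(row_groups, 19)
print(matches, groups_read)
```

In this toy run the first row group is never read, because its maximum value is below the filter threshold. Real engines apply the same idea to the statistics stored in the Parquet file metadata.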
Parquet files are immutable, which means that they cannot be updated in place. The only way to change the data in a Parquet file is to rewrite the whole file.
What is Delta Lake
Delta Lake is an extension to Parquet that lets you run operations such as CREATE, UPDATE, and DELETE as ACID transactions.
Delta Lake uses a transaction log to apply updates to Parquet files without completely overwriting everything. The transaction log also enables features like time travel, so you can do things like restore data to a point in time.
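The following is a simplified sketch of how a transaction log can provide updates and time travel. It is not the Delta Lake library: `commit` and `live_files` are hypothetical helpers, and the file names are made up. It borrows two details from the real format, namely a `_delta_log` directory of zero-padded, numbered JSON commit files and `add`/`remove` actions, and leaves out everything else (schema metadata, checkpoints, concurrency control).

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def commit(log_dir: Path, actions: list) -> int:
    """Write the next numbered commit file and return its version number."""
    version = len(list(log_dir.glob("*.json")))
    body = "\n".join(json.dumps(action) for action in actions)
    (log_dir / f"{version:020d}.json").write_text(body)
    return version


def live_files(log_dir: Path, as_of: Optional[int] = None) -> set:
    """Replay commits in order, optionally stopping at version `as_of`."""
    files = set()
    for path in sorted(log_dir.glob("*.json")):
        if as_of is not None and int(path.stem) > as_of:
            break
        for line in path.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp) / "_delta_log"
    log_dir.mkdir()

    # Version 0: the table is created with two data files.
    commit(log_dir, [
        {"add": {"path": "part-0.parquet"}},
        {"add": {"path": "part-1.parquet"}},
    ])
    # Version 1: an update rewrites only the affected file.
    commit(log_dir, [
        {"remove": {"path": "part-1.parquet"}},
        {"add": {"path": "part-2.parquet"}},
    ])

    latest = live_files(log_dir)
    version_0 = live_files(log_dir, as_of=0)

print(sorted(latest), sorted(version_0))
```

The update touches one data file instead of the whole table, and replaying the log only up to version 0 shows the table as it was before the update, which is the idea behind time travel.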
Delta Lake data files are still Parquet files, but they come with specific metadata and are read and written in specific ways.
To work with Delta Lake you will need to use a library or system that supports it.
The main disadvantage of Delta Lake compared to Parquet is that there is less support from libraries and data tooling.