Parquet Encoding

Trusted by over 10,000 every month

Parquet file encoding takes tabular data and changes it so that it will have a better compression ratio and take less space on disk. Encoding is performed before compression. Use this online tool to view the encoding data for your Parquet file.

How to View the Encoding for a Parquet File

  1. Upload your parquet file using the input at the top of the page
  2. View the parquet encoding in the table that appears

Use the Parquet Viewer if you want to view the contents of your Parquet file.

What is Parquet File Encoding

Parquet files are typically encoded before they get compressed.

The Parquet file encoding process takes tabular data and changes it so that it will have better compression without losing information. Encoding improves compression because it transforms the data so that it can be stored with less disk space, or so that compression algorithms like zstd will perform better.

Parquet files are encoded before they get compressed.

Encoding can work well for structured data because there are often patterns that we can exploit to reduce file size.

For example, Parquet files often store time series data. Time series data usually has similar differences between consecutive data points. This means that it compresses well with techniques like Delta Encoding (covered later).

Types of Encoding in Parquet Files

Parquet files contain different data types, like floats, ints and strings. The best encoding method depends on the data type and what kinds of values we are storing.

Parquet Dictionary Encoding

Dictionary encoding takes each unique value and assigns it a small integer. The integer is smaller than the original value, so we end up with a space reduction.

Parquet dictionary encoding

Note that the dictionary size increases with the cardinality of the data. Data that has few unique values will compress well with a dictionary encoding. Data that has a lot of different values (high cardinality) will not compress well because the dictionary will become very large.

Dictionary Encoding works well when:

  • There are lots of rows and not many unique values in the data (low cardinality)
  • Each value in the original data takes up a lot of space, for example strings

Dictionary Encoding works poorly when:

  • There are a lot of different unique values in the data (high cardinality)
  • There are not many repeated values

Dictionary encoding works especially poorly on strings that are all different, for a product description or what someone wrote in a text box on your website.

Free form strings are usually unique, which means the encoding dictionary gets really large and dominates the size of the output Parquet file. This is why Parquet files with strings often compress poorly, or are large on disk.

Parquet Delta Encoding

Parquet Delta Encoding stores the relative difference between consecutive values. This is really useful for certain types of data that change in a consistent way.

Parquet uses different types of Delta Encoding for integers and strings.

Parquet Delta Encoding for Integers

Integers (INT32 and INT64) are encoded by looking at the difference in consecutive values.

Parquet delta integer encoding

Parquet Delta Encoding for integers follows this process:

  1. The deltas are calculated for each row
  2. The minimum delta is calculated
  3. The minimum delta is subtracted from each delta to get the "Relative Deltas"
  4. The relative deltas are stored on disk

Storing the relative deltas, instead of the deltas, works really well when the data changes by about the same amount on each row.

Delta encoding works well when data changes "by about the same amount" on each row. It works poorly when the change between consecutive values is completely random.

Time series data is one example where delta encoding works really well. Time series data is often sampled at a fixed interval, so the "relative delta" between samples is often fairly consistent.

Parquet Delta String Encoding

Parquet Delta String encoding uses parts of the previous string as the starting point for the next string.

Encoded values are stored as two parts:

  1. How much of the previous value to use
  2. What to append to the previous value
Parquet delta string encoding

Delta String Encoding works well when consecutive strings build off each other. For example, the second and fourth encoded rows in the image above benefit from the delta string encoding. The first and third rows do not, because they don't build on the previous value.