Parquet Encoding
Parquet file encoding takes tabular data and changes it so that it will have a better compression ratio and take less space on disk. Encoding is performed before compression. Use this online tool to view the encoding data for your Parquet file.
How to View the Encoding for a Parquet File
- Upload your Parquet file using the input at the top of the page
- View the Parquet encoding for each column in the table that appears
Use the Parquet Viewer if you want to view the contents of your Parquet file.
What is Parquet File Encoding
Parquet files are typically encoded before they get compressed.
The Parquet file encoding process rewrites tabular data into a more compact representation without losing information. Some encodings shrink the data directly, so that it needs less disk space. Others rearrange it so that compression algorithms like zstd will perform better.
Encoding can work well for structured data because there are often patterns that we can exploit to reduce file size.
For example, Parquet files often store time series data. Time series data usually has similar differences between consecutive data points. This means that it compresses well with techniques like Delta Encoding (covered later).
Types of Encoding in Parquet Files
Parquet files contain different data types, like floats, ints and strings. The best encoding method depends on the data type and what kinds of values we are storing.
Parquet Dictionary Encoding
Dictionary encoding builds a table (the dictionary) of the unique values in a column and replaces each value with a small integer index into that table. When the index takes fewer bytes than the value it stands for, we end up with a space reduction.
Note that the dictionary size increases with the cardinality of the data. Data that has few unique values will compress well with a dictionary encoding. Data that has a lot of different values (high cardinality) will not compress well because the dictionary will become very large.
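To make the idea concrete, here is a minimal sketch of dictionary encoding in plain Python. It is an illustration only, not how any Parquet writer is implemented: the function names and the sample `cities` column are made up for this example, and a real writer would additionally pack the integer indices into as few bits as possible.

```python
def dictionary_encode(values):
    """Return (dictionary, indices) for a list of hashable values."""
    dictionary = []  # unique values, in order of first appearance
    positions = {}   # value -> index into `dictionary`
    indices = []     # one small integer per input row
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


# Hypothetical sample column with three unique values.
cities = ["London", "Paris", "London", "London", "Paris", "Tokyo"]
dictionary, indices = dictionary_encode(cities)
print(dictionary)  # ['London', 'Paris', 'Tokyo']
print(indices)     # [0, 1, 0, 0, 1, 2]
```

Decoding with `dictionary_decode(dictionary, indices)` gives back the original list, which is why the encoding loses no information.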
Dictionary Encoding works well when:
- There are lots of rows and not many unique values in the data (low cardinality)
- Each value in the original data takes up a lot of space, for example strings
Dictionary Encoding works poorly when:
- There are a lot of different unique values in the data (high cardinality)
- There are not many repeated values
Dictionary encoding works especially poorly on strings that are all different, for example product descriptions or what someone wrote in a text box on your website.
Free-form strings are usually unique, so the dictionary ends up holding almost every value and dominates the size of the output Parquet file. This is why Parquet files with free-form string columns often compress poorly, or are large on disk.
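The effect of cardinality can be estimated with a rough size model. The sketch below is a back-of-the-envelope calculation, not a measurement of real Parquet files: it assumes each stored string costs a 4-byte length prefix plus its UTF-8 bytes, and that each row's dictionary index uses just enough bits to address every dictionary entry. The function names and the two sample columns are hypothetical.

```python
def plain_size(values):
    # Assumed model: 4-byte length prefix plus the UTF-8 bytes of each string.
    return sum(4 + len(v.encode("utf-8")) for v in values)


def dictionary_size(values):
    unique = set(values)
    # Dictionary: each unique string stored once, using the same model.
    dict_bytes = sum(4 + len(v.encode("utf-8")) for v in unique)
    # Indices: enough bits per row to address every dictionary entry.
    bits_per_index = max(1, (len(unique) - 1).bit_length())
    index_bytes = (len(values) * bits_per_index + 7) // 8
    return dict_bytes + index_bytes


# Low cardinality: 3,000 rows but only 3 unique values.
low_cardinality = ["pending", "shipped", "delivered"] * 1000
# High cardinality: 3,000 rows, every value different.
high_cardinality = [f"free-form comment number {i}" for i in range(3000)]

for name, column in [("low", low_cardinality), ("high", high_cardinality)]:
    print(name, plain_size(column), dictionary_size(column))
```

Under this model the low-cardinality column shrinks dramatically, while the all-unique column ends up slightly larger than its plain form, because the dictionary repeats every value and the indices are pure overhead.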
Parquet Delta Encoding
Parquet Delta Encoding stores the relative difference between consecutive values. This is really useful for certain types of data that change in a consistent way.
Parquet uses different types of Delta Encoding for integers and strings.
Parquet Delta Encoding for Integers
Integers (INT32 and INT64) are encoded by looking at the difference between consecutive values.
Parquet Delta Encoding for integers follows this process:
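- The delta between each value and the previous value is calculated
- The minimum delta is calculated
- The minimum delta is subtracted from each delta to get the "Relative Deltas"
- The relative deltas are stored on disk, along with the first value and the minimum delta so the data can be rebuilt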
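The steps above can be sketched in a few lines of Python. This is a simplified illustration: it treats the whole column as a single block and skips the bit-packing that the real format applies to the relative deltas. The function names and the sample `timestamps` are made up for this example, and the input is assumed to be a non-empty list of integers.

```python
def delta_encode(values):
    """Return (first_value, min_delta, relative_deltas) for a non-empty list of ints."""
    first_value = values[0]
    # Step 1: delta between each value and the previous one.
    deltas = [b - a for a, b in zip(values, values[1:])]
    # Step 2: the minimum delta (0 if there is only one value).
    min_delta = min(deltas) if deltas else 0
    # Step 3: subtract the minimum so every relative delta is >= 0.
    relative_deltas = [d - min_delta for d in deltas]
    return first_value, min_delta, relative_deltas


def delta_decode(first_value, min_delta, relative_deltas):
    """Rebuild the original values by undoing the steps in reverse."""
    values = [first_value]
    for relative in relative_deltas:
        values.append(values[-1] + relative + min_delta)
    return values


timestamps = [1000, 1010, 1021, 1030, 1041]
first, min_delta, relative = delta_encode(timestamps)
print(first, min_delta, relative)  # 1000 9 [1, 2, 0, 2]
```

The original values are four-digit numbers, but the relative deltas are all between 0 and 2, so each one fits in two bits.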
Storing the relative deltas, instead of the deltas, works really well when the data changes by about the same amount on each row, because the relative deltas are then small numbers that need very few bits. It works poorly when the change between consecutive values is completely random.
Time series data is one example where delta encoding works really well. Time series data is often sampled at a fixed interval, so the delta between samples is nearly constant and the "relative delta" stays close to zero.
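One way to see the difference is to count how many bits the largest relative delta needs. The sketch below is a rough illustration under the same simplification of treating the whole column as one block; the helper name and the three synthetic columns are assumptions made for this example.

```python
import random


def bits_per_relative_delta(values):
    """Bits needed to store the largest relative delta in `values`."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas)
    widest = max(d - min_delta for d in deltas)
    return widest.bit_length()


rng = random.Random(0)  # fixed seed so the run is repeatable

# Timestamps sampled exactly every 60 seconds.
regular = [1_700_000_000 + 60 * i for i in range(1000)]
# The same timestamps with up to 2 seconds of jitter on each sample.
jittered = [t + rng.randint(0, 2) for t in regular]
# Values with no relationship between consecutive rows.
scattered = [rng.randrange(2**31) for _ in range(1000)]

for name, column in [("regular", regular), ("jittered", jittered), ("scattered", scattered)]:
    print(name, bits_per_relative_delta(column))
```

The perfectly regular series needs zero bits per relative delta, the jittered series needs only a handful, and the scattered values need roughly as many bits as the original integers, so delta encoding gains nothing there.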
Parquet Delta String Encoding
Parquet Delta String encoding uses parts of the previous string as the starting point for the next string.
Encoded values are stored as two parts:
- How much of the start of the previous value to reuse (the length of the shared prefix)
- What to append to that prefix (the suffix)
Delta String Encoding works well when consecutive strings build off each other. For example, the second and fourth encoded rows in the image above benefit from the delta string encoding. The first and third rows do not, because they don't build on the previous value.
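The prefix-and-suffix idea can be sketched as follows. This is a simplified illustration that works on Python strings rather than raw bytes and keeps each pair together in one list. The function names and the four sample words are made up for this example.

```python
def common_prefix_length(a, b):
    """Number of leading characters that `a` and `b` share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def delta_string_encode(values):
    """Encode each string as (prefix length reused from previous, suffix)."""
    encoded = []
    previous = ""
    for value in values:
        prefix_len = common_prefix_length(previous, value)
        encoded.append((prefix_len, value[prefix_len:]))
        previous = value
    return encoded


def delta_string_decode(encoded):
    """Rebuild the strings from (prefix length, suffix) pairs."""
    values = []
    previous = ""
    for prefix_len, suffix in encoded:
        value = previous[:prefix_len] + suffix
        values.append(value)
        previous = value
    return values


words = ["apple", "applesauce", "banana", "bananas"]
encoded = delta_string_encode(words)
print(encoded)
# [(0, 'apple'), (5, 'sauce'), (0, 'banana'), (6, 's')]
```

In this made-up column the second and fourth rows reuse most of the previous value and only store a short suffix, while the first and third rows share nothing with the row before them and are stored in full.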