Parquet Row Group


Columnar data in a parquet file is stored as row groups to speed up querying and writing. Information about row groups is stored in the file metadata. This Parquet Row Group Viewer can be used to see the row group metadata in a Parquet file.

Changing the Row Group size can improve the performance of your data system, but there is some nuance.

What is a Row Group

Parquet files store data as columns within Row Groups.

Each Row Group contains a column chunk for each column, holding your raw data (stored in pages on disk). The file metadata stores information about each Row Group, including the min and max value of each column. Row Groups enable predicate pushdown, which is used to speed up queries on Parquet files.
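
If you want to inspect this structure yourself from Python, here is a minimal sketch using PyArrow (the file path is a placeholder). It walks the Row Groups and prints each column's min/max statistics when the writer recorded them.

import pyarrow.parquet as pq

parquet_file = pq.ParquetFile("my/file.parquet")
metadata = parquet_file.metadata
print(f"{metadata.num_row_groups} row groups, {metadata.num_rows} rows in total")

for rg_index in range(metadata.num_row_groups):
    row_group = metadata.row_group(rg_index)
    for col_index in range(row_group.num_columns):
        column = row_group.column(col_index)
        stats = column.statistics  # can be None if the writer skipped statistics
        if stats is not None and stats.has_min_max:
            print(f"row group {rg_index}, {column.path_in_schema}: min={stats.min}, max={stats.max}")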

The Row Group size can be configured when writing a Parquet file. A correct Row Group size might speed up Parquet writing and querying.

The Row Group size does not need to be the same between Parquet files. It is possible to query multiple Parquet files with different Row Group sizes.

How to View the Row Group Information for a Parquet File

You can view the row groups of your parquet file using this online tool, or by writing code and using something like DuckDB.

How to View Row Groups Online

You can view row groups using this viewer by following these two steps.

  1. Upload your parquet file using the input at the top of the page
  2. View the parquet row group information in the table that appears

Use the Parquet Viewer if you want to view the contents of your parquet file.

Row groups are part of the Parquet Metadata. You can use the Parquet Metadata Viewer to view all of the metadata in a Parquet file.

How to View Row Groups Using DuckDB

The parquet_metadata function can be used to view row group information.

select path_in_schema, row_group_id, row_group_num_rows, row_group_num_columns, row_group_bytes from parquet_metadata('my/file.parquet')

If you want to see all of the metadata in a parquet file, you can use the following query:

select * from parquet_metadata('my/file.parquet')

I normally calculate the compression ratio for each column at the same time using the following query:

select *, total_uncompressed_size / total_compressed_size as compression_ratio from parquet_metadata('my/file.parquet')
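
If you would rather run these checks from Python, a small sketch using the duckdb Python package might look like this (the file path is a placeholder):

import duckdb

duckdb.sql("""
    select
        path_in_schema,
        row_group_id,
        row_group_num_rows,
        total_compressed_size,
        total_uncompressed_size / total_compressed_size as compression_ratio
    from parquet_metadata('my/file.parquet')
""").show()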

How Do Row Groups Speed Up Parquet File Processing

Row groups allow tools to query Parquet files without reading the entire file. This is really important because I/O throughput is often the speed bottleneck in big data processing applications.

A lot of tools, like DuckDB or Clickhouse, implement predicate pushdown. This means that they can figure out what data to read from disk based on the filters or WHERE clauses in a query.

Parquet Row Groups allow data analysis tools to read a small amount of data from each parquet file to perform analysis. Each Row Group has metadata that includes the max and min values of each column, so the querying tool can figure out whether a Row Group needs to be read at all.

Without row groups, these tools would need to read the entire Parquet file into RAM before applying any filtering. Parquet files can be quite large (1GB+). So it might be slow to read lots of complete files into RAM, or we might have to add lots of RAM to the machine.

Row Groups are especially useful for timeseries data. Timeseries data is often queried by time range, e.g. "show me the average electricity price over the last 3 months". If you order a parquet file by timestamp, then you can query parquet files by only reading the row groups in the time range that you are interested in.
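
To make the mechanism concrete, here is a rough sketch of row-group pruning done by hand with PyArrow. Query engines like DuckDB do this for you automatically; the file name, the "timestamp" column, and the assumption of a flat schema with a timezone-naive timestamp are all made up for illustration.

from datetime import datetime
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile("prices.parquet")          # hypothetical file sorted by "timestamp"
ts_index = parquet_file.schema_arrow.get_field_index("timestamp")

start, end = datetime(2024, 1, 1), datetime(2024, 4, 1)  # the time range we care about

keep = []
for rg_index in range(parquet_file.metadata.num_row_groups):
    stats = parquet_file.metadata.row_group(rg_index).column(ts_index).statistics
    # Read the Row Group only if its [min, max] range can overlap the query range.
    # If statistics are missing we have to read it to be safe.
    if stats is None or not stats.has_min_max or (stats.max >= start and stats.min <= end):
        keep.append(rg_index)

table = parquet_file.read_row_groups(keep)  # only the selected Row Groups are read from disk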

Do I Need to Change the Row Group Size

It might be possible to speed up your data analysis by using the correct Row Group size.

But there are a lot of things that can slow down data analysis systems. If your data processing is slow, then there is a good chance that the bottleneck could be somewhere else.

Choosing a better sorting strategy might have more impact than choosing a different Row Group size. E.g. if you have time series data that you query by time-range, then you might want to sort by timestamp. If there are other columns that are often used in filters, then you might want to sort by them.
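
As a sketch, sorting time series data before writing might look like this (the file names and the "timestamp" column are assumptions):

import pandas as pd

df = pd.read_parquet("prices_unsorted.parquet")
df = df.sort_values("timestamp")
df.to_parquet("prices_sorted.parquet", row_group_size=100_000, engine="pyarrow")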

It is also a good idea to think about whether your system needs to be faster. Tuning things like Row Group size can make things more complicated.

For row groups in particular, the tuning depends on the type of data you have and the specs of the computers you are using to read and write the Parquet files. If your data or the computers you use change, then you may need to choose a new Row Group size.

What Are the Symptoms of an Incorrect Row Group Size

If your Row Group size is wrong and is having an impact on your system, then you might see the following symptoms:

Slow parquet file writes

This can happen when the machine writing the Parquet file does not have enough RAM to hold an entire Row Group, so it has to make two passes to write each Row Group.

Slow queries

Tools like DuckDB or Clickhouse take a long time to query your Parquet files because they need to read a lot of data.

Note that query speed is often affected more by the sort order of your Parquet file than by the Row Group size.

Queries are not parallelized

Some tools like DuckDB only assign one thread per Row Group. If there are fewer row groups than CPU cores then increasing the number of row groups might speed up processing.
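
One quick way to spot this situation, sketched with PyArrow (placeholder file path):

import os
import pyarrow.parquet as pq

num_row_groups = pq.ParquetFile("my/file.parquet").metadata.num_row_groups
num_cores = os.cpu_count()

if num_row_groups < num_cores:
    print(f"Only {num_row_groups} row groups for {num_cores} cores; "
          "a smaller row group size may allow more parallelism")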

Conclusion

Overall, before tuning Row Group size, I recommend looking at the entire system to figure out:

  1. Does this need to be faster, and
  2. What is slowing down processing

What Is the Best Row Group Size for Parquet

The official docs recommend a 512MB-1GB row group size.

Row Group size is a tradeoff, and the best choice will vary from case to case.

Big row groups:

  • Enable faster I/O throughput because the computer is performing sequential writes
  • Require more RAM because entire row groups are often loaded into RAM

Small row groups:

  • Slower I/O throughput because a smaller percentage of reads is sequential
  • Better read parallelization (depending on number of cores)
  • Queries may require less data to be loaded from disk (predicate pushdown)

Bigger Row Groups enable faster I/O because there are more sequential reads and writes. Row groups are often fully loaded into RAM, so you might need a machine with extra RAM.

If Row Groups do not fit into RAM then writing a Parquet file to disk might be slow because your computer might need to perform two passes. Reading a Parquet file might be slow for similar reasons.

If your Row Groups are very small then one query might need to load a lot of different Row Groups. These reads might be slow because they are not sequential, so the I/O throughput might be lower than you expect.

Small row groups can enable very efficient querying if you have chosen a good sorting key. In these cases, the querying tool (DuckDB, Clickhouse, ...) might be able to perform predicate pushdown and only read a small number of small row groups from disk. This can speed up querying a lot because I/O is often the bottleneck in big data analysis systems.
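
One way to sanity-check your current Row Group sizes against your RAM and the guideline above is to print them per file, for example with PyArrow (the file path is a placeholder):

import pyarrow.parquet as pq

metadata = pq.ParquetFile("my/file.parquet").metadata

for rg_index in range(metadata.num_row_groups):
    row_group = metadata.row_group(rg_index)
    # total_byte_size is the uncompressed size; sum the column chunks for the compressed size.
    compressed = sum(row_group.column(i).total_compressed_size for i in range(row_group.num_columns))
    print(f"row group {rg_index}: {row_group.num_rows} rows, "
          f"{row_group.total_byte_size / 1e6:.1f} MB uncompressed, {compressed / 1e6:.1f} MB compressed")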

How to Set Parquet Row Group Size

Parquet Row Group sizes are set when the file is written.

Row group sizes don't need to be the same in different Parquet files. You can query multiple Parquet files with different Row Group sizes in one query.

Row group sizes don't normally need to be set explicitly. Most tools have a sensible default that you can just use.
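
If you do want to set it explicitly and are writing with PyArrow directly, the row_group_size argument of write_table controls it (the pandas example in the next section passes this same option through to PyArrow). A minimal sketch with made-up data:

import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table and write it with 1000 rows per Row Group.
table = pa.table({"id": list(range(10_000)), "value": [i * 0.5 for i in range(10_000)]})
pq.write_table(table, "my_file.parquet", row_group_size=1000)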

How to Set Parquet Row Group Size With Pandas

Setting the Row Group size with Pandas depends on what engine you are using.

PyArrow engine

The following will create a Parquet file with 1000 rows per Row Group using the PyArrow engine.

df.to_parquet("my_file.parquet", row_group_size=1000, engine="pyarrow")

FastParquet engine

The following will create a Parquet file with 1000 rows per Row Group using the FastParquet engine.

df.to_parquet("my_file.parquet", row_group_offsets=1000, engine="fastparquet")

How to Set Parquet Row Group Size With DuckDB

Row group size is controlled by the ROW_GROUP_SIZE parameter.

The following query will create a Parquet file with 100,000 rows per Row Group.

COPY (SELECT * FROM tbl) TO 'result-snappy.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 100000)
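
If you are calling DuckDB from Python, a sketch of the same COPY through the duckdb package might look like this (DuckDB can read the local pandas DataFrame df by name; the data here is made up):

import duckdb
import pandas as pd

df = pd.DataFrame({"id": range(10_000), "value": range(10_000)})

duckdb.sql("""
    COPY (SELECT * FROM df)
    TO 'result-snappy.parquet'
    (FORMAT PARQUET, ROW_GROUP_SIZE 100000)
""")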

Conclusion

Tuning the Row Group sizes of your Parquet files can lead to faster write and query speed. The best size will depend on your data, how it is sorted, and the specs of the computers you are using.

Overall, before tuning Row Group size I would:

  1. Figure out if your data system is actually too slow, and
  2. Determine where the performance bottleneck is

In many cases, choosing a better sort key for your Parquet file might have a bigger impact on query performance than changing the Row Group size.