How To Use Pandas with Parquet Files

Parquet & Pandas

Parquet files work well with Pandas because they have a schema and are fast to load.

How to Read a Parquet File with Pandas

The pandas.read_parquet function is the easiest way to read a Parquet file with Pandas.

import pandas as pd
df = pd.read_parquet('my-parquet-file.parquet')

Parquet files have a strict schema, which means the columns in your dataframe should have the correct types. Creating dataframes from other formats, like CSV, can sometimes give unexpected column types (for example, a single string in a column of numbers makes the whole column non-numeric).
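
To illustrate the CSV pitfall, here is a minimal sketch. The inline CSV text and the column names id and amount are made up for the example, and coercing with pd.to_numeric is just one possible way to clean the column.

```python
import io

import pandas as pd

# Hypothetical CSV data: one stray string in an otherwise numeric column.
csv_text = "id,amount\n1,10.5\n2,unknown\n3,7.25\n"

raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)  # amount is not numeric because of the "unknown" entry

# Coerce the column to numbers; values that cannot be parsed become NaN.
clean = raw.copy()
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")
print(clean.dtypes)  # amount is now a float column
```

Once the types are fixed, writing the dataframe to Parquet stores them in the file, so the next read does not have to guess.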

There is also a trick that can make reading faster: choosing the engine explicitly.

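pandas.read_parquet takes an engine argument, which can be auto (default), fastparquet or pyarrow. With auto, Pandas tries pyarrow first and falls back to fastparquet if pyarrow is not installed. I have found fastparquet to be the fastest engine for my files, and quite a lot faster than what auto picks by default.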

To use the fastparquet engine you will need to install it first:

pip install fastparquet

Then you can read your Parquet files into Pandas:

import pandas as pd
df = pd.read_parquet('my-parquet-file.parquet', engine='fastparquet')

How to read a directory of Parquet files

Pandas can read a directory of Parquet files with the read_parquet function:

import pandas as pd
df = pd.read_parquet('path/to/directory')

My preference is to use the fastparquet engine:

# Requires: pip install fastparquet
import pandas as pd
df = pd.read_parquet('path/to/directory', engine='fastparquet')

How to read multiple Parquet files

We can use the pandas.concat function to combine multiple Parquet files into one dataframe.

import pandas as pd
df1 = pd.read_parquet('first-file.parquet')
df2 = pd.read_parquet('second-file.parquet')
df = pd.concat([df1, df2])

How to create a Parquet file with Pandas

The DataFrame.to_parquet method is the easiest way to create a Parquet file with Pandas.

import pandas as pd
df = ...
df.to_parquet('my-output-file.parquet')

Again, you can try the fastparquet engine, which may speed things up, if you install it first:

# Requires: pip install fastparquet
import pandas as pd
df = ...
df.to_parquet('my-output-parquet-file.parquet', engine='fastparquet')

The compression argument lets you choose the compression technique.

Each compression technique has its tradeoffs, mostly between file size and read/write speed. The options include snappy (default), gzip, brotli and None (no compression). Recent Pandas versions also accept lz4 and zstd.

I usually stick to the default snappy technique. Snappy is the default in many Parquet tools, so it is well-understood and widely supported. A different compression technique might be a better fit for your use case though.

import pandas as pd
df = ...
df.to_parquet('my-output-parquet-file.parquet', engine='fastparquet', compression='snappy')

Pyarrow vs FastParquet

pyarrow and fastparquet are the two engines that Pandas can use for reading and writing Parquet files. They can be selected using the engine parameter of pandas.read_parquet and DataFrame.to_parquet.

The main difference between the fastparquet and pyarrow engines is that they use different libraries under the hood. fastparquet is a Python implementation with compiled extensions (older releases relied on numba), while pyarrow is the Python binding for the Apache Arrow C++ library.

I generally use the fastparquet engine because I have found it faster than pyarrow in my use cases, but both engines are good quality.

If your parquet files are slow to read or write, then it might be worth changing engines and seeing if the performance improves.

How to read a .parquet.gzip file in Python

Pandas can be used to read a .parquet.gzip file.

import pandas as pd
df = pd.read_parquet('my-file.parquet.gzip')

Gzip is one of the supported compression techniques for Parquet files, and the compression used is recorded inside the file itself. So most methods of reading Parquet files should work without any extra arguments.

How to convert CSV to Parquet in Python

We can use pandas.read_csv and DataFrame.to_parquet to convert a CSV to Parquet in Python.

import pandas as pd
df = pd.read_csv('my-csv.csv')
df.to_parquet('my-parquet.parquet')

You can also convert a CSV to Parquet without writing any code using the CSV to Parquet Converter.

How to convert Parquet to CSV in Python

We can use pandas.read_parquet and DataFrame.to_csv to convert a Parquet file to CSV in Python.

import pandas as pd
df = pd.read_parquet('my-parquet.parquet')
df.to_csv('my-csv.csv')
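
By default to_csv also writes the dataframe index as an unnamed first column, which is often not wanted in the output. A small sketch of the difference, using a made-up dataframe and in-memory buffers instead of files:

```python
import io

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

with_index = io.StringIO()
df.to_csv(with_index)  # default: the index becomes an unnamed first column

without_index = io.StringIO()
df.to_csv(without_index, index=False)  # only the real columns are written

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)     # ,id,name
print(header_without)  # id,name
```

If the index carries no meaning in your data, passing index=False keeps the CSV limited to the actual columns.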

You can also convert Parquet to CSV without writing any code using the Parquet to CSV Converter.