How To Use Pandas with Parquet Files
Parquet files work well with Pandas because they have a schema and are fast to load.
How to Read a Parquet File with Pandas
The pandas.read_parquet function is the easiest way to read a Parquet file with Pandas:
import pandas as pd
df = pd.read_parquet('my-parquet-file.parquet')
Parquet files have a strict schema, which means the columns in your dataframe should have the correct types. Creating dataframes from other formats, like CSV, can sometimes give unexpected column types (for example, when there is one string in a column of numbers).
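As a small illustration of the CSV type problem described above, the sketch below parses a made-up CSV snippet in which one value in the price column is a string. The column names and the data are invented for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content: the "price" column has one non-numeric value.
csv_text = "id,price\n1,9.99\n2,unknown\n3,4.50\n"

df = pd.read_csv(io.StringIO(csv_text))

# "id" is parsed as an integer column, but "price" falls back to a
# string-like dtype because of the single non-numeric entry.
print(df.dtypes)

price_is_numeric = pd.api.types.is_numeric_dtype(df["price"])
print(price_is_numeric)
```

A Parquet file stores the column types alongside the data, so this kind of guessing does not happen when the file is read back.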
There is also a trick to make reading Parquet files faster.
pandas.read_parquet takes an engine argument, which can be auto (default), fastparquet or pyarrow.
I have found fastparquet to be the fastest engine, and quite a lot faster than the default auto setting (which tries pyarrow first and falls back to fastparquet if pyarrow is not installed).
To use the fastparquet engine you will need to install it first:
pip install fastparquet
Then you can read your Parquet files into Pandas:
import pandas as pd
df = pd.read_parquet('my-parquet-file.parquet', engine='fastparquet')
How to read a directory of Parquet files
Pandas can read a directory of Parquet files with the read_parquet function:
import pandas as pd
df = pd.read_parquet('path/to/directory')
My preference is to use the fastparquet engine:
pip install fastparquet
import pandas as pd
df = pd.read_parquet('path/to/directory', engine='fastparquet')
How to read multiple Parquet files
We can use the pandas.concat function to combine multiple Parquet files into one dataframe:
import pandas as pd
df1 = pd.read_parquet('first-file.parquet')
df2 = pd.read_parquet('second-file.parquet')
df = pd.concat([df1, df2])
How to create a Parquet file with Pandas
The DataFrame.to_parquet method is the easiest way to create a Parquet file with Pandas:
import pandas as pd
df = ...
df.to_parquet('my-output-file.parquet')
Again, you can use the fastparquet engine to speed things up if you install it first:
pip install fastparquet
import pandas as pd
df = ...
df.to_parquet('my-output-parquet-file.parquet', engine='fastparquet')
The compression argument lets you choose the compression technique.
Each compression technique has its tradeoffs. The options are snappy (default), gzip, brotli and None. Recent Pandas releases also list lz4 and zstd.
I usually stick to the default snappy technique. Snappy is also the default in many other Parquet writers, so it is well understood and widely supported.
A different compression technique might be a better fit for your use case though.
import pandas as pd
df = ...
df.to_parquet('my-output-parquet-file.parquet', engine='fastparquet', compression='snappy')
Pyarrow vs FastParquet
pyarrow and fastparquet are the two engines that Pandas can use for reading and writing Parquet files.
They can be selected using the engine parameter of pandas.read_parquet and DataFrame.to_parquet.
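The main difference between the fastparquet and pyarrow engines is that they use different libraries under the hood. fastparquet is a Python implementation that has relied on numba for speed (newer releases use Cython instead, as far as I know), while pyarrow is the Python binding to the Apache Arrow C++ library.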
I generally use the fastparquet engine because I have found it faster than pyarrow in my use cases. But both engines are good quality.
If your parquet files are slow to read or write, then it might be worth changing engines and seeing if the performance improves.
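A rough way to check this on your own machine is to time the same read with each installed engine. The sketch below is an assumption-laden illustration rather than a proper benchmark: the helper names installed_engines and time_read are made up, the sample data is synthetic, and results will vary with file size, column types and library versions.

```python
import importlib.util
import tempfile
import time
from pathlib import Path

import pandas as pd


def installed_engines():
    """Return the Parquet engines that can be imported in this environment."""
    candidates = ("pyarrow", "fastparquet")
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


def time_read(path, engine, repeats=3):
    """Return the best wall-clock time in seconds over a few repeated reads."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        pd.read_parquet(path, engine=engine)
        best = min(best, time.perf_counter() - start)
    return best


engines = installed_engines()
timings = {}
if engines:
    # Synthetic sample data, written once with whichever engine is available.
    df = pd.DataFrame({"id": range(100_000), "value": [i * 0.5 for i in range(100_000)]})
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.parquet"
        df.to_parquet(path, engine=engines[0])
        timings = {engine: time_read(path, engine) for engine in engines}

for engine, seconds in timings.items():
    print(f"{engine}: {seconds:.4f} s")
```

For a decision that matters, run the comparison against one of your real files rather than synthetic data.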
How to read a .parquet.gzip file in Python
Pandas can be used to read a .parquet.gzip file:
import pandas as pd
df = pd.read_parquet('my-file.parquet.gzip')
Gzip is one of the supported compression techniques for Parquet files, and the codec is recorded in the file's own metadata. So most methods of reading Parquet files should work without any extra arguments.
How to convert CSV to Parquet in Python
We can use the pandas.read_csv function and the DataFrame.to_parquet method to convert a CSV to Parquet in Python:
import pandas as pd
df = pd.read_csv('my-csv.csv')
df.to_parquet('my-parquet.parquet')
You can also convert a CSV to Parquet without writing any code using the CSV to Parquet Converter.
How to convert Parquet to CSV in Python
We can use the pandas.read_parquet function and the DataFrame.to_csv method to convert a Parquet file to CSV in Python:
import pandas as pd
df = pd.read_parquet('my-parquet.parquet')
df.to_csv('my-csv.csv')
You can also convert Parquet to CSV without writing any code using the Parquet to CSV Converter.