How To Use Delta Lake in Python with Pandas and without Spark
First, install dependencies.
pip install pandas pyarrow deltalake
Load an example dataframe.
import pandas as pd

iris_df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
Writing a Delta Table with Pandas
We can create a Delta Table using the deltalake writer.
from deltalake.writer import write_deltalake
write_deltalake('tables/foo', iris_df, mode='overwrite')
We should now have a tables/foo directory containing a JSON commit file in the transaction log and a Parquet data file.
_delta_log/00000000000000000000.json
0-e3cf5881-2663-4748-a4ab-afeb30e035af-0.parquet # Note your Parquet file might have a different name
The JSON file contains metadata about the commit. On my machine it records the table schema, the Parquet file that was added, and the operation that was performed, amongst other things.
The Parquet file holds the data in the Delta table, but might have a different name on your machine.
We can append the same dataframe to the Delta table by running
write_deltalake('tables/foo', iris_df, mode='append')
This creates a new JSON file under _delta_log and a new Parquet file.
The new JSON file records that the operation was an append, metadata about the new Parquet file, and the commit timestamp, amongst other things.
Reading a Delta Table with Pandas
We can read a Delta Table into Pandas using the DeltaTable class.
from deltalake import DeltaTable
dt = DeltaTable('tables/foo')
df = dt.to_pandas()
This loads the latest version of the Delta Table into a Pandas DataFrame.
We can perform time travel by loading a previous version of the table.
dt = DeltaTable('tables/foo', version=0)
df = dt.to_pandas()
This loads the first version of the table, that is, the version before we appended the extra dataframe.
This post has briefly covered how to use Delta Lake with Pandas so that you don't need PySpark. This approach is best suited to cases where you have a small Delta table that fits in RAM.