How To Use Delta Lake in Python with Pandas and without Spark
We can use Pandas with the delta-rs package to read and write delta tables without Spark.
First, install dependencies.
pip install pandas pyarrow deltalake
Load an example dataframe.
import pandas as pd

iris_df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
Writing a Delta Table with Pandas
We can create a Delta Table using the deltalake writer.
from deltalake.writer import write_deltalake
write_deltalake('tables/foo', iris_df, mode='overwrite')
We should now have the following files.
/tables/foo/_delta_log/00000000000000000000.json
/tables/foo/0-e3cf5881-2663-4748-a4ab-afeb30e035af-0.parquet # Note: your Parquet file will have a different name
The JSON file contains metadata about the Parquet file that was created. On my machine it includes the table schema, the file that was added, and the operation that was performed (CREATE TABLE), amongst other things.
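If you want to see what a commit holds, you can parse the log yourself: each commit file is newline-delimited JSON, one action per line. A minimal sketch using only the standard library (the sample entries below are illustrative, not copied from a real log):

```python
import json

# Illustrative commit entries; a real 00000000000000000000.json has one
# JSON action per line, e.g. protocol, metaData, add, commitInfo.
sample_log = '\n'.join([
    json.dumps({'protocol': {'minReaderVersion': 1, 'minWriterVersion': 2}}),
    json.dumps({'add': {'path': '0-e3cf5881-...-0.parquet', 'dataChange': True}}),
])

# The first key of each line tells you which action it is
actions = [next(iter(json.loads(line))) for line in sample_log.splitlines()]
print(actions)  # ['protocol', 'add']
```

Pointing the same loop at a real commit file under _delta_log shows exactly which files and metadata each write touched.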
The Parquet file contains the table's data; as noted above, it will have a different name on your machine.
We can append the same dataframe to the Delta table by running write_deltalake with mode='append'.
write_deltalake('tables/foo', iris_df, mode='append')
This creates a new JSON file under _delta_log and a new Parquet file. The new JSON file records that the operation was an append, metadata about the new Parquet file, and the commit timestamp, amongst other things.
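Commit files are numbered with zero-padded, 20-digit version numbers, so after the append there should be two of them. A quick sketch that simulates the layout with the standard library only (the directory and filenames here are made up for illustration):

```python
from pathlib import Path
import tempfile

# Simulate the _delta_log directory as it looks after the append (version 1)
log = Path(tempfile.mkdtemp()) / 'tables' / 'foo' / '_delta_log'
log.mkdir(parents=True)
for version in (0, 1):
    (log / f'{version:020d}.json').touch()

commits = sorted(p.name for p in log.glob('*.json'))
print(commits)
# ['00000000000000000000.json', '00000000000000000001.json']
```

Running the same glob against your real table is a quick way to confirm how many versions it has.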
Reading a Delta Table with Pandas
We can read a Delta Table into Pandas using the DeltaTable constructor.
from deltalake import DeltaTable
dt = DeltaTable('tables/foo')
df = dt.to_pandas()
This loads the latest version of the Delta Table into a Pandas DataFrame.
We can perform time travel by loading a previous version of the table.
dt = DeltaTable('tables/foo', version=0)
df = dt.to_pandas()
This loads the first version of the table, that is, the version before we appended the extra dataframe.
Conclusion
This post has very briefly gone into how to use Delta Lake with Pandas so that you don't need to use PySpark. This method is best suited to cases where you have a small Delta Table that fits in RAM.