Parquet is a columnar data storage format that is part of the Apache Hadoop ecosystem.
If you are in the habit of saving large CSV files to disk as part of your data processing workflow, it can be worth switching to Parquet for these types of tasks. It will result in smaller files that are quicker to load.
With large datasets or expensive computations it's convenient to dump the resulting dataframes to Parquet so that you can easily and quickly load them later. For example, if after you initially load your data you pass it through a series of (sometimes time-consuming) steps to clean and transform it, it can be useful to dump that dataframe to Parquet so that you can load it easily the next time you want to use it, or share it with someone else.
To save a dataframe to parquet
df.to_parquet('df.parquet.gzip', compression='gzip')
To load a dataframe from parquet
If you have a dataframe saved in parquet format you can do
df = pd.read_parquet('df.parquet.gzip')
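Putting the two together, a common pattern is to cache a cleaned dataframe on disk and only rebuild it when the cache file is missing. Below is a minimal sketch of that idea; the file names, the placeholder cleaning step, and the load_cleaned_data helper are all hypothetical.

import os
import pandas as pd

CACHE_PATH = 'cleaned.parquet.gzip'  # hypothetical cache file

def load_cleaned_data():
    # Reuse the cached parquet file if it already exists
    if os.path.exists(CACHE_PATH):
        return pd.read_parquet(CACHE_PATH)
    # Otherwise run the (potentially slow) load/clean steps and cache the result
    df = pd.read_csv('raw_data.csv')         # hypothetical raw input
    df = df.dropna().reset_index(drop=True)  # stand-in for the real cleaning/transform steps
    df.to_parquet(CACHE_PATH, compression='gzip')
    return df

df = load_cleaned_data()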
Parquet engines
There are a couple of parquet libraries you can use under the hood. The default is pyarrow. However, if any of your dataframe columns contain complex objects such as dicts, you may want to switch to fastparquet.
pip install fastparquet
df.to_parquet('df.parquet.gzip', compression='gzip', engine='fastparquet')
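To illustrate the dict case, here is a rough sketch of writing a dataframe that has a dict column with the fastparquet engine. The extra object_encoding='json' keyword is forwarded by pandas to fastparquet and asks it to store object columns as JSON; the file and column names here are made up, and exact round-trip behaviour can vary between fastparquet versions.

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'meta': [{'colour': 'red'}, {'colour': 'blue', 'size': 4}],  # dict column
})

# object_encoding is passed through to fastparquet, which serialises the dicts as JSON
df.to_parquet('df_meta.parquet.gzip', compression='gzip',
              engine='fastparquet', object_encoding='json')

df2 = pd.read_parquet('df_meta.parquet.gzip', engine='fastparquet')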