首先确保你有一个相当新的 pandas 和 pyarrow 版本:
pyenv shell 3.8.2
python -m venv venv
source venv/bin/activate
pip install pandas pyarrow
pip freeze | grep pandas # pandas==1.2.3
pip freeze | grep pyarrow # pyarrow==3.0.0
然后您可以使用partition_cols
生成分区拼花文件:
import pandas as pd
# example dataframe with 3 rows and columns year,month,day,value
df = pd.DataFrame(data={'year': [2020, 2020, 2021],
'month': [1,12,2],
'day': [1,31,28],
'value': [1000,2000,3000]})
df.to_parquet('./mydf', partition_cols=['year', 'month', 'day'])
这会产生:
mydf/year=2020/month=1/day=1/6f0258e6c48a48dbb56cae0494adf659.parquet
mydf/year=2020/month=12/day=31/cf8a45116d8441668c3a397b816cd5f3.parquet
mydf/year=2021/month=2/day=28/7f9ba3f37cb9417a8689290d3f5f9e6e.parquet