I have an S3 folder containing multiple .feather files, and I want to load them into dask from Python, as described in Load many feather files in a folder into dask. I tried two approaches, and each gave me a different error:
import pandas as pd
import pyarrow.feather as feather
import dask
import dask.bytes
import dask.dataframe
ftr_filenames = [
's3://my-bucket/my-dir/file_1.ftr',
's3://my-bucket/my-dir/file_2.ftr',
.
.
.
's3://my-bucket/my-dir/file_30.ftr'
]
delayed_files = dask.bytes.open_files(ftr_filenames, 'rb')
# ---------------------------option 1 --------------------------------
dfs = [dask.delayed(pd.read_feather)(f) for f in delayed_files]
# ---------------------------option 2 --------------------------------
dfs = [dask.delayed(feather.read_dataframe)(f) for f in delayed_files]
# --------------------------------------------------------------------
df = dask.dataframe.from_delayed(dfs)
# -------------------------- error 1 ------------------------------
# 'S3File' object has no attribute '__fspath__'
# -------------------------- error 2 ------------------------------
# Cannot convert OpenFile to pyarrow.lib.NativeFile
Is there another way to read these files from S3? The main purpose here is to circumvent the … caused by pd.concat