hdfs - Reading Parquet files from HDFS using PyArrow

Question

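I know I can connect to an HDFS cluster with pyarrow using pyarrow.hdfs.connect().

I also know I can read a Parquet file using pyarrow.parquet's read_table().

However, read_table() accepts a file path, whereas hdfs.connect() gives me a HadoopFileSystem instance.

Is it possible to use only pyarrow (with libhdfs3 installed) to get hold of a Parquet file or folder residing in an HDFS cluster? What I want to end up with is the to_pydict() function, so that I can pass the data along.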

score 6 · Accepted Answer

Try:

import pyarrow as pa
fs = pa.hdfs.connect(...)
table = fs.read_parquet('/path/to/hdfs-file', **other_options)

Or:

import pyarrow.parquet as pq
with fs.open(path) as f:
    table = pq.read_table(f, **read_options)

I opened https://issues.apache.org/jira/browse/ARROW-1848 about adding some more explicit documentation.

score 1 · Accepted Answer

I tried the same thing via the Pydoop library with engine='pyarrow', and it worked perfectly for me. Here is the generalized method.

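# Install first (notebook syntax): !pip install pydoop pyarrow
import logging

import pandas as pd
import pydoop.hdfs as hd

logger = logging.getLogger(__name__)

# Read a Parquet file via Pydoop and return it as a pandas DataFrame
def readParquetFilesPydoop(path):
    with hd.open(path) as f:
        df = pd.read_parquet(f, engine='pyarrow')
        logger.info('file: ' + path + ' : ' + str(df.shape))
        return df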
