python - Reading CSV with PyArrow

Question

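I have large CSV files that I ultimately want to convert to Parquet. Pandas won't help because of memory constraints and its difficulty handling NULL values (which are common in my data). I checked the PyArrow docs and there are tools for reading Parquet files, but I didn't see anything about reading CSVs. Did I miss something, or is this feature somehow incompatible with PyArrow?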

score 4 · Accepted Answer

We're working on this feature, and there is a pull request up now: https://github.com/apache/arrow/pull/2576. You can help out by testing it!

score 2 · Accepted Answer

You can read the CSV in chunks with pd.read_csv(chunksize=...), then write one chunk at a time with PyArrow.

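One caveat, as you mentioned: if a column is entirely null within one chunk, Pandas will infer a different dtype for that chunk than for the others, so you have to make sure the chunk size is larger than the longest run of nulls in your data.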

This reads CSV from standard input and writes Parquet to standard output (Python 3).

#!/usr/bin/env python
import sys

import pandas as pd
import pyarrow.parquet

# This has to be big enough you don't get a chunk of all nulls: https://issues.apache.org/jira/browse/ARROW-2659
SPLIT_ROWS = 2 ** 16

def main():
    writer = None
    for split in pd.read_csv(sys.stdin.buffer, chunksize=SPLIT_ROWS):
        table = pyarrow.Table.from_pandas(split, preserve_index=False)
        # Timestamps have issues if you don't convert to ms. https://github.com/dask/fastparquet/issues/82
        writer = writer or pyarrow.parquet.ParquetWriter(sys.stdout.buffer, table.schema, coerce_timestamps='ms', compression='gzip')
        writer.write_table(table)
    # writer is still None if no chunk was read, so guard the close.
    if writer is not None:
        writer.close()

if __name__ == "__main__":
    main()

