python - Pyarrow.lib.Schema vs. pyarrow.parquet.Schema

Question

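When I try to load parquet files that span multiple partitions, some of the schemas are inferred incorrectly because missing data gets filled in with nulls. I thought that specifying the schema in pyarrow.parquet.ParquetDataset would solve this, but I don't know how to construct a schema of the correct pyarrow.parquet.Schema type. Some example code: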

import pyarrow as pa
import pyarrow.parquet as pq
test_schema = pa.schema([pa.field('field1', pa.string()), pa.field('field2', pa.float64())])
paths = ['test_root/partition1/file1.parquet', 'test_root/partition2/file2.parquet']
dataset = pq.ParquetDataset(paths, schema=test_schema)

And the error:

AttributeError: 'pyarrow.lib.Schema' object has no attribute 'to_arrow_schema'

But I cannot find anything in the docs ( https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html ) on how to construct a pyarrow.parquet.Schema schema. I have only managed to make a pyarrow.lib.Schema, which gives the error above.

score 2 · Accepted Answer

There is not yet an API for constructing a Parquet schema in Python. You can, however, use one that is read from a particular file (see pq.ParquetFile(...).schema).

Could you open an issue on the ARROW JIRA project to request a feature for constructing Parquet schemas in Python?

https://issues.apache.org/jira
