python - Saving a JSON file to HDFS using pyarrow and json.dump

Question

I am trying to save a JSON file to HDFS using pyarrow. This is what my code looks like.

import json
from pyarrow import hdfs

fs = hdfs.connect(driver='libhdfs')
with fs.open(outputFileVal1, 'wb') as fp:
    json.dump(list(value1set), fp)

This fails with the error TypeError: a bytes-like object is required, not 'str'

When I use joblib.dump or pickle.dump instead, it works, but the output is not saved in JSON format. Is there a way to save a JSON file directly to HDFS using pyarrow?

score 2 · Accepted Answer

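It looks like you may need a wrapper that encodes the text written by json.dump into bytes, using chunk.encode('utf8'). json.dump writes str chunks, while a file opened with 'wb' accepts only bytes. Something like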

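class Utf8Encoder(object):
    # Wraps a binary file object and encodes any text written to it as UTF-8.

    def __init__(self, fp):
        self.fp = fp

    def write(self, data):
        # json.dump passes str chunks; convert them to bytes before writing.
        if not isinstance(data, bytes):
            data = data.encode('utf-8')
        self.fp.write(data)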

Then you can write

json.dump(..., Utf8Encoder(fp))
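
To check the wrapper without a cluster, here is a minimal sketch that uses io.BytesIO as a stand-in for the binary handle returned by fs.open(..., 'wb'). The BytesIO buffer, the sample values, and the names buffer, raw and alt are assumptions for illustration only; they are not part of the original code. The last line shows an alternative that should behave the same way: serialize with json.dumps and encode the result once.

```python
import io
import json


class Utf8Encoder:
    """Wrap a binary file object so that text written to it is UTF-8 encoded."""

    def __init__(self, fp):
        self.fp = fp

    def write(self, data):
        # json.dump hands over str chunks; binary files need bytes.
        if not isinstance(data, bytes):
            data = data.encode("utf-8")
        self.fp.write(data)


# Assumption: io.BytesIO stands in for the handle from fs.open(path, 'wb').
buffer = io.BytesIO()
values = {"a", "b", "c"}  # hypothetical sample data

# Sets are not JSON serializable, so convert to a (sorted) list first.
json.dump(sorted(values), Utf8Encoder(buffer))
raw = buffer.getvalue()

# Alternative without a wrapper: build the whole string, then encode it once.
alt = json.dumps(sorted(values)).encode("utf-8")
```

With a real HDFS handle, the second form would be written as fp.write(json.dumps(...).encode('utf-8')). It holds the whole document in memory, whereas the wrapper lets json.dump stream chunk by chunk.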
