python - Converting csv to hdf5 with vaex.from_csv fails: 'DataFrameArrays' object has no attribute 'dtype'

Question

I have a csv file with more than 13 million rows that I want to convert to hdf5. This code runs fine:

import vaex as vx
df_chunk = vx.from_csv(r'df.csv', nrows=20_000_000)

But if I then run the following:

df_chunk.export(r'df.hdf5')

I get this error:

AttributeError: 'DataFrameArrays' object has no attribute 'dtype'

The same error occurs when I run:

df_chunk = vx.from_csv(r'df.csv', convert='True', nrows=20_000_000)

Can you tell me what is going wrong, or how I can fix it? Thanks.

score 2 · Accepted Answer

I downgraded Python to 3.7, reinstalled the newer Vaex release (4.0), and ran the code again; everything worked without errors. Thank you all for your attention and help.

score 0 · Accepted Answer

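The error message (object has no attribute 'dtype') is interesting. dtype is a NumPy concept (it describes the data type of a NumPy array). Maybe that is a clue.

I'm not familiar with vaex, so I read their documentation. :-)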

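I noticed that you did not use the seperator parameter (note that this spelling comes from the documentation). If your values really are comma-separated, you need seperator=",".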
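Before passing a separator, it may be worth confirming which delimiter the file actually uses. Here is a minimal sketch using only the standard library's csv.Sniffer; the helper name sniff_delimiter, the sample size, and the list of candidate delimiters are my own assumptions, not anything from vaex:

```python
import csv


def sniff_delimiter(path, sample_chars=64 * 1024):
    """Guess the field delimiter from the first part of a text file.

    Hypothetical helper: reads only a sample, so it stays cheap
    even for a file with millions of rows.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        sample = handle.read(sample_chars)
    # Restrict the guess to a few common candidates (an assumption).
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    return dialect.delimiter
```

Calling sniff_delimiter('df.csv') should return the delimiter character, which can then be passed on as the separator. Sniffer raises csv.Error if it cannot decide, so treat the result as a hint rather than a guarantee.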

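If that doesn't work, this might help. The vaex 4.0.0-dev0 documentation shows other ways to read a CSV file and create an HDF5 file. Have you tried vx.from_ascii()? The documentation shows this approach: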

ds = vx.from_ascii("table.csv", seperator=",", names=["x", "y", "z"])

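Adding the names= parameter might help with the dtype message (if a compound array is being used). Following that example, this might work (you have to fill in the column names in the list yourself):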

# "col1", "col2", "col3" are placeholders: replace them with your real column names
df_chunk = vx.from_ascii('df.csv', seperator=",", names=["col1", "col2", "col3"], nrows=20_000_000)
df_chunk.export('df.hdf5')

Note: I removed the r from the filename string ('df.csv' instead of r'df.csv'). I'm not sure whether that matters in this case.
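On the r prefix: in Python, a raw string literal only changes how backslashes are interpreted, so for a name with no backslashes the two spellings give the same string. A small sketch to illustrate (the Windows-style path is a made-up example, not from the question):

```python
# Without backslashes, the raw and plain literals are identical.
plain_name = 'df.csv'
raw_name = r'df.csv'
same_without_backslash = (plain_name == raw_name)

# With a backslash, the prefix matters: '\n' in a plain literal is a newline.
raw_path = r'C:\new.csv'      # hypothetical path; backslash and 'n' kept as two characters
cooked_path = 'C:\new.csv'    # here '\n' collapses into a single newline character
differs_with_backslash = (raw_path != cooked_path)

print(same_without_backslash, differs_with_backslash)
```

So for 'df.csv' the prefix should make no difference; it only starts to matter for paths that contain backslashes.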
