
I am trying to follow the example in this notebook.

As suggested in this GitHub thread:

  1. I have raised the ulimit to 9999.
  2. I have converted the CSV files to HDF5.
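
The first step can be checked from inside Python. This is only a sketch: it assumes that "ulimit" here means the limit on open file descriptors (`ulimit -n`) and that the code runs on a Unix-like system, where the standard-library `resource` module is available. The helper names `open_file_limits` and `raise_soft_limit` are made up for illustration.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Raise the soft limit toward `target`, never above the hard limit.

    Returns the soft limit that is in effect afterwards.
    """
    soft, hard = open_file_limits()
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough, nothing to change
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target
```

Calling `open_file_limits()` in the same session that runs vaex shows whether the 9999 setting actually reached the Python process, since a limit raised in one shell does not carry over to processes started elsewhere.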

My code fails when I try to open a single HDF5 file into a DataFrame:

df = vaex.open('data/chat_history_00.hdf5')

Here is the rest of the code:

import re
import glob
import vaex
import numpy as np

def tryint(s):
    try:
        return int(s)
    except ValueError:
        return s

def alphanum_key(s):
    """ Turn a string into a list of string and number chunks.
        "z23a" -> ["z", 23, "a"]
    """
    return [ tryint(c) for c in re.split('([0-9]+)', s) ]

hdf5_list = glob.glob('data/*.hdf5')
hdf5_list.sort(key=alphanum_key)
hdf5_list = np.array(hdf5_list)

assert len(hdf5_list) == 11, "Incorrect number of files"

# Check how the single file looks like:
df = vaex.open('data/chat_history_10.hdf5')
df

The resulting error:

ERROR:MainThread:vaex:error opening 'data/chat_history_00.hdf5'
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
      1 # Check how the single file looks like:
----> 2 df = vaex.open('data/chat_history_10.hdf5')
      3 df

/usr/local/anaconda3/lib/python3.7/site-packages/vaex/__init__.py in open(path, convert, shuffle, copy_index, *args, **kwargs)
    207             ds = from_csv(path, copy_index=copy_index, **kwargs)
    208         else:
--> 209             ds = vaex.file.open(path, *args, **kwargs)
    210         if convert and ds:
    211             ds.export_hdf5(filename_hdf5, shuffle=shuffle)

/usr/local/anaconda3/lib/python3.7/site-packages/vaex/file/__init__.py in open(path, *args, **kwargs)
     39             break
     40     if dataset_class:
---> 41         dataset = dataset_class(path, *args, **kwargs)
     42     return dataset
     43

/usr/local/anaconda3/lib/python3.7/site-packages/vaex/hdf5/dataset.py in __init__(self, filename, write)
     84         self.h5table_root_name = None
     85         self._version = 1
---> 86         self._load()
     87
     88     def write_meta(self):

/usr/local/anaconda3/lib/python3.7/site-packages/vaex/hdf5/dataset.py in _load(self)
    182     def _load(self):
    183         if "data" in self.h5file:
--> 184             self._load_columns(self.h5file["/data"])
    185             self.h5table_root_name = "/data"
    186         if "table" in self.h5file:

/usr/local/anaconda3/lib/python3.7/site-packages/vaex/hdf5/dataset.py in _load_columns(self, h5data, first)
    348                             self.add_column(column_name, self._map_hdf5_array(data, column['mask']))
    349                         else:
--> 350                             self.add_column(column_name, self._map_hdf5_array(data))
    351                     else:
    352                         transposed = shape[1] < shape[0]

/usr/local/anaconda3/lib/python3.7/site-packages/vaex/dataframe.py in add_column(self, name, f_or_array, dtype)
   2929                 if len(self) == len(ar):
   2930                     raise ValueError("Array is of length %s, while the length of the DataFrame is %s due to the filtering, the (unfiltered) length is %s." % (len(ar), len(self), self.length_unfiltered()))
-> 2931                 raise ValueError("array is of length %s, while the length of the DataFrame is %s" % (len(ar), self.length_original()))
   2932             # assert self.length_unfiltered() == len(data), "columns should be of equal length, length should be %d, while it is %d" % ( self.length_unfiltered(), len(data))
   2933             valid_name = vaex.utils.find_valid_name(name)

ValueError: array is of length 2578961, while the length of the DataFrame is 6

What does this mean, and how can I fix it? All of the files have 6 columns.

Edit: this is how I created the HDF5 files:

pd.read_csv(r'G:/path/to/file/data/chat_history-00.csv').to_hdf(r'data/chat_history_00.hdf5', key='data')

1 Answer


Jovan from vaex answered this question on GitHub:

If you want to read the data with vaex in a memory-mapped way, you should not use pandas .to_hdf. See this link for more details.
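
One way to see why a pandas-written file trips up vaex is to list the datasets stored in it. The sketch below assumes h5py is installed; `dataset_shapes` and `first_length_mismatch` are hypothetical helper names, not part of vaex or h5py. It also rests on two assumptions that I have not verified against the vaex source: that pandas' default `to_hdf` layout stores the column labels (`axis0`), the row index (`axis1`) and the value blocks as separate datasets under the key, and that vaex treats every dataset under `/data` as a column. If both hold, the first "column" has length 6 (the six column names) and the next one has length 2578961 (the row count), which would match the error message.

```python
def dataset_shapes(path):
    """Map every dataset path in an HDF5 file to its shape."""
    import h5py  # third-party; imported lazily so the helper below works without it

    shapes = {}
    with h5py.File(path, "r") as f:
        def record(name, obj):
            # visititems also visits groups; only datasets have a useful shape here
            if isinstance(obj, h5py.Dataset):
                shapes[name] = obj.shape
        f.visititems(record)
    return shapes


def first_length_mismatch(shapes):
    """Return (name, length, expected) for the first dataset whose length
    differs from that of the first dataset, or None if all lengths agree."""
    expected = None
    for name, shape in shapes.items():
        if not shape:  # skip scalar datasets
            continue
        if expected is None:
            expected = shape[0]
        elif shape[0] != expected:
            return name, shape[0], expected
    return None
```

Running `first_length_mismatch(dataset_shapes('data/chat_history_00.hdf5'))` on a file written by `to_hdf` should then point at the dataset whose length disagrees with the first one.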

I used this instead:

vdf = vaex.from_pandas(df, copy_index=False)
vdf.export_hdf5('chat_history_00.hdf5')
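
The same two calls can be wrapped up to convert every CSV file in one go. This is a minimal sketch, assuming pandas and vaex are installed and that each CSV fits in memory while it is being converted; `hdf5_target` and `convert_csv` are hypothetical helper names, and the `data` output directory is taken from the paths in the question. The output file keeps the CSV's base name, so rename it afterwards if you need a different naming scheme.

```python
from pathlib import Path


def hdf5_target(csv_path, out_dir="data"):
    """Build the output path: same base name as the CSV, with an .hdf5 suffix."""
    return Path(out_dir) / (Path(csv_path).stem + ".hdf5")


def convert_csv(csv_path, out_dir="data"):
    """Read one CSV with pandas and export it in vaex's own HDF5 layout."""
    # Third-party imports are kept local so hdf5_target stays usable without them.
    import pandas as pd
    import vaex

    target = hdf5_target(csv_path, out_dir)
    df = pd.read_csv(csv_path)
    vdf = vaex.from_pandas(df, copy_index=False)
    vdf.export_hdf5(str(target))
    return target
```

After looping `convert_csv` over the CSV files, `vaex.open` on any of the resulting files should memory-map it rather than raise the length error.
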
Answered 2020-01-26T21:18:34.633