python - Pandas PyTables warnings and slow performance

Question

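I have been testing pandas and PyTables on some large financial datasets and have run into a real stumbling block:

When storing data in a PyTables file, pandas appears to lay out multidimensional data in very long rows rather than in columns.

Try this: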

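import numpy as np
import pandas as pd

n = 100000000  # 100 million rows per column, as in the original test
df = pd.DataFrame({'col1': np.random.randn(n), 'col2': np.random.randn(n)})
store = pd.HDFStore('test.h5')
store['data'] = df  # a warning about exceeding the maximum recommended rowsize should appear here
store.handle        # in an interactive session, shows the underlying PyTables file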

Output:

File(filename=test7.h5, title='', mode='a', rootUEP='/', filters=Filters(complevel=0, shuffle=False, fletcher32=False))
/ (RootGroup) ''
/data (Group) ''
/data/axis0 (Array(2,)) ''
  atom := StringAtom(itemsize=4, shape=(), dflt='')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/data/axis1 (Array(100000000,)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/data/block0_items (Array(2,)) ''
  atom := StringAtom(itemsize=4, shape=(), dflt='')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/data/block0_values (Array(2, 100000000)) ''
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None

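I am not entirely sure, but taken together with the warning message, I think Array(2, 100000000) means a two-dimensional array with 2 rows and 100,000,000 columns. That is also how it is displayed in HDFView.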
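To make the shape reading above concrete, here is a minimal NumPy-only sketch of what a (2, N) layout means relative to the DataFrame. It does not touch HDF5 at all; `n_rows`, `frame_like`, and `block_values` are illustrative names, and the small row count is an assumed stand-in for the 100,000,000 rows in the question. Whether this layout is what causes the slowdown is not established here.

```python
import numpy as np

n_rows = 1000  # assumed small stand-in for the 100,000,000 rows in the question

# DataFrame-style view: one row per observation, one column per field.
frame_like = np.random.randn(n_rows, 2)

# Shape reported for /data/block0_values: one row per column name,
# i.e. the transpose of the DataFrame-style view.
block_values = np.ascontiguousarray(frame_like.T)

print(block_values.shape)    # (2, n_rows): 2 very long rows
print(block_values.strides)  # bytes to step along each axis

# The first five DataFrame rows are a slice along the long axis of the block.
first_five = block_values[:, :5].T
print(first_five.shape)
```

In this layout, each DataFrame column is one contiguous run of `n_rows` values, so the first few DataFrame rows are spread across both long rows of the stored array.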

I have been seeing extremely poor performance (in some cases, data['ticks'].head() takes 10 seconds). What is causing this?

score 4 · Accepted Answer

I have cross-linked this question on GitHub:

http://github.com/pydata/pandas/issues/1824

I was not personally aware of this problem, and frankly it is a bit disappointing on the part of PyTables or HDF5 (whichever is the culprit).
