python - How to make the pandas HDFStore 'put' operation faster

Question

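I am trying to build an ETL toolkit with pandas and HDF5.

My plan was:

1. extract a table from MySQL into a DataFrame;
2. put this DataFrame into an HDFStore.

But when I carried out step 2, I found that putting the DataFrame into the *.h5 file takes far too long.

Size of the table on the source MySQL server: 498 MB
- 52 columns
- 924,624 records
Size of the *.h5 file after putting the DataFrame into it: 513 MB
- the 'put' operation took 849.345677137 seconds

My questions are:
Is this time cost normal?
Is there any way to make it faster?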

Update 1

Thanks, Jeff.

My code is simple:

extract_store = HDFStore('extract_store.h5')
extract_store['df_staff'] = df_staff
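When I try 'ptdump -av file.h5' I get an error, but I can still load the DataFrame object from this h5 file:

tables.exceptions.HDF5ExtError: HDF5 error back trace

  File "../../../src/H5F.c", line 1512, in H5Fopen
    unable to open file
  File "../../../src/H5F.c", line 1307, in H5F_open
    unable to read superblock
  File "../../../src/H5Fsuper.c", line 305, in H5F_super_read
    unable to find file signature
  File "../../../src/H5Fsuper.c", line 153, in H5F_locate_signature
    unable to find a valid file signature

End of HDF5 error back trace

Unable to open/create file 'extract_store.h5'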

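Some other information:
- pandas version: '0.10.0'
- OS: Ubuntu Server 10.04 x86_64
- CPU: 8 * Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
- total memory: 51634016 kB

I will update pandas to 0.10.1-dev and try again.

Update 2

I have updated pandas to '0.10.1.dev-6e2b6ea',
but the time cost did not decrease; this time it took 884.15 s.
The output of 'ptdump -av file.h5' is: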

    / (RootGroup) ''
      /._v_attrs (AttributeSet), 4 attributes:
       [CLASS := 'GROUP',
        PYTABLES_FORMAT_VERSION := '2.0',
        TITLE := '',
        VERSION := '1.0']
    /df_bugs (Group) ''
      /df_bugs._v_attrs (AttributeSet), 12 attributes:
       [CLASS := 'GROUP',
        TITLE := '',
        VERSION := '1.0',
        axis0_variety := 'regular',
        axis1_variety := 'regular',
        block0_items_variety := 'regular',
        block1_items_variety := 'regular',
        block2_items_variety := 'regular',
        nblocks := 3,
        ndim := 2,
        pandas_type := 'frame',
        pandas_version := '0.10.1']
    /df_bugs/axis0 (Array(52,)) ''
      atom := StringAtom(itemsize=19, shape=(), dflt='')
      maindim := 0
      flavor := 'numpy'
      byteorder := 'irrelevant'
      chunkshape := None
      /df_bugs/axis0._v_attrs (AttributeSet), 7 attributes:
       [CLASS := 'ARRAY',
        FLAVOR := 'numpy',
        TITLE := '',
        VERSION := '2.3',
        kind := 'string',
        name := None,
        transposed := True]
    /df_bugs/axis1 (Array(924624,)) ''
      atom := Int64Atom(shape=(), dflt=0)
      maindim := 0
      flavor := 'numpy'
      byteorder := 'little'
      chunkshape := None
      /df_bugs/axis1._v_attrs (AttributeSet), 7 attributes:
       [CLASS := 'ARRAY',
        FLAVOR := 'numpy',
        TITLE := '',
        VERSION := '2.3',
        kind := 'integer',
        name := None,
        transposed := True]
    /df_bugs/block0_items (Array(5,)) ''
      atom := StringAtom(itemsize=12, shape=(), dflt='')
      maindim := 0
      flavor := 'numpy'
      byteorder := 'irrelevant'
      chunkshape := None
      /df_bugs/block0_items._v_attrs (AttributeSet), 7 attributes:
       [CLASS := 'ARRAY',
        FLAVOR := 'numpy',
        TITLE := '',
        VERSION := '2.3',
        kind := 'string',
        name := None,
        transposed := True]
    /df_bugs/block0_values (Array(924624, 5)) ''
      atom := Float64Atom(shape=(), dflt=0.0)
      maindim := 0
      flavor := 'numpy'
      byteorder := 'little'
      chunkshape := None
      /df_bugs/block0_values._v_attrs (AttributeSet), 5 attributes:
       [CLASS := 'ARRAY',
        FLAVOR := 'numpy',
        TITLE := '',
        VERSION := '2.3',
        transposed := True]
    /df_bugs/block1_items (Array(19,)) ''
      atom := StringAtom(itemsize=19, shape=(), dflt='')
      maindim := 0
      flavor := 'numpy'
      byteorder := 'irrelevant'
      chunkshape := None
      /df_bugs/block1_items._v_attrs (AttributeSet), 7 attributes:
       [CLASS := 'ARRAY',
        FLAVOR := 'numpy',
        TITLE := '',
        VERSION := '2.3',
        kind := 'string',
        name := None,
        transposed := True]
    /df_bugs/block1_values (Array(924624, 19)) ''
      atom := Int64Atom(shape=(), dflt=0)
      maindim := 0
      flavor := 'numpy'
      byteorder := 'little'
      chunkshape := None
      /df_bugs/block1_values._v_attrs (AttributeSet), 5 attributes:
       [CLASS := 'ARRAY',
        FLAVOR := 'numpy',
        TITLE := '',
        VERSION := '2.3',
        transposed := True]
    /df_bugs/block2_items (Array(28,)) ''
      atom := StringAtom(itemsize=18, shape=(), dflt='')
      maindim := 0
      flavor := 'numpy'
      byteorder := 'irrelevant'
      chunkshape := None
      /df_bugs/block2_items._v_attrs (AttributeSet), 7 attributes:
       [CLASS := 'ARRAY',
        FLAVOR := 'numpy',
        TITLE := '',
        VERSION := '2.3',
        kind := 'string',
        name := None,
        transposed := True]
    /df_bugs/block2_values (VLArray(1,)) ''
      atom = ObjectAtom()
      byteorder = 'irrelevant'
      nrows = 1
      flavor = 'numpy'
      /df_bugs/block2_values._v_attrs (AttributeSet), 5 attributes:
       [CLASS := 'VLARRAY',
        PSEUDOATOM := 'object',
        TITLE := '',
        VERSION := '1.3',
        transposed := True]

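I tried your code below (putting the DataFrame into the HDFStore with the parameter 'table' set to True), but got an error; it seems that Python's datetime type is not supported:

Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.datetime' has no len()

Update 3

Thanks, Jeff. Sorry for the delay.

PyTables version: '2.4.0'.
Yes, the 884 seconds is the cost of the put operation alone, without the pull from MySQL.
One row of the DataFrame (df.ix[0]):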

bug_id                 1
assigned_to            185
bug_file_loc           None
bug_severity           critical
bug_status             closed
creation_ts            1998-05-06 21:27:00
delta_ts               2012-05-09 14:41:41
short_desc             Two cursors.
host_op_sys            Unknown
guest_op_sys           Unknown
priority               P3
rep_platform           IA32
reporter               56
product_id             7
category_id            983
component_id           12925
resolution             fixed
target_milestone       ws1
qa_contact             412
status_whiteboard
votes                  0
keywords               SR
lastdiffed             2012-05-09 14:41:41
everconfirmed          1
reporter_accessible    1
cclist_accessible      1
estimated_time         0.00
remaining_time         0.00
deadline               None
alias                  None
found_in_product_id    0
found_in_version_id    0
found_in_phase_id      0
cf_type                defect
cf_reported_by         Development
cf_attempted           NaN
cf_failed              NaN
cf_public_summary
cf_doc_impact          0
cf_security            0
cf_build               NaN
cf_branch
cf_change              NaN
cf_test_id             NaN
cf_regression          Unknown
cf_reviewer            0
cf_on_hold             0
cf_public_severity     ---
cf_i18n_impact         0
cf_eta                 None
cf_bug_source          ---
cf_viss                None
Name: 0, Length: 52

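A picture of the DataFrame (just typing 'df' in the IPython notebook):

Int64Index: 924624 entries, 0 to 924623
Data columns:
bug_id                 924624  non-null values
assigned_to            924624  non-null values
bug_file_loc           427318  non-null values
bug_severity           924624  non-null values
bug_status             924624  non-null values
creation_ts            924624  non-null values
delta_ts               924624  non-null values
short_desc             924624  non-null values
host_op_sys            924624  non-null values
guest_op_sys           924624  non-null values
priority               924624  non-null values
rep_platform           924624  non-null values
reporter               924624  non-null values
product_id             924624  non-null values
category_id            924624  non-null values
component_id           924624  non-null values
resolution             924624  non-null values
target_milestone       924624  non-null values
qa_contact             924624  non-null values
status_whiteboard      924624  non-null values
votes                  924624  non-null values
keywords               924624  non-null values
lastdiffed             924509  non-null values
everconfirmed          924624  non-null values
reporter_accessible    924624  non-null values
cclist_accessible      924624  non-null values
estimated_time         924624  non-null values
remaining_time         924624  non-null values
deadline               0  non-null values
alias                  0  non-null values
found_in_product_id    924624  non-null values
found_in_version_id    924624  non-null values
found_in_phase_id      924624  non-null values
cf_type                924624  non-null values
cf_reported_by         924624  non-null values
cf_attempted           89622  non-null values
cf_failed              89587  non-null values
cf_public_summary      510799  non-null values
cf_doc_impact          924624  non-null values
cf_security            924624  non-null values
cf_build               327460  non-null values
cf_branch              614929  non-null values
cf_change              300612  non-null values
cf_test_id             12610  non-null values
cf_regression          924624  non-null values
cf_reviewer            924624  non-null values
cf_on_hold             924624  non-null values
cf_public_severity     924624  non-null values
cf_i18n_impact         924624  non-null values
cf_eta                 3910  non-null values
cf_bug_source          924624  non-null values
cf_viss                725  non-null values
dtypes: float64(5), int64(19), object(28)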

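After 'convert_objects()':

dtypes: datetime64[ns](2), float64(5), int64(19), object(26)

Putting the converted DataFrame into the HDFStore now costs 749.50 s :)
- it seems that reducing the number of 'object' dtype columns is the key to reducing the time cost
Putting the converted DataFrame into the HDFStore with the parameter 'table' set to True still returns that error: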

/usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2203                     raise
   2204                 except (Exception), detail:
-> 2205                     raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
   2206             j += 1
   2207

Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.datetime' has no len()

I am now trying to put the DataFrame without the datetime columns.

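Update 4

There are 4 columns of type datetime in MySQL:
- creation_ts
- delta_ts
- lastdiffed
- deadline

After calling convert_objects():

creation_ts:

Timestamp: 1998-05-06 21:27:00

delta_ts:

Timestamp: 2012-05-09 14:41:41

lastdiffed:

datetime.datetime(2012, 5, 9, 14, 41, 41)

deadline is always None, both before and after calling 'convert_objects':

None

Putting the DataFrame without the 'lastdiffed' column takes 691.75 s.
When putting the DataFrame without the 'lastdiffed' column and with the parameter 'table' set to True, I got a new error: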

/usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2203                     raise
   2204                 except (Exception), detail:
-> 2205                     raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
   2206             j += 1
   2207

Exception: cannot find the correct atom type -> [dtype->object] object of type 'Decimal' has no len()

The columns 'estimated_time', 'remaining_time', and 'cf_viss' are of type 'decimal' in MySQL.

Update 5

I have converted these 'decimal' columns to 'float' with the following code:

no_diffed_converted_df_bugs.estimated_time = no_diffed_converted_df_bugs.estimated_time.map(float)

Now the time cost is 372.84 s,
but the 'table' version of put still raises an error:

/usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2203                     raise
   2204                 except (Exception), detail:
-> 2205                     raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
   2206             j += 1
   2207

Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.date' has no len()

score 4 · Accepted Answer

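I am pretty convinced that your issue is related to the type mapping between the actual types in the DataFrame and how PyTables stores them.

- Simple types with a fixed representation (floats/ints/bools) are mapped to fixed C types.
- Datetimes are handled if they can be converted properly, that is, if they have a dtype of 'datetime64[ns]'. Notably, datetime.date objects are NOT handled. (NaNs are a different story and, depending on usage, can cause the entire column type to be mishandled.)
- Strings are mapped: Storer objects map them to the Object type, Tables map them to String types.
- Unicode is not handled.
- All other types are handled as Object in Storers, or raise an Exception for Tables.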
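
The mapping above can be checked against a concrete frame before writing it. The following is a minimal sketch, assuming a recent pandas; the helper name `summarize_storage_types` and the sample values are hypothetical and are not part of pandas or of this answer. For every object column it reports which Python types the column actually holds, which is what decides whether the column can be stored as a fixed type.

```python
import datetime
from decimal import Decimal

import pandas as pd


def summarize_storage_types(df):
    """Return {column: description} of how each column is typed.

    Hypothetical helper: non-object columns report their dtype name;
    object columns report the sorted names of the Python types found
    among their non-null values.
    """
    summary = {}
    for name in df.columns:
        col = df[name]
        if not pd.api.types.is_object_dtype(col):
            summary[name] = str(col.dtype)
            continue
        type_names = sorted({type(v).__name__ for v in col.dropna()})
        summary[name] = "object[" + ", ".join(type_names) + "]"
    return summary


# Small made-up frame shaped like the question's data. dtype=object is
# forced so that pandas keeps the raw Python objects.
sample = pd.DataFrame(
    {
        "bug_id": [1, 2],
        "bug_status": pd.Series(["closed", "open"], dtype=object),
        "estimated_time": [Decimal("0.00"), Decimal("1.50")],
        "lastdiffed": pd.Series(
            [datetime.datetime(2012, 5, 9, 14, 41, 41), None], dtype=object
        ),
    }
)

result = summarize_storage_types(sample)
for column, kind in result.items():
    print(column, "->", kind)
```

Columns reported as holding anything other than `str` are the ones that would end up as Object in a Storer or raise in a Table.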

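What this means is that if you are doing a put to a Storer (a fixed representation), all of the non-mappable types become Object, and PyTables pickles these columns. See the reference below for ObjectAtom: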

http://pytables.github.com/usersguide/libref/declarative_classes.html#the-atom-class-and-its-descendants
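
To get a feel for why pickled object columns are costly, the sketch below compares the size of the same numbers held as a fixed-width float64 array and as a NumPy object array of `Decimal` values. It uses `pickle` and NumPy directly as a stand-in for what happens to an ObjectAtom column; it does not call PyTables, and the array length is an arbitrary assumption.

```python
import pickle
from decimal import Decimal

import numpy as np

n = 10_000  # arbitrary size chosen for illustration

# Fixed-width representation: 8 bytes per value in one contiguous buffer.
as_float = np.arange(n, dtype="float64") / 100
raw_size = as_float.nbytes

# Object representation: every element is a separate Python object that
# has to be serialized one by one.
as_object = np.array([Decimal(i) / 100 for i in range(n)], dtype=object)
pickled_size = len(pickle.dumps(as_object, protocol=pickle.HIGHEST_PROTOCOL))

print("float64 bytes:       ", raw_size)
print("pickled object bytes:", pickled_size)
```

Besides the larger payload, the per-element serialization runs in Python rather than as a single buffer copy, which is where most of the time goes.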

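Table will raise on an invalid type (I should provide a better error message here). I think I will also provide a warning if you try to store a type that is mapped to ObjectAtom (for performance reasons).

To force some types, try some of these: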

import pandas as pd

# convert None to NaN (the Series is currently object dtype);
# the result is float64 (or the type of the other values)
x = pd.Series([None])
x = x.where(pd.notnull(x)).convert_objects()

# convert datetime-like values with embedded NaNs to datetime64[ns]
df['foo'] = pd.Series(df['foo'].values, dtype='M8[ns]')
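
`convert_objects` was removed in later pandas releases. A rough equivalent of the two conversions above, assuming a recent pandas, uses `pd.to_numeric` and `pd.to_datetime`; the column name `foo` and the sample values are placeholders.

```python
import datetime

import pandas as pd

# None in an object Series becomes NaN once the Series is numeric.
x = pd.Series([None, 1, 2], dtype=object)
x = pd.to_numeric(x, errors="coerce")
print(x.dtype)

# Datetime-like objects with embedded missing values become a datetime64
# column; the missing entries turn into NaT.
df = pd.DataFrame(
    {
        "foo": pd.Series(
            [datetime.datetime(2012, 5, 9, 14, 41, 41), None], dtype=object
        )
    }
)
df["foo"] = pd.to_datetime(df["foo"], errors="coerce")
print(df["foo"].dtype)
```

With `errors="coerce"`, values that cannot be parsed are also turned into NaN or NaT instead of raising, so it is worth checking the null counts afterwards.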

Here is a sample on 64-bit Linux (the file is 1M rows, about 1 GB on disk):

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: pd.__version__
Out[3]: '0.10.1.dev'

In [3]: import tables

In [4]: tables.__version__
Out[4]: '2.3.1'

In [4]: df = pd.DataFrame(np.random.randn(1000 * 1000, 100), index=range(int(
   ...: 1000 * 1000)), columns=['E%03d' % i for i in xrange(100)])

In [5]: for x in range(20):
   ...:     df['String%03d' % x] = 'string%03d' % x

In [6]: df
Out[6]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Columns: 120 entries, E000 to String019
dtypes: float64(100), object(20)

# storer put (cannot query) 
In [9]: def test_put():
   ...:     store = pd.HDFStore('test_put.h5','w')
   ...:     store['df'] = df
   ...:     store.close()

In [10]: %timeit test_put()
1 loops, best of 3: 7.65 s per loop

# table put (can query)
In [7]: def test_put():
      ....:     store = pd.HDFStore('test_put.h5','w')
      ....:     store.put('df',df,table=True)
      ....:     store.close()


In [8]: %timeit test_put()
1 loops, best of 3: 21.4 s per loop
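
For comparison with the question, the timings can be turned into rough write throughput. This is only back-of-the-envelope arithmetic: it assumes the "about 1 GB" file is 1024 MB and applies that size to both the storer and the table timing, while the question's figures (513 MB, 849.345677137 s) are taken as reported. The machines also differ, so only the order of magnitude is meaningful.

```python
def throughput_mb_per_s(size_mb, seconds):
    """Rough write throughput in MB/s."""
    return size_mb / seconds


question_put = throughput_mb_per_s(513, 849.345677137)  # question's storer put
benchmark_storer = throughput_mb_per_s(1024, 7.65)      # assumed 1024 MB file
benchmark_table = throughput_mb_per_s(1024, 21.4)       # assumed 1024 MB file

print("question put:     %.2f MB/s" % question_put)
print("benchmark storer: %.1f MB/s" % benchmark_storer)
print("benchmark table:  %.1f MB/s" % benchmark_table)
```

Under these assumptions the benchmark writes tens to more than a hundred MB per second, while the question's put runs at well under 1 MB/s, which is consistent with the object columns being the bottleneck rather than HDF5 itself.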

score 2 · Accepted Answer

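How to make this faster?

- Use 'io.sql.read_frame' to load the data from the SQL database into a DataFrame, because 'read_frame' handles columns of type 'decimal' by converting them to float.
- Fill the missing data in each column.
- Call 'DataFrame.convert_objects' before doing the put.
- If the DataFrame has string columns, use 'table' instead of 'storer':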

store.put('key', df, table=True)
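
The steps above can be sketched as follows for a frame that is already in memory. This assumes a recent pandas, with PyTables installed for the final write; `prepare_for_hdf`, `put_as_table`, and the sample values are hypothetical names for illustration, and `format='table'` is the newer spelling of the `table=True` argument used above.

```python
import datetime
from decimal import Decimal

import pandas as pd


def prepare_for_hdf(df):
    """Return a copy of df with object columns coerced to storable dtypes.

    Hypothetical helper: Decimal columns become float64, date/datetime
    columns become datetime64, all-missing columns become float NaN, and
    remaining object columns become strings with missing values filled
    by an empty string.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if not pd.api.types.is_object_dtype(col):
            continue
        present = col.dropna()
        if present.empty:
            out[name] = pd.Series(float("nan"), index=col.index, dtype="float64")
        elif present.map(lambda v: isinstance(v, Decimal)).all():
            # reindex restores the dropped rows as NaN
            out[name] = present.map(float).reindex(col.index)
        elif present.map(lambda v: isinstance(v, datetime.date)).all():
            out[name] = pd.to_datetime(col, errors="coerce")
        else:
            out[name] = col.fillna("").astype(str)
    return out


def put_as_table(df, path, key):
    """Write df in the queryable table format (requires PyTables)."""
    with pd.HDFStore(path, mode="w") as store:
        store.put(key, df, format="table")


# Made-up rows using column names from the question.
raw = pd.DataFrame(
    {
        "estimated_time": [Decimal("0.00"), Decimal("1.50")],
        "lastdiffed": pd.Series([datetime.date(2012, 5, 9), None], dtype=object),
        "cf_branch": pd.Series(["main", None], dtype=object),
        "deadline": pd.Series([None, None], dtype=object),
    }
)
clean = prepare_for_hdf(raw)
print(clean.dtypes)
```

A call such as `put_as_table(clean, 'extract_store.h5', 'df_bugs')` would then write the cleaned frame; it is left out of the runnable part so the sketch works without PyTables.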

After doing all this, the performance of the put operation improved considerably on the same data set:

CPU times: user 42.07 s, sys: 28.17 s, total: 70.24 s
Wall time: 98.97 s

Profile log of the second test:

         95984 function calls (95958 primitive calls) in 68.688 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      445   16.757    0.038   16.757    0.038 {numpy.core.multiarray.array}
       19   16.250    0.855   16.250    0.855 {method '_append_records' of 'tables.tableExtension.Table' objects}
       16    7.958    0.497    7.958    0.497 {method 'astype' of 'numpy.ndarray' objects}
       19    6.533    0.344    6.533    0.344 {pandas.lib.create_hdf_rows_2d}
        4    6.284    1.571    6.388    1.597 {method '_fillCol' of 'tables.tableExtension.Row' objects}
       20    2.640    0.132    2.641    0.132 {pandas.lib.maybe_convert_objects}
        1    1.785    1.785    1.785    1.785 {pandas.lib.isnullobj}
        7    1.619    0.231    1.619    0.231 {method 'flatten' of 'numpy.ndarray' objects}
       11    1.059    0.096    1.059    0.096 {pandas.lib.infer_dtype}
        1    0.997    0.997   41.952   41.952 pytables.py:2468(write_data)
       19    0.985    0.052   40.590    2.136 pytables.py:2504(write_data_chunk)
        1    0.827    0.827   60.617   60.617 pytables.py:2433(write)
     1504    0.592    0.000    0.592    0.000 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
        4    0.534    0.133   13.676    3.419 pytables.py:1038(set_atom)
        1    0.528    0.528    0.528    0.528 {pandas.lib.max_len_string_array}
        4    0.441    0.110    0.571    0.143 internals.py:1409(_stack_arrays)
       35    0.358    0.010    0.358    0.010 {method 'copy' of 'numpy.ndarray' objects}
        1    0.276    0.276    3.135    3.135 internals.py:208(fillna)
        5    0.263    0.053    2.054    0.411 common.py:128(_isnull_ndarraylike)
       48    0.253    0.005    0.253    0.005 {method '_append' of 'tables.hdf5Extension.Array' objects}
        4    0.240    0.060    1.500    0.375 internals.py:1400(_simple_blockify)
        1    0.234    0.234   12.145   12.145 pytables.py:1066(set_atom_string)
       28    0.225    0.008    0.225    0.008 {method '_createCArray' of 'tables.hdf5Extension.Array' objects}
       36    0.218    0.006    0.218    0.006 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
     6110    0.155    0.000    0.155    0.000 {numpy.core.multiarray.empty}
        4    0.097    0.024    0.097    0.024 {method 'all' of 'numpy.ndarray' objects}
        6    0.084    0.014    0.084    0.014 {tables.indexesExtension.keysort}
       18    0.084    0.005    0.084    0.005 {method '_g_close' of 'tables.hdf5Extension.Leaf' objects}
    11816    0.064    0.000    0.108    0.000 file.py:1036(_getNode)
       19    0.053    0.003    0.053    0.003 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
     1528    0.045    0.000    0.098    0.000 array.py:342(_interpret_indexing)
    11709    0.040    0.000    0.042    0.000 file.py:248(__getitem__)
        2    0.027    0.013    0.383    0.192 index.py:1099(get_neworder)
        1    0.018    0.018    0.018    0.018 {numpy.core.multiarray.putmask}
        4    0.013    0.003    0.017    0.004 index.py:607(final_idx32)
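
A log in this shape can be produced with the standard library profiler. The sketch below is one way to do it, not necessarily how the log above was generated; `workload` is a placeholder standing in for the real `store.put(...)` call.

```python
import cProfile
import io
import pstats


def workload():
    # Placeholder for the real work, e.g. the store.put(...) call.
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
# "tottime" sorts by time spent inside each function itself, which pstats
# labels "internal time" in the report header.
stats.sort_stats("tottime").print_stats(10)
report = buffer.getvalue()
print(report)
```

Sorting by internal time puts the functions that do the actual work at the top, which makes conversion steps such as `astype` or string handling easy to spot.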