TypeError: Cannot serialize the column [date] because its data contents are [empty] object dtype.
Hello SO! I currently have two large HDFStores, each containing a single node; neither node fits in memory. The nodes don't contain NaN values. Now I would like to merge these two nodes using this. I first tested it on a small store where all the data fits in one chunk, and that worked fine. But for the case where it has to merge chunk by chunk, it gives me the following error: TypeError: Cannot serialize the column [date] because its data contents are [empty] object dtype.
This is the code I'm running:
>>> import pandas as pd
>>> from pandas import HDFStore
>>> print pd.__version__
0.12.0rc1
>>> h5_1 = 'I:/Data/output/test8\\var1.h5'
>>> h5_3 = 'I:/Data/output/test8\\var3.h5'
>>> h5_1temp = h5_1.replace('.h5','temp.h5')
>>> A = HDFStore(h5_1)
>>> B = HDFStore(h5_3)
>>> Atemp = HDFStore(h5_1temp)
>>> print A
<class 'pandas.io.pytables.HDFStore'>
File path: I:/Data/output/test8\var1.h5
/var1 frame_table (shape->12626172)
>>> print B
<class 'pandas.io.pytables.HDFStore'>
File path: I:/Data/output/test8\var3.h5
/var3 frame_table (shape->6313086)
>>> nrows_a = A.get_storer('var1').nrows
>>> nrows_b = B.get_storer('var3').nrows
>>> a_chunk_size = 500000
>>> b_chunk_size = 500000
>>> for a in xrange(int(nrows_a / a_chunk_size) + 1):
...     a_start_i = a * a_chunk_size
...     a_stop_i = min((a + 1) * a_chunk_size, nrows_a)
...     a = A.select('var1', start=a_start_i, stop=a_stop_i)
...     for b in xrange(int(nrows_b / b_chunk_size) + 1):
...         b_start_i = b * b_chunk_size
...         b_stop_i = min((b + 1) * b_chunk_size, nrows_b)
...         b = B.select('var3', start=b_start_i, stop=b_stop_i)
...         Atemp.append('mergev13', pd.merge(a, b, left_index=True, right_index=True, how='inner'))
...
Traceback (most recent call last):
File "<interactive input>", line 9, in <module>
File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 658, in append
self._write_to_group(key, value, table=True, append=True, **kwargs)
File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 923, in _write_to_group
s.write(obj = value, append=append, complib=complib, **kwargs)
File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 3251, in write
return super(AppendableMultiFrameTable, self).write(obj=obj.reset_index(), data_columns=data_columns, **kwargs)
File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 2983, in write
**kwargs)
File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 2715, in create_axes
raise e
TypeError: Cannot serialize the column [date] because its data contents are [empty] object dtype
One thing I noticed: the attributes below mention pandas_version := '0.10.1', even though my pandas version is 0.12.0rc1 (I assume that attribute refers to the HDF table format version rather than the installed library version, but I'm noting it just in case). Furthermore, here is some more specific information about the nodes:
>>> A.select_column('var1','date').unique()
array([2006001, 2006009, 2006017, 2006025, 2006033, 2006041, 2006049,
2006057, 2006065, 2006073, 2006081, 2006089, 2006097, 2006105,
2006113, 2006121, 2006129, 2006137, 2006145, 2006153, 2006161,
2006169, 2006177, 2006185, 2006193, 2006201, 2006209, 2006217,
2006225, 2006233, 2006241, 2006249, 2006257, 2006265, 2006273,
2006281, 2006289, 2006297, 2006305, 2006313, 2006321, 2006329,
2006337, 2006345, 2006353, 2006361], dtype=int64)
>>> B.select_column('var3','date').unique()
array([2006001, 2006017, 2006033, 2006049, 2006065, 2006081, 2006097,
2006113, 2006129, 2006145, 2006161, 2006177, 2006193, 2006209,
2006225, 2006241, 2006257, 2006273, 2006289, 2006305, 2006321,
2006337, 2006353], dtype=int64)
>>> A.get_storer('var1').levels
['x', 'y', 'date']
>>> A.get_storer('var1').attrs
/var1._v_attrs (AttributeSet), 12 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['date', 'y', 'x'],
index_cols := [(0, 'index')],
levels := ['x', 'y', 'date'],
nan_rep := 'nan',
non_index_axes := [(1, ['x', 'y', 'date', 'var1'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_multiframe',
values_cols := ['values_block_0', 'date', 'y', 'x']]
>>> A.get_storer('var1').table
/var1/table (Table(12626172,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"date": Int64Col(shape=(), dflt=0, pos=2),
"y": Int64Col(shape=(), dflt=0, pos=3),
"x": Int64Col(shape=(), dflt=0, pos=4)}
byteorder := 'little'
chunkshape := (3276,)
autoIndex := True
colindexes := {
"date": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"y": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"x": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
>>> B.get_storer('var3').levels
['x', 'y', 'date']
>>> B.get_storer('var3').attrs
/var3._v_attrs (AttributeSet), 12 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['date', 'y', 'x'],
index_cols := [(0, 'index')],
levels := ['x', 'y', 'date'],
nan_rep := 'nan',
non_index_axes := [(1, ['x', 'y', 'date', 'var3'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_multiframe',
values_cols := ['values_block_0', 'date', 'y', 'x']]
>>> B.get_storer('var3').table
/var3/table (Table(6313086,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"date": Int64Col(shape=(), dflt=0, pos=2),
"y": Int64Col(shape=(), dflt=0, pos=3),
"x": Int64Col(shape=(), dflt=0, pos=4)}
byteorder := 'little'
chunkshape := (3276,)
autoIndex := True
colindexes := {
"date": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"y": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"x": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
>>> print Atemp
<class 'pandas.io.pytables.HDFStore'>
File path: I:/Data/output/test8\var1temp.h5
/mergev13 frame_table (shape->823446)
Since the chunk size is 500000 and the shape of the node in Atemp is 823446, at least one chunk pair was merged and appended successfully. But I cannot figure out where the error comes from, and I have run out of clues as to where exactly it goes wrong. Any help is very much appreciated.
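One guess on my part: var1 and var3 don't cover the same dates, so some chunk pairs may have no overlapping index values at all. In that case pd.merge returns an empty DataFrame, and appending that empty result could be what triggers the [empty] object dtype error. A minimal sketch of a guard for the last line of the inner loop, assuming this hypothesis is right:

merged = pd.merge(a, b, left_index=True, right_index=True, how='inner')
if len(merged) > 0:  # skip chunk pairs that have no overlapping rows
    Atemp.append('mergev13', merged)

I haven't been able to confirm that this is the actual cause, though.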
EDIT
By reducing the chunk size of my test store I get the same error. Not good, of course, but it now gives me the possibility to share it. Click here for the code + HDFStores.
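For reference, here is my attempt at a self-contained reproduction of the empty-merge hypothesis above (the file name repro.h5 is just an example); it mimics two chunks whose MultiIndexes don't overlap:

import pandas as pd

# two frames indexed by (x, y, date) that share no index values
idx1 = pd.MultiIndex.from_tuples([(1, 1, 2006001)], names=['x', 'y', 'date'])
idx2 = pd.MultiIndex.from_tuples([(2, 2, 2006009)], names=['x', 'y', 'date'])
df1 = pd.DataFrame({'var1': [0.5]}, index=idx1)
df2 = pd.DataFrame({'var3': [0.7]}, index=idx2)

# inner merge on the index -> empty result
merged = pd.merge(df1, df2, left_index=True, right_index=True, how='inner')
print len(merged)  # 0

store = pd.HDFStore('repro.h5')
# does appending the empty merge result raise the same TypeError?
store.append('mergev13', merged)
store.close()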