python - OverflowError with Pandas to_hdf

Question

Python newbie here.

I'm trying to save a large data frame to an HDF file with lz4 compression, using to_hdf.

I'm using Windows 10, Python 3, and Pandas 0.20.2.

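I get the error "OverflowError: Python int too large to convert to C long".

None of the machine's resources are close to their limits (RAM, CPU, swap usage).

Previous posts discuss the dtype, but the following example shows that there is some other problem, possibly related to the size?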

import numpy as np
import pandas as pd


# sample dataframe to be saved, pardon my French 
n=500*1000*1000
df= pd.DataFrame({'col1':[999999999999999999]*n,
                  'col2':['aaaaaaaaaaaaaaaaa']*n,
                  'col3':[999999999999999999]*n,
                  'col4':['aaaaaaaaaaaaaaaaa']*n,
                  'col5':[999999999999999999]*n,
                  'col6':['aaaaaaaaaaaaaaaaa']*n})

# works fine
lim=200*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')

# works fine
lim=300*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')


# Error
lim=400*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')


....
OverflowError: Python int too large to convert to C long

score 3 · Accepted Answer

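I ran into the same issue, and it does seem to be connected to the size of the data frame rather than to the dtype (I had all the columns stored as strings and was able to store each of them to .h5 separately).

The solution that worked for me was to use mode='a'. As the pandas documentation puts it: mode {'a', 'w', 'r+'}, default 'a'. 'a': append; an existing file is opened for reading and writing, and if the file does not exist it is created.

So the sample code would look like: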

batch_size = 1000
for i, df_chunk in df.groupby(np.arange(df.shape[0]) // batch_size):
    df_chunk.to_hdf('df.h5','table', complib= 'blosc:lz4', mode='a')
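
A note on the loop above: every call writes to the same key, 'table'. One possible variant, sketched here under assumptions, keeps a single key and asks pandas to add rows to it by passing format='table' together with append=True, both of which are documented to_hdf parameters. The helper name append_in_chunks and its arguments are hypothetical, and this sketch has not been run against the 400-million-row frame from the question, so treat it as a starting point rather than a confirmed fix.

```python
import pandas as pd


def append_in_chunks(df, path, key, batch_size, **hdf_kwargs):
    """Write df to one HDF key in row slices of at most batch_size rows.

    Hypothetical helper: the first slice creates the file (mode='w'),
    later slices are appended to the same table (mode='a', append=True).
    Extra keyword arguments are forwarded to DataFrame.to_hdf.
    """
    for n, start in enumerate(range(0, len(df), batch_size)):
        chunk = df.iloc[start:start + batch_size]
        chunk.to_hdf(
            path,
            key=key,
            mode="w" if n == 0 else "a",
            format="table",   # the appendable on-disk format
            append=True,      # add rows instead of replacing the key
            **hdf_kwargs,
        )
```

Compression arguments, for example complib='blosc:lz4' as in the question, can be passed through hdf_kwargs. With string columns, the table format fixes the column width from the first chunk, so a min_itemsize argument may be needed if later chunks hold longer strings.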

score 2 · Accepted Answer

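As @Giovanni Maria Strampelli pointed out, @Artem Snorkovenko's answer only saves the last batch. The Pandas documentation states the following:

In order to add another DataFrame or Series to an existing HDF file, please use append mode and a different key.

Here is a possible workaround that saves all batches (adapted from @Artem Snorkovenko's answer):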

for i in range(len(df)):
    sr = df.loc[i] #pandas series object for the given index
    sr.to_hdf('df.h5', key='table_%i'%i, complib='blosc:lz4', mode='a')

This code saves each Pandas Series object under a different key. Each key is indexed by i.

To load the existing .h5 file after saving, one can do the following:

i = 0
dfdone = False #if True, all keys in the .h5 file are successfully loaded.
srl = [] #df series object list
while not dfdone:
    #print(i) #this is to see if code is working properly.
    try: #check whether current i value exists in the keys of the .h5 file
        sdfr = pd.read_hdf('df.h5', key='table_%i'%i) #Current series object
        srl.append(sdfr) #append each series to a list to create the dataframe in the end.
        i += 1 #increment i by 1 after loading the series object
    except KeyError: #no such key: current i value exceeds the number of keys, all keys are loaded.
        dfdone = True #Terminate the while loop.

df = pd.DataFrame(srl) #Generate the dataframe from the list of series objects.

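I used a while loop on the assumption that we do not know the exact length of the data frame in the .h5 file. If the length is known, a for loop can be used instead.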
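
When the length is not known, another option is to ask the file which keys it holds rather than probing until a read fails. The following is a minimal sketch, assuming the rows were saved under keys table_0, table_1, and so on, as in the code above. The helper name load_series_rows and its prefix parameter are made up for illustration. HDFStore.keys() returns the key names with a leading slash, which the sketch strips.

```python
import pandas as pd


def load_series_rows(path, prefix="table_"):
    """Rebuild a DataFrame from Series stored under prefix0, prefix1, ...

    Hypothetical helper: lists the keys in the file, keeps those that
    look like '<prefix><number>', and reads them in numeric order.
    """
    with pd.HDFStore(path, mode="r") as store:
        # Keys come back as '/table_0', '/table_1', ...
        names = [k.lstrip("/") for k in store.keys()]
        indices = sorted(
            int(name[len(prefix):])
            for name in names
            if name.startswith(prefix) and name[len(prefix):].isdigit()
        )
        rows = [store[f"{prefix}{i}"] for i in indices]
    return pd.DataFrame(rows)
```

Sorting by the numeric suffix matters because a plain string sort would place table_10 before table_2.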

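Note that I am not saving the data frame in chunks here. The loading procedure in its current form is therefore not suitable for data saved in chunks, where each chunk would be a DataFrame. In my implementation, each saved object is a Series, and the DataFrame is generated from a list of Series. The code I provided can be adapted to save in chunks and to generate the DataFrame from a list of DataFrame objects (a good starting point can be found in this Stack Overflow entry).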
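
The chunked adaptation mentioned above could look roughly like the sketch below: each slice of rows is saved as a DataFrame under its own key, and the pieces are joined with pd.concat on load. The function names, the "chunk" key prefix and the batch_size parameter are assumptions made for illustration, not part of the original answers, and the sketch has not been tried on a frame as large as the one in the question.

```python
import pandas as pd


def save_in_chunks(df, path, batch_size, prefix="chunk", **hdf_kwargs):
    """Save df as consecutive row slices, one HDF key per slice.

    Hypothetical helper: returns the list of keys written, in order.
    Extra keyword arguments are forwarded to DataFrame.to_hdf.
    """
    keys = []
    for n, start in enumerate(range(0, len(df), batch_size)):
        key = f"{prefix}_{n}"
        # 'w' on the first slice replaces any old file; 'a' then adds keys.
        mode = "w" if n == 0 else "a"
        df.iloc[start:start + batch_size].to_hdf(path, key=key, mode=mode, **hdf_kwargs)
        keys.append(key)
    return keys


def load_chunks(path, keys):
    """Read the given keys back and stack them into one DataFrame."""
    return pd.concat([pd.read_hdf(path, key=k) for k in keys])
```

Compared with one key per row, this keeps the number of keys at roughly len(df) / batch_size. Note that pd.concat raises an error on an empty list, so load_chunks expects at least one key.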
