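I have been using pandas in my scripts for a while now, mainly to store large data sets in an easily accessible way. A few days ago I stumbled upon this problem and have not been able to solve it so far.
The problem is that after I store a huge DataFrame in an HDF5 file and load it back later, it sometimes has one or more columns (only object-dtype columns) that are completely inaccessible and raise a "'NoneType' object is not iterable" error.
As long as I work with the frame in memory there is no problem, even with data sets somewhat larger than the example below. It is worth mentioning that the frame contains multiple datetime columns or multiple VMS timestamps, as well as string, char, and integer columns. All non-object columns can, and do, have missing values.
At first I thought I was saving "NA" values in one of the object-dtype columns. Then I tried updating to the latest pandas version (0.9.1). Nothing has helped so far.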
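For reference, the check for missing values in object-dtype columns can be sketched roughly as follows. The helper name `object_columns_with_na` and the small `sample` frame are made up for illustration and are not part of my actual script:

```python
import pandas as pd


def object_columns_with_na(frame):
    """Return the names of object-dtype columns holding at least one missing value."""
    return [
        col
        for col in frame.columns
        if frame[col].dtype == object and frame[col].isnull().any()
    ]


# Small illustrative frame: one object column with a None, one without,
# and one float column with a missing value (which must not be reported).
sample = pd.DataFrame({
    "char": pd.Series(["A", None, "C"], dtype=object),
    "text": pd.Series(["x", "y", "z"], dtype=object),
    "number": [1.0, None, 3.0],
})

print(object_columns_with_na(sample))
```

Running the same helper on the full frame before writing it to the store would show which object columns carry `None` entries.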
I have been able to reproduce the error with the following code:
import pandas as pd
import numpy as np
import datetime
# Get VMS timestamps for today
time_now = datetime.datetime.today()
start_vms = datetime.datetime(1858, 11, 17)
t_delta = (time_now - start_vms)
vms_time = t_delta.total_seconds() * 10000000
# Generate Test Frame (dense)
test_records = []
vms_time1 = vms_time
vms_time2 = vms_time
for i in range(2000000):
    vms_time1 += 15 * np.random.randn()
    vms_time2 += 25 * np.random.randn()
    vms_time_diff = vms_time2 - vms_time1
    string1 = 'XXXXXXXXXX'
    string2 = 'XXXXXXXXXX'
    string3 = 'XXXXX'
    string4 = 'XXXXX'
    char1 = 'A'
    char2 = 'B'
    char3 = 'C'
    char4 = 'D'
    number1 = np.random.randint(1, 10)
    number2 = np.random.randint(1, 100)
    number3 = np.random.randint(1, 1000)
    test_records.append((char1, string1, vms_time1, number1, char2, string2, vms_time2, number2, char3, string3, vms_time_diff, number3, char4, string4))
df = pd.DataFrame(test_records, columns = ["column_1", "column_2", "column_3", "column_4", "column_5", "column_6", "column_7", "column_8", "column_9", "column_10", "column_11", "column_12", "column_13", "column_14"])
# Generate Test Frame (sparse)
test_records = []
vms_time1 = vms_time
vms_time2 = vms_time
count = 0
for i in range(2000000):
    if count % 23 == 0:
        # Every 23rd record is sparse: several fields are stored as None
        vms_time1 += 15 * np.random.randn()
        string1 = 'XXXXXXXXXX'
        string2 = ' '
        string3 = 'XXXXX'
        string4 = 'XXXXX'
        char1 = 'A'
        char2 = 'B'
        char3 = 'C'
        char4 = 'D'
        number1 = None
        number2 = np.random.randint(1, 100)
        number3 = np.random.randint(1, 1000)
        test_records.append((char1, string1, vms_time1, number1, char2, None, None, number2, char3, string3, None, number3, None, string4))
    else:
        # All other records are fully populated
        vms_time1 += 15 * np.random.randn()
        vms_time2 += 25 * np.random.randn()
        vms_time_diff = vms_time2 - vms_time1
        string1 = 'XXXXXXXXXX'
        string2 = 'XXXXXXXXXX'
        string3 = 'XXXXX'
        string4 = 'XXXXX'
        char1 = 'A'
        char2 = 'B'
        char3 = 'C'
        char4 = 'D'
        number1 = np.random.randint(1, 10)
        number2 = np.random.randint(1, 100)
        number3 = np.random.randint(1, 1000)
        test_records.append((char1, string1, vms_time1, number1, char2, string2, vms_time2, number2, char3, string3, vms_time_diff, number3, char4, string4))
    count += 1
df1 = pd.DataFrame(test_records, columns = ["column_1", "column_2", "column_3", "column_4", "column_5", "column_6", "column_7", "column_8", "column_9", "column_10", "column_11", "column_12", "column_13", "column_14"])
store_loc = "Some Location for the file"
h5_store = pd.HDFStore(store_loc)
h5_store['df1'] = df
h5_store['df2'] = df1
h5_store.close()
When I now try to load from this store, "df1" behaves normally, but "df2" produces the following error:
TypeError: 'NoneType' object is not iterable
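
To narrow down which columns are affected after loading, each column can be probed on its own. The following is only a minimal sketch: the helper name `probe_columns` is hypothetical, and it is exercised here on a small in-memory frame (`demo`, also made up), which is not expected to fail since the problem only shows up after a round trip through the store.

```python
import pandas as pd


def probe_columns(frame):
    """Try to read every column; map each failing column name to its error."""
    failures = {}
    for col in frame.columns:
        try:
            # Force every value of the column to be materialized.
            list(frame[col])
        except Exception as exc:
            failures[col] = "%s: %s" % (type(exc).__name__, exc)
    return failures


# In-memory stand-in with the same mix of None values as the sparse frame.
demo = pd.DataFrame({
    "column_1": pd.Series(["A", None], dtype=object),
    "column_4": [1.0, None],
})

print(probe_columns(demo))
```

Applied to the frame read back from the store (for example the object stored under 'df2'), the returned dictionary would list the columns that cannot be accessed along with the exception each one raises.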