(Edited to clarify my use case; apologies for any confusion.)

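I run an experiment that is divided into trials. Each trial can produce either invalid data or valid data. When there is valid data, it takes the form of a variable-length list of numbers, which may be of zero length.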

So an invalid trial produces None, and a valid trial can produce [], [1,2], and so on.

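Ideally, I would like to save this data as a frame_table (call it data). I have another table (call it trials) that is easily converted to a frame_table and that I use as a selector to extract rows (trials). I would then like to pull out the matching data using select_as_multiple.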

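Right now I am saving the data structure as a regular table, because I am using an object array. I realize people say this is inefficient, but I can't think of an efficient way to handle the variable-length nature of data.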

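I understand that I can use NaNs and make a (potentially very wide) table whose width is the maximum length of my data arrays, but then I need a separate mechanism to flag invalid trials. A row of all NaNs is ambiguous: does it mean I had a zero-length data trial, or an invalid trial?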

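I don't think there is a good solution for this in Pandas. The NaN solution leads me to a potentially very wide table plus an additional column marking valid/invalid trials.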
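For concreteness, here is a minimal sketch of the NaN-padded layout described above. The sample trials, the column names `x0`, `x1`, and the helper columns `valid` and `length` are all made up for illustration; they are not part of my real data.

```python
import numpy as np
import pandas as pd

# Hypothetical trial results: None marks an invalid trial,
# a (possibly empty) list holds the data of a valid one.
raw = [None, [], [1.0, 2.0], [3.0]]

# The table must be as wide as the longest valid trial.
width = max(len(r) for r in raw if r is not None)

# Right-pad every trial with NaN up to that width (invalid trials become all NaN).
rows = [(r or []) + [np.nan] * (width - len(r or [])) for r in raw]
data = pd.DataFrame(rows, columns=[f"x{i}" for i in range(width)], dtype="float64")

# Extra columns disambiguate rows that are all NaN.
data["valid"] = [r is not None for r in raw]
data["length"] = [len(r) if r is not None else 0 for r in raw]

print(data)
```

Rows 0 and 1 are identical in the numeric columns; only `valid` tells the invalid trial apart from the zero-length one. Every column here has a plain float, bool, or integer dtype rather than object.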

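If I were using a database, I would make data a binary blob column. With Pandas, my current working solution is to save data as an object array in a regular frame, load it all in, and then pull out the relevant indexes based on my trials table.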

This is slightly inefficient, since I read the whole data table in one go, but it is the most workable/extensible scheme I have come up with.
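An in-memory sketch of that workaround, leaving out the HDF5 read itself. The contents of `data` and the `condition` and `subject` columns of `trials` are hypothetical; the only assumption that matters is that both objects share the same index.

```python
import pandas as pd

# Hypothetical per-trial results kept as Python objects (object dtype).
data = pd.Series([None, [], [1, 2], [3, 4, 5]], dtype=object, name="data")

# Hypothetical trials table sharing the same index as `data`.
trials = pd.DataFrame({
    "condition": ["a", "a", "a", "b"],
    "subject": [1, 1, 2, 2],
})

# Step 1: use `trials` as the selector to find the rows of interest.
wanted = trials.index[trials["condition"] == "a"]

# Step 2: pull the matching entries out of the fully loaded `data`.
selected = data.loc[wanted]

# Step 3: drop invalid trials (None) but keep empty-but-valid ones ([]).
valid = selected[selected.notna()]

print(valid)
```

The None versus [] distinction survives because the values are stored as objects, which is exactly what prevents the table format from being used.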

But I would most enthusiastically welcome a more canonical solution.

Thanks so much for all your time!

EDIT: Adding code (Jeff's suggestion)

import pandas as pd, numpy
mydata = [numpy.empty(n) for n in range(1,11)]

df = pd.DataFrame(mydata)

In [4]: df
Out[4]: 
                                                   0
0                               [1.28822975392e-231]
1           [1.28822975392e-231, -2.31584192385e+77]
2  [1.28822975392e-231, -1.49166823584e-154, 2.12...
3  [1.28822975392e-231, 1.2882298313e-231, 2.1259...
4  [1.28822975392e-231, 1.72723381477e-77, 2.1259...
5  [1.28822975392e-231, 1.49166823584e-154, 1.531...
6  [1.28822975392e-231, -2.68156174706e+154, 2.20...
7  [1.28822975392e-231, -2.68156174706e+154, 2.13...
8  [1.28822975392e-231, -1.3365130604e-315, 2.222...
9  [1.28822975392e-231, -1.33651054067e-315, 2.22...

In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
0    10  non-null values
dtypes: object(1)

df.to_hdf('test.h5','data')
--> OK

df.to_hdf('test.h5','data1',table=True)
--> ...
TypeError: Cannot serialize the column [0] because
its data contents are [mixed] object dtype

1 Answer

Here is a simple example along the lines of what you have described

In [17]: df = DataFrame(randn(10,10))

In [18]: df.iloc[5:10,7:9] = np.nan

In [19]: df.iloc[7:10,4:9] = np.nan

In [22]: df.iloc[7:10,-1] = np.nan

In [23]: df
Out[23]: 
          0         1         2         3         4         5         6         7         8         9
0 -1.671523  0.277972 -1.217315 -1.390472  0.944464 -0.699266  0.348579  0.635009 -0.330561 -0.121996
1  0.239482 -0.050869  0.488322 -0.668864  0.125534 -0.159154  1.092619 -0.638932 -0.091755  0.291824
2  0.432216 -1.101879  2.082755 -0.500450  0.750278 -1.960032 -0.688064 -0.674892  3.225115  1.035806
3  0.775353 -1.320165 -0.180931  0.342537  2.009530  0.913223  0.581071 -1.111551  1.118720 -0.081520
4 -0.255524  0.143255 -0.230755 -0.306252  0.748510  0.367886 -1.032118  0.232410  1.415674 -0.420789
5 -0.850601  0.273439 -0.272923 -1.248670  0.041129  0.506832  0.878972       NaN       NaN  0.433333
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417  0.504146       NaN       NaN -0.635012
7 -0.241512  0.159100  0.223019 -0.750034       NaN       NaN       NaN       NaN       NaN       NaN
8 -1.511968 -0.391903  0.257445 -1.642250       NaN       NaN       NaN       NaN       NaN       NaN
9 -0.376762  0.977394  0.760578  0.964489       NaN       NaN       NaN       NaN       NaN       NaN

In [24]: df['stop'] = df.apply(lambda x: x.last_valid_index(), 1)

In [25]: df
Out[25]: 
          0         1         2         3         4         5         6         7         8         9  stop
0 -1.671523  0.277972 -1.217315 -1.390472  0.944464 -0.699266  0.348579  0.635009 -0.330561 -0.121996     9
1  0.239482 -0.050869  0.488322 -0.668864  0.125534 -0.159154  1.092619 -0.638932 -0.091755  0.291824     9
2  0.432216 -1.101879  2.082755 -0.500450  0.750278 -1.960032 -0.688064 -0.674892  3.225115  1.035806     9
3  0.775353 -1.320165 -0.180931  0.342537  2.009530  0.913223  0.581071 -1.111551  1.118720 -0.081520     9
4 -0.255524  0.143255 -0.230755 -0.306252  0.748510  0.367886 -1.032118  0.232410  1.415674 -0.420789     9
5 -0.850601  0.273439 -0.272923 -1.248670  0.041129  0.506832  0.878972       NaN       NaN  0.433333     9
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417  0.504146       NaN       NaN -0.635012     9
7 -0.241512  0.159100  0.223019 -0.750034       NaN       NaN       NaN       NaN       NaN       NaN     3
8 -1.511968 -0.391903  0.257445 -1.642250       NaN       NaN       NaN       NaN       NaN       NaN     3
9 -0.376762  0.977394  0.760578  0.964489       NaN       NaN       NaN       NaN       NaN       NaN     3
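A stop column like this can later be used to strip the padding again. The following is only a sketch on a small made-up frame (the names `padded` and `trials` are hypothetical), and it assumes every row has at least one valid value; for an all-NaN row last_valid_index() returns None, so such rows would need separate handling.

```python
import numpy as np
import pandas as pd

# Hypothetical ragged trials, right-padded with NaN to a common width.
padded = pd.DataFrame([
    [1.0, 2.0, 3.0],
    [4.0, np.nan, np.nan],
    [5.0, 6.0, np.nan],
])

# Same idea as the stop column: label of the last non-NaN cell in each row.
stop = padded.apply(lambda row: row.last_valid_index(), axis=1)

# Label-based slicing with .loc is inclusive, so this keeps columns 0..stop.
trials = [padded.loc[i, :s].tolist() for i, s in stop.items()]

print(trials)
```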

Note that in 0.12 you should use table=True rather than fmt (this is in the process of being changed)

In [26]: df.to_hdf('test.h5','df',mode='w',fmt='t')

In [27]: pd.read_hdf('test.h5','df')
Out[27]: 
          0         1         2         3         4         5         6         7         8         9  stop
0 -1.671523  0.277972 -1.217315 -1.390472  0.944464 -0.699266  0.348579  0.635009 -0.330561 -0.121996     9
1  0.239482 -0.050869  0.488322 -0.668864  0.125534 -0.159154  1.092619 -0.638932 -0.091755  0.291824     9
2  0.432216 -1.101879  2.082755 -0.500450  0.750278 -1.960032 -0.688064 -0.674892  3.225115  1.035806     9
3  0.775353 -1.320165 -0.180931  0.342537  2.009530  0.913223  0.581071 -1.111551  1.118720 -0.081520     9
4 -0.255524  0.143255 -0.230755 -0.306252  0.748510  0.367886 -1.032118  0.232410  1.415674 -0.420789     9
5 -0.850601  0.273439 -0.272923 -1.248670  0.041129  0.506832  0.878972       NaN       NaN  0.433333     9
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417  0.504146       NaN       NaN -0.635012     9
7 -0.241512  0.159100  0.223019 -0.750034       NaN       NaN       NaN       NaN       NaN       NaN     3
8 -1.511968 -0.391903  0.257445 -1.642250       NaN       NaN       NaN       NaN       NaN       NaN     3
9 -0.376762  0.977394  0.760578  0.964489       NaN       NaN       NaN       NaN       NaN       NaN     3
answered 2013-08-29T19:27:29.900