python - Multidimensional arrays in a Pandas column

Question

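My data will consist of many attributes that can be described by arrays of arbitrary length (for example, an object may contain some number of clusters, and I want to store the size of each constituent cluster as a column, but the number of clusters per object can in principle range from 0 to \infty). Is there a way to support arrays of any length as column data in a Pandas DataFrame? I realize I could use a Panel, but AFAIK that requires knowing the depth of the panel (which in principle I cannot know before loading the data). In addition, the panel could be very sparse, since in this example many objects may have only a few clusters.

If I just use numpy arrays with dtype=object, will that have any impact on storage in an H5Store, on the performance of Pandas selections, or on anything else?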

score 0 · Accepted Answer

Instead of a varying number of columns per object, you would have a varying number of rows per object:

pd.DataFrame({'ObjectID' : [1,1,2,2,2,2,3], 'ClusterID' : '1a,1b,2a,2b,2c,2d,3a'.split(',')})
  ObjectID  ClusterID
0        1         1a
1        1         1b
2        2         2a
3        2         2b
4        2         2c
5        2         2d
6        3         3a
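
As a minimal sketch of how this long layout copes with a variable number of clusters per object, the snippet below rebuilds the table above and groups it by object. The variable names (`membership`, `clusters_per_object`, `clusters_as_lists`) are illustrative and not part of the original answer.

```python
import pandas as pd

# Long layout: one row per (object, cluster) pair, so an object
# with N clusters simply occupies N rows.
membership = pd.DataFrame({
    'ObjectID': [1, 1, 2, 2, 2, 2, 3],
    'ClusterID': '1a,1b,2a,2b,2c,2d,3a'.split(','),
})

# Number of clusters belonging to each object.
clusters_per_object = membership.groupby('ObjectID')['ClusterID'].count()

# Recover the variable-length list of clusters for each object on demand.
clusters_as_lists = membership.groupby('ObjectID')['ClusterID'].apply(list)

print(clusters_per_object)
print(clusters_as_lists)
```

An object with no clusters has no rows in this table, so it needs to be tracked elsewhere (for example in a separate table of objects) if it must still be listed.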

If each cluster has multiple attributes, you could store them in a separate table, as below. This would allow multiple objects to share clusters without having to replicate data:

pd.DataFrame({'ClusterID' : '1a,1b,2a,2b,2c,2d,3a'.split(','), 'ClusterAttr-1' : 'Attr-1', 'ClusterAttr-2' : 'Attr-2'})
  ClusterID ClusterAttr-1 ClusterAttr-2
0        1a        Attr-1        Attr-2
1        1b        Attr-1        Attr-2
2        2a        Attr-1        Attr-2
3        2b        Attr-1        Attr-2
4        2c        Attr-1        Attr-2
5        2d        Attr-1        Attr-2
6        3a        Attr-1        Attr-2
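
To combine the two tables when the attributes are needed per object, one option is a join on `ClusterID`. The following is a sketch under the assumption that `ClusterID` is unique in the attribute table; the names `membership`, `cluster_attrs` and `joined` are illustrative, and the attribute values are the same placeholders used above.

```python
import pandas as pd

# Object-to-cluster membership: one row per (object, cluster) pair.
membership = pd.DataFrame({
    'ObjectID': [1, 1, 2, 2, 2, 2, 3],
    'ClusterID': '1a,1b,2a,2b,2c,2d,3a'.split(','),
})

# Per-cluster attributes, stored once per cluster.
# The scalar strings are broadcast to every row.
cluster_attrs = pd.DataFrame({
    'ClusterID': '1a,1b,2a,2b,2c,2d,3a'.split(','),
    'ClusterAttr-1': 'Attr-1',
    'ClusterAttr-2': 'Attr-2',
})

# Left join keeps every membership row; validate raises if ClusterID
# is duplicated in cluster_attrs (the assumption stated above).
joined = membership.merge(
    cluster_attrs, on='ClusterID', how='left', validate='many_to_one'
)

print(joined)
```

Keeping the attributes in their own table means a cluster shared by several objects is described once and only referenced by ID in the membership table.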
