python - Pandas DataFrame with variable-length MultiIndex replaces values with NaN

Question

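I am working with a Pandas representation of a fairly complex dataset from a survey. So far, a one-dimensional Series of variables with a MultiIndex seems best suited for storing and working with this data.

Each variable name consists of a "path" that uniquely identifies that particular response. These paths vary in length. I am trying to figure out whether I have misunderstood how hierarchical indexes are supposed to work or whether I have hit a bug. When shorter index tuples are joined into the dataset, Pandas appears to "pad" them to the maximum length and clobbers the value in the process.

For example, this test fails: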

def test_dataframe_construction1(self):
    case1 = pd.Series(True, pd.MultiIndex.from_tuples([
        ('a1', 'b1', 'c1'),
        ('a2', 'b2', 'c2', 'd1', 'e1'),
        ]))
    case2 = pd.Series(True, pd.MultiIndex.from_tuples([
        ('a3', 'b3', 'c3'),
        ('a4', 'b4', 'c4', 'd2', 'e2'),
        ]))
    df = pd.DataFrame({
        'case1': case1,
        'case2': case2
    })
    logger.debug(df)
    self.assertEqual(df['case1'].loc['a1'].any(), True)

And prints:

a1 b1 c1 nan nan   NaN   NaN
a2 b2 c2 d1  e1   True   NaN
a3 b3 c3 nan nan   NaN   NaN
a4 b4 c4 d2  e2    NaN  True

Interestingly, padding the "shorter" index tuples with empty strings instead of NaN results in the behavior I expect:

def test_dataframe_construction2(self):
    case1 = pd.Series(True, pd.MultiIndex.from_tuples([
        ('a1', 'b1', 'c1', '', ''),
        ('a2', 'b2', 'c2', 'd1', 'e1'),
    ]))
    case2 = pd.Series(True, pd.MultiIndex.from_tuples([
        ('a3', 'b3', 'c3', '', ''),
        ('a4', 'b4', 'c4', 'd2', 'e2'),
    ]))
    df = pd.DataFrame({
        'case1': case1,
        'case2': case2
    })
    logger.debug(df)
    self.assertEqual(df['case1'].loc['a1'].any(), True)

And prints:

                case1 case2
a1 b1 c1        True   NaN
a2 b2 c2 d1 e1  True   NaN
a3 b3 c3         NaN  True
a4 b4 c4 d2 e2   NaN  True
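
The empty-string padding used in this second test can be applied automatically before the index is built. The following is only a minimal sketch under that assumption: `pad_paths` is a hypothetical helper name, not part of pandas, and `''` is used as the filler because that is the value that worked in the test above.

```python
import pandas as pd


def pad_paths(paths, fill=''):
    """Pad every path tuple with `fill` so that all tuples have the same length.

    Hypothetical helper for illustration; not a pandas function.
    """
    width = max(len(p) for p in paths)
    return [tuple(p) + (fill,) * (width - len(p)) for p in paths]


# Paths of different lengths, as in the question
paths1 = pad_paths([('a1', 'b1', 'c1'), ('a2', 'b2', 'c2', 'd1', 'e1')])
paths2 = pad_paths([('a3', 'b3', 'c3'), ('a4', 'b4', 'c4', 'd2', 'e2')])

# Every tuple now has five entries, so no level is filled with NaN
case1 = pd.Series(True, index=pd.MultiIndex.from_tuples(paths1))
case2 = pd.Series(True, index=pd.MultiIndex.from_tuples(paths2))

df = pd.DataFrame({'case1': case1, 'case2': case2})
```

Note that this only helps when every Series is padded to the same width; if the longest path differs between Series, the width would have to be computed across all of them.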

What am I missing here? Thanks!

score 1 · Accepted Answer

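Avoid using NaN in an index. Beyond that, you need a different schema to represent the relationships between paths, cases, and data. The fact that you need a variable number of MultiIndex levels is a strong hint, and the case columns look like they each use only a few of the paths. I would split the nodes, paths, and case data into separate DataFrames. In the example below, I show how to represent the first path of case1.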

import pandas as pd
from itertools import product

node_names = ['%s%d' % t for t in product('abcd', range(1, 5))]
nodes = pd.DataFrame({'node': node_names})
nodes.index.name = 'id'

path_nodes = pd.DataFrame({'path_id': [0, 0, 0],
                           'node_id': [0, 4, 8],
                           'position': [0, 1, 2]})

data = pd.DataFrame({'path_id': [0],
                     'case': [1],
                     'data': [True]})

In [113]: nodes
Out[113]: 
   node
id     
0    a1
1    a2
2    a3
3    a4
4    b1
5    b2
6    b3
7    b4
8    c1
...

In [114]: path_nodes
Out[114]: 
   node_id  path_id  position
0        0        0         0
1        4        0         1
2        8        0         2

In [115]: data
Out[115]: 
   case  data  path_id
0     1  True        0
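
To get from these three tables back to a readable path per case, the tables can be joined on their ids. This is a rough sketch rather than part of the schema itself; the names `steps`, `paths`, and `result` are introduced here purely for illustration.

```python
import pandas as pd
from itertools import product

# Same three tables as in the answer above
node_names = ['%s%d' % t for t in product('abcd', range(1, 5))]
nodes = pd.DataFrame({'node': node_names})
nodes.index.name = 'id'

path_nodes = pd.DataFrame({'path_id': [0, 0, 0],
                           'node_id': [0, 4, 8],
                           'position': [0, 1, 2]})

data = pd.DataFrame({'path_id': [0],
                     'case': [1],
                     'data': [True]})

# Look up the node name for every step of every path
steps = path_nodes.merge(nodes, left_on='node_id', right_index=True)

# Rebuild each path as a tuple of node names, ordered by position
paths = (steps.sort_values(['path_id', 'position'])
              .groupby('path_id')['node']
              .apply(tuple)
              .rename('path'))

# Attach the reconstructed path to the case data
result = data.merge(paths, left_on='path_id', right_index=True)
```

Because paths are stored as rows in `path_nodes`, a path of any length needs no padding, and no index level ever has to hold NaN.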
