python - Lookup from a pandas DataFrame given a MultiIndex

Question

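While playing with the kaggle titanic dataset in pandas, I found a place where I wrote an explicit loop in Python, and I wonder whether there is a more efficient way. Consider the following program: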

#!/usr/bin/python
import pandas as pd

# Assume that we have a dataframe with three fields
f = pd.DataFrame([ (0,1,1),
                   (1,0,0),
                   (1,1,1),
                   (1,1,0),
                   ],
                 columns=list('ABY'))

# and a multi index of A,B
idx = pd.MultiIndex.from_product([(0,1),(0,1)],
                                 names=list('AB'))

# For each idx I want a list of the values of f.Y for which A and B match. This
# can be done through the following loop:
e = []
for a,b in idx:
  e += [list(f.Y[(f.A==a) & (f.B==b)])]

s = pd.Series(e, index=idx, name='Y')
print(s)

# Yields:
# A  B
# 0  0        []
#    1       [1]
# 1  0       [0]
#    1    [1, 0]
# Name: Y, dtype: object

My question is whether s can be generated without the loop.

score 2 · Accepted Answer

Here is one way that gives almost the same result:

>>> f.groupby(["A", "B"])["Y"].apply(list).ix[idx]
A  B
0  0       NaN
   1       [1]
1  0       [0]
   1    [1, 0]
dtype: object
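
On recent pandas releases the `.ix` indexer is no longer available (it was deprecated and later removed), so the line above will not run as written. A minimal sketch of the same idea, assuming `reindex` is an acceptable substitute for `.ix[idx]` here; the variable name `grouped` is only illustrative:

```python
import pandas as pd

# Same data and index as in the question.
f = pd.DataFrame([(0, 1, 1), (1, 0, 0), (1, 1, 1), (1, 1, 0)],
                 columns=list('ABY'))
idx = pd.MultiIndex.from_product([(0, 1), (0, 1)], names=list('AB'))

# Group Y by (A, B) and collect each group's values into a list.
grouped = f.groupby(["A", "B"])["Y"].apply(list)

# reindex aligns the result to idx and inserts NaN for any (A, B)
# combination that has no rows in f, here (0, 0).
s = grouped.reindex(idx)
print(s)
```

The printed repr may differ slightly between pandas versions, but the values should match the output shown above.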

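The only difference is that where there is no match, this gives NaN instead of an empty list. Unfortunately, because of a known pandas issue, you cannot use fillna to replace the NaN with an empty list. However, you can remove it with dropna, and in many cases you do not actually need an empty entry for the combinations that have no match.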
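
If the empty lists are actually needed, one possible workaround is to replace the missing entries element by element rather than through fillna. The sketch below is an assumption about how that could look, not part of the original answer; it uses `reindex` so that it runs on current pandas, and the names `filled` and `matched` are illustrative:

```python
import pandas as pd

# Same data and index as in the question.
f = pd.DataFrame([(0, 1, 1), (1, 0, 0), (1, 1, 1), (1, 1, 0)],
                 columns=list('ABY'))
idx = pd.MultiIndex.from_product([(0, 1), (0, 1)], names=list('AB'))

# Lists per (A, B) group, with NaN where a combination has no rows.
s = f.groupby(["A", "B"])["Y"].apply(list).reindex(idx)

# Option 1: keep every index entry and swap each non-list value (the NaN)
# for a fresh empty list.
filled = s.apply(lambda v: v if isinstance(v, list) else [])

# Option 2: drop the combinations that have no matching rows.
matched = s.dropna()

print(filled)
print(matched)
```

Note that `apply` still visits each element in Python, so this removes the explicit loop from the code but not the per-element work.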
