python - Pandas: Filtering a DataFrame using groupby and a function

Question

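Using Python 3.3 and Pandas 0.10.

I have a DataFrame built by concatenating multiple CSV files. First, I filter out all values in the Name column that contain a certain string. The result looks something like this (shortened for brevity; there are actually more columns):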

Name    ID
'A'     1
'B'     2
'C'     3
'C'     3
'E'     4
'F'     4
...     ...

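Now my issue is that I want to remove a special case of "duplicate" values. I want to remove all ID duplicates (actually the entire row) where the corresponding Name values mapped to that ID are not the same. In the example above, I would like to keep the rows with ID 1, 2 and 3. For ID=4 the Name values are unequal, so I want to remove those rows.

I tried to use the following line of code (based on the suggestion here: Python Pandas: remove entries based on the number of occurrences).

Code: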

df[df.groupby('ID').apply(lambda g: len({x for x in g['Name']})) == 1]

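However, this gives me the error: ValueError: Item wrong length 51906 instead of 109565!

Edit:

Instead of using apply() I have also tried using transform(), but this gives me the error: AttributeError: 'int' object has no attribute 'ndim'. An explanation of why the error differs between the two functions would be much appreciated!

Also, I want to keep all rows where ID = 3 in the above example.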

Thanks in advance, Matthijs

score 5 · Accepted Answer

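Instead of the length (len), I think you want to consider the number of unique values of Name in each group. Use nunique(), and check out this neat recipe for filtering groups.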

df[df.groupby('ID').Name.transform(lambda x: x.nunique() == 1).astype('bool')]
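
To illustrate why the apply() attempt in the question produces a length mismatch while transform() yields a usable mask, here is a minimal sketch on the question's sample data. It assumes a recent pandas release (the string alias "nunique" passed to transform is an assumption that may not hold on 0.10), and the variable names per_group, per_row and result are made up for illustration.

```python
import pandas as pd

# Sample data mirroring the table in the question.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

# A per-group aggregation returns ONE value per group, indexed by ID.
# Here that is 4 values for 6 rows, so it cannot serve as a row mask
# (compare 51906 groups vs. 109565 rows in the question's error).
per_group = df.groupby("ID")["Name"].nunique()

# transform broadcasts the per-group value back onto every original row,
# so the result has the same length and index as df.
per_row = df.groupby("ID")["Name"].transform("nunique")

# Keep rows whose ID maps to exactly one distinct Name.
result = df[per_row == 1]
print(result)
```

Both rows with ID 3 survive, because their Name values agree; only the ID 4 rows are dropped.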

If you upgrade to pandas 0.12, you can use the new filter method on groups, which makes this more succinct and straightforward.

df.groupby('ID').filter(lambda x: x.Name.nunique() == 1)
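
As a quick check of the filter approach, the following sketch runs it on the question's sample data. It assumes pandas 0.12 or later, and the name kept is only illustrative.

```python
import pandas as pd

# Sample data mirroring the table in the question.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

# filter calls the function once per group and keeps every row of the
# groups for which it returns True.
kept = df.groupby("ID").filter(lambda g: g["Name"].nunique() == 1)
print(kept)
```

The surviving rows keep their original index labels, so the result can still be aligned with the unfiltered DataFrame.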

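A general remark: sometimes, of course, you do want to know the length of a group, but I find that size is a safer choice than len, which has been troublesome for me in some cases.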
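
One way len can surprise, sketched below on the question's sample data: called on the GroupBy object itself it counts groups, whereas size() reports the number of rows in each group. This may or may not be the trouble alluded to above; it is shown only as an illustration, assuming a recent pandas release.

```python
import pandas as pd

# Sample data mirroring the table in the question.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

groups = df.groupby("ID")

# size() returns a Series: rows per group, indexed by the group key.
sizes = groups.size()

# len() on the GroupBy object returns the number of groups, not rows.
n_groups = len(groups)

# Row count and distinct-Name count differ for ID 3 (2 rows, 1 name),
# which is why nunique rather than a length is the right test here.
distinct_names = groups["Name"].nunique()
print(sizes, n_groups, distinct_names, sep="\n")
```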

score 0 · Accepted Answer

You could first drop the duplicates:

In [11]: df = df.drop_duplicates()

In [12]: df
Out[12]:
  Name ID
0    A  1
1    B  2
2    C  3
4    E  4
5    F  4

Then group by ID and only consider those groups with one element:

In [13]: g = df.groupby('ID')

In [14]: size = (g.size() == 1)

In [15]: size
Out[15]:
ID
1      True
2      True
3      True
4     False
dtype: bool

In [16]: size[size].index
Out[16]: Int64Index([1, 2, 3], dtype=int64)

In [17]: df['ID'].isin(size[size].index)
Out[17]:
0     True
1     True
2     True
4    False
5    False
Name: ID, dtype: bool

And boolean index by this:

In [18]: df[df['ID'].isin(size[size].index)]
Out[18]:
  Name ID
0    A  1
1    B  2
2    C  3
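
The steps above can be condensed into a short script. This is a sketch assuming a recent pandas release; the names deduped, single_ids and keep_all are made up for illustration. Since the question asks to keep all rows with ID = 3, the last line applies the same ID list to the original, non-deduplicated DataFrame.

```python
import pandas as pd

# Sample data mirroring the table in the question.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

# Drop fully identical rows. With more columns than shown here, one would
# likely need drop_duplicates(subset=["Name", "ID"]) instead (assumption).
deduped = df.drop_duplicates()

# After deduplication, an ID with exactly one row has exactly one Name.
sizes = deduped.groupby("ID").size()
single_ids = sizes[sizes == 1].index

# Result as in the transcript above: one row per surviving ID.
result = deduped[deduped["ID"].isin(single_ids)]

# Variant that keeps the repeated ID 3 rows from the original frame.
keep_all = df[df["ID"].isin(single_ids)]
print(result)
print(keep_all)
```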
