python - How to remove DataFrame rows where the order of values doesn't matter

Question

I have a DataFrame like this:

source   target   weight
     1       2         5
     2       1         5
     1       2         5
     1       2         7
     3       1         6
     1       1         6
     1       3         6

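My goal is to remove duplicate rows, where the order of the source and target columns doesn't matter. That is, rows that differ only in the order of those two columns should count as duplicates and be removed. In this case, the expected result would be: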

source   target   weight
     1       2         5
     1       2         7
     3       1         6
     1       1         6

Is there a way to do this without a loop?

score 4 · Accepted Answer

Use frozenset and duplicated:

df[~df[['source', 'target']].apply(frozenset, 1).duplicated()]

   source  target  weight
0       1       2       5
4       3       1       6
5       1       1       6
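
To see why this works, here is a small self-contained sketch, assuming the 7-row frame from the question with its default 0..6 index. A frozenset ignores element order and is hashable, so `duplicated` can compare the pairs directly. The name `pair_key` is only illustrative.

```python
import pandas as pd

# The question's example data (default index 0..6).
df = pd.DataFrame(
    [[1, 2, 5], [2, 1, 5], [1, 2, 5], [1, 2, 7], [3, 1, 6], [1, 1, 6], [1, 3, 6]],
    columns=["source", "target", "weight"],
)

# Order does not matter for a frozenset, and it can be hashed.
assert frozenset((1, 2)) == frozenset((2, 1))

# One frozenset per row, built from the two endpoint columns only.
pair_key = df[["source", "target"]].apply(frozenset, axis=1)

# Keep the first row seen for each unordered pair (weight is ignored here).
deduped = df[~pair_key.duplicated()]
print(deduped)
```

Because weight is not part of the key, the row (1, 2, 7) is dropped as a duplicate of (1, 2, 5).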

If you want to consider the unordered source/target pair together with weight:

df[~df[['weight']].assign(A=df[['source', 'target']].apply(frozenset, 1)).duplicated()]

   source  target  weight
0       1       2       5
3       1       2       7
4       3       1       6
5       1       1       6

However, to be more explicit, here is the same idea as more readable code:

# Create series where values are frozensets and therefore hashable.
# With hashable things, we can detect duplicates.
# Note that I also set the index and name to set up for a convenient `join`
s = pd.Series(list(map(frozenset, zip(df.source, df.target))), df.index, name='mixed')

# Use `drop` to focus on just those columns leaving whatever else is there.
# This is more general and accommodates more than just a `weight` column.
mask = df.drop(['source', 'target'], axis=1).join(s).duplicated()

df[~mask]

   source  target  weight
0       1       2       5
3       1       2       7
4       3       1       6
5       1       1       6
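
As an alternative to frozenset (not part of the answer above), the pairs can be normalized with `numpy.sort`, which avoids a Python-level call per row. This is a rough sketch under the same assumption about the question's data. The column names `lo` and `hi` are made up for illustration.

```python
import numpy as np
import pandas as pd

# The question's example data (default index 0..6).
df = pd.DataFrame(
    [[1, 2, 5], [2, 1, 5], [1, 2, 5], [1, 2, 7], [3, 1, 6], [1, 1, 6], [1, 3, 6]],
    columns=["source", "target", "weight"],
)

# Sort each (source, target) pair so the smaller value comes first.
ends = np.sort(df[["source", "target"]].to_numpy(), axis=1)

# Helper frame used only as a comparison key; `lo`/`hi` are hypothetical names.
key = pd.DataFrame(
    {"lo": ends[:, 0], "hi": ends[:, 1], "weight": df["weight"].to_numpy()},
    index=df.index,
)

# Mask the original frame, so surviving rows keep their original orientation.
result = df[~key.duplicated()]
print(result)
```

Since only the key is sorted, the row (3, 1, 6) stays as written instead of being rewritten to (1, 3, 6).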

score 0 · Answer

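This should be fairly easy.

import pandas as pd

data = [[1, 2, 5],
        [2, 1, 5],
        [1, 2, 5],
        [1, 2, 7],
        [3, 1, 6],
        [1, 1, 6],
        [1, 3, 6]]
df = pd.DataFrame(data, columns=['source', 'target', 'weight'])

You can remove exact duplicates with drop_duplicates. Note that keep=False drops every copy of a duplicated row rather than keeping one, so the result is only printed here and not assigned back to df:

print(df.drop_duplicates(keep=False))

This results in:

   source  target  weight
1       2       1       5
3       1       2       7
4       3       1       6
5       1       1       6
6       1       3       6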
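
For reference, a tiny sketch of how the `keep` argument of `drop_duplicates` changes which rows survive. The three-row `demo` frame is invented for illustration and is not the question's data.

```python
import pandas as pd

# Rows 0 and 1 are exact duplicates; row 2 is the reversed pair.
demo = pd.DataFrame({"source": [1, 1, 2], "target": [2, 2, 1], "weight": [5, 5, 5]})

keep_first = demo.drop_duplicates()             # default keep='first'
keep_last = demo.drop_duplicates(keep="last")   # keep the last copy instead
keep_none = demo.drop_duplicates(keep=False)    # drop every duplicated copy

print(list(keep_first.index), list(keep_last.index), list(keep_none.index))
```

In all three cases the reversed row (2, 1, 5) survives, because `drop_duplicates` compares values column by column.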

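That alone does not handle the unordered source/target pairs, so sort each pair first:

def pair(row):
    # Put the smaller endpoint in 'source' so (2, 1) and (1, 2) compare equal.
    row = row.copy()  # avoid mutating the object passed in by apply
    row['source'], row['target'] = sorted([row['source'], row['target']])
    return row
df = df.apply(pair, axis=1)

Then you can use df.drop_duplicates(). Note that the pair (3, 1) is now stored as (1, 3):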

   source  target  weight
0       1       2       5
3       1       2       7
4       1       3       6
5       1       1       6
