python-3.x - How to isolate duplicates based on a partial match in a pandas DataFrame

Question

I have a pandas DataFrame that looks like this:

email                   col2  col3
email@example.com       John  Doe
xxxemail@example.com    John  Doe
xxemail@example.com     John  Doe
xxxxxemail@example.com  John  Doe
xxxemail@example2.com   Jane  Doe

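For every email address that starts with at least two 'x' characters, I want to check whether the same address also exists without those leading 'x' characters.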

Desired result:

email                   col2  col3  exists_in_valid_form
email@example.com       John  Doe   False
xxxemail@example.com    John  Doe   True
xxemail@example.com     John  Doe   True
xxxxxemail@example.com  John  Doe   True
xxxemail@example2.com   Jane  Doe   False

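I was able to get a sub-DataFrame containing all rows where the email starts with 'xx' using df[df['email'].str.contains("xx")], and I was also able to get the email addresses without the leading 'x' characters using str.lstrip('x'), but neither seems to help me find out whether the email appears elsewhere without those x's.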
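For reference, the two attempts described above can be sketched as follows. This is only an illustration built from the sample data in the question; the variable names prefixed and stripped are made up here. Note that str.contains("xx") matches 'xx' anywhere in the address, so str.startswith("xx") is used instead to anchor the check at the beginning.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "email": [
        "email@example.com",
        "xxxemail@example.com",
        "xxemail@example.com",
        "xxxxxemail@example.com",
        "xxxemail@example2.com",
    ],
    "col2": ["John", "John", "John", "John", "Jane"],
    "col3": ["Doe"] * 5,
})

# Rows whose email starts with at least two 'x' characters.
# (str.contains("xx") would also match "xx" in the middle of an address.)
prefixed = df[df["email"].str.startswith("xx")]

# The same column with every leading 'x' removed.
stripped = df["email"].str.lstrip("x")
```

Neither result on its own says whether the stripped address occurs in another row, which is the missing piece.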

score 1 · Accepted Answer

You can use duplicated() to check whether a value also occurs in other rows.

df['exists_in_valid_form'] = df.email.str.lstrip('x').duplicated(keep=False) & df.email.str.startswith('xx')

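I added df.email.str.startswith('xx') to ensure that the address starts with at least two 'x' characters, so that 'xemail@example.com' returns False. It also keeps the unprefixed row itself False, since its value counts as a duplicate once the other rows are stripped.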
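One caveat worth noting: duplicated(keep=False) flags any rows that share the same stripped value, so several 'x'-prefixed variants would mark each other True even when the clean address is absent. The following is a minimal sketch of a stricter variant, not part of the original answer. It assumes that a "valid form" is any address that does not start with 'xx' and that clean addresses do not themselves begin with 'x'; the helper name flag_exists_in_valid_form is hypothetical.

```python
import pandas as pd

def flag_exists_in_valid_form(df, column="email"):
    """Hypothetical helper: True where an 'xx'-prefixed address also
    appears in the column without its leading 'x' characters."""
    emails = df[column]
    has_prefix = emails.str.startswith("xx")
    # Strip the leading run of 'x' only when it is at least two long.
    stripped = emails.str.replace(r"^x{2,}", "", regex=True)
    # Addresses that are already in clean (unprefixed) form.
    valid_forms = set(emails[~has_prefix])
    return has_prefix & stripped.isin(valid_forms)

# Sample data from the question.
df = pd.DataFrame({
    "email": [
        "email@example.com",
        "xxxemail@example.com",
        "xxemail@example.com",
        "xxxxxemail@example.com",
        "xxxemail@example2.com",
    ],
    "col2": ["John", "John", "John", "John", "Jane"],
    "col3": ["Doe"] * 5,
})
df["exists_in_valid_form"] = flag_exists_in_valid_form(df)

# The expression from the answer, for comparison on the same data.
answer = (
    df.email.str.lstrip("x").duplicated(keep=False)
    & df.email.str.startswith("xx")
)

# A frame where the clean address is missing: the two approaches differ.
no_clean = pd.DataFrame({"email": ["xxemail@example.com", "xxxemail@example.com"]})
answer_no_clean = (
    no_clean.email.str.lstrip("x").duplicated(keep=False)
    & no_clean.email.str.startswith("xx")
)
strict_no_clean = flag_exists_in_valid_form(no_clean)
```

On the question's sample data both approaches give the same column. They differ only when prefixed variants exist without a clean counterpart, in which case the stricter check returns False for those rows.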
