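I have two CSV files. One is the original file, and the other is a deduped version of it. I have read each one into a list, and for all intents and purposes they are in the same format. Each list item is a string.
I'm trying to use a list comprehension to find out which items the dedupe removed. The original list has 16939 items and the deduped list has 15368. That is a difference of 1571, but my list comprehension returns only 368 items. Any ideas?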
# Read in the deduped file
with open('account_de_ex.csv', 'r') as f:
    deduped = f.read().split("\r")
# Read in the file with just the account names from the full account list
with open('account_names.csv', 'r') as f:
    account_names = f.read().split("\r")
# Get all the accounts that were deleted in the dedupe - i.e. get the duplicate accounts
dupes = [ele for ele in account_names if ele not in deduped]
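EDIT: In response to some of the comments, here is a test of my list comprehension and of the lists themselves. Converting both lists to sets gives pretty much the same small difference, 20 or so. Not the 1500 I need! Thanks!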
print len(deduped)
deduped = set(deduped)
print len(deduped)
print len(account_names)
account_names = set(account_names)
print len(account_names)
15368
15368
16939
15387
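One way to read these numbers, offered as a sketch rather than a confirmed diagnosis: the original list has 16939 - 15387 = 1552 repeated entries, and the two sets differ by 15387 - 15368 = 19 distinct names, which together account for the expected 1571. A membership test such as `ele not in deduped` cannot see the repeated entries, because every repeated name still has one copy in the deduped list. The example below uses made-up data and a hypothetical helper, `removed_by_dedupe`, to contrast the membership test with a count-based difference using `collections.Counter`. It assumes the dedupe kept exactly one copy of each name it retained.

```python
from collections import Counter


def removed_by_dedupe(original, deduped):
    """Return the entries of `original` that remain after taking away
    one copy for each entry in `deduped` (a multiset difference).

    Hypothetical helper for illustration only.
    """
    # Counter subtraction keeps only positive counts, so a name that
    # appears 3 times in `original` and once in `deduped` is left with 2.
    leftover = Counter(original) - Counter(deduped)
    return sorted(leftover.elements())


# Made-up sample data, not taken from the real CSV files.
original = ["acme", "acme", "beta", "gamma", "gamma", "gamma", "delta"]
deduped = ["acme", "beta", "gamma"]

# Membership test: finds only names that are entirely absent from `deduped`.
by_membership = [name for name in original if name not in deduped]

# Count-based difference: also finds the extra copies that were dropped.
by_count = removed_by_dedupe(original, deduped)

print(by_membership)
print(by_count)
```

With this sample, the count-based result has exactly `len(original) - len(deduped)` entries, which is the property the 1571 figure expects of the real data.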