
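I have built two lists from CSV files. One comes from the original CSV file, the other from a deduplicated version of that file. I read each file into a list and, for all intents and purposes, the two lists have the same format. Each list item is a string.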

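I am trying to use a list comprehension to find out which items the deduplication removed. The original list has 16939 items and the deduplicated list has 15368, a difference of 1571, but my list comprehension produces only 368 items. Any ideas?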

deduped = open('account_de_ex.csv', 'r')
deduped_data = deduped.read()
deduped.close()
deduped = deduped_data.split("\r")

#read in file with just the account names from the full account list
account_names = open('account_names.csv', 'r')
account_data = account_names.read()
account_names.close()
account_names = account_data.split("\r")

# Get all the accounts that were deleted in the dedupe - i.e. get the duplicate accounts
dupes = [ele for ele in account_names if ele not in deduped]

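Edit: in response to some of the comments, here is a test of my list comprehension and of the lists themselves. Converting each list to a set gives almost the same picture: the two sets differ by only about 20 items, not the roughly 1500 I need. Thanks!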

print(len(deduped))
deduped = set(deduped)
print(len(deduped))

print(len(account_names))
account_names = set(account_names)
print(len(account_names))


15368
15368
16939
15387

2 Answers


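Try running this code and see what it reports. It requires Python 2.7 or later for collections.Counter, but you could easily write your own counter code, or copy my example code from another answer: "Python : List of dict, if exists increment a dict value, if not append a new dict".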

from collections import Counter

# read in original records
with open("account_names.csv", "rt") as f:
    rows = sorted(line.strip() for line in f)

# count how many times each row appears
counts = Counter(rows)

# get a list of tuples of (count, row) that only includes count > 1
dups = [(count, row) for row, count in counts.items() if count > 1]
dup_count = sum(count-1 for count in counts.values() if count > 1)

# sort the list from largest number of dups to least
dups.sort(reverse=True)

# print a report showing how many dups
for count, row in dups:
    print("{}\t{}".format(count, row))

# get de-duped list
unique_rows = sorted(counts)

# read in de-duped list
with open("account_de_ex.csv", "rt") as f:
    de_duped = sorted(line.strip() for line in f)

print("List lengths: rows {}, uniques {}/de_duped {}, result {}".format(
        len(rows), len(unique_rows), len(de_duped), len(de_duped) + dup_count))

# lists should match since we sorted both lists
if unique_rows == de_duped:
    print("perfect match!")
else:
    # if lists don't match, find out what is going on
    uniques_set = set(unique_rows)
    deduped_set = set(de_duped)

    # find intersection of the two sets
    x = uniques_set.intersection(deduped_set)

    # print differences
    if x != uniques_set:
        print("Rows in original that are not in deduped:\n{}".format(sorted(uniques_set - x)))
    if x != deduped_set:
        print("Rows in deduped that are not in original:\n{}".format(sorted(deduped_set - x)))
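If collections.Counter is not available, a hand-written counter along the following lines should be enough for the report above. This is only a minimal sketch: the function name count_rows and the sample rows are made up for illustration and do not come from the files in the question.

```python
def count_rows(rows):
    """Return a dict mapping each row to the number of times it appears."""
    counts = {}
    for row in rows:
        # dict.get supplies 0 the first time a row is seen
        counts[row] = counts.get(row, 0) + 1
    return counts


# Hypothetical sample data standing in for the lines of account_names.csv
sample = ["acme", "globex", "acme", "initech", "acme"]
sample_counts = count_rows(sample)

# Same (count, row) report as above, restricted to rows that occur more than once
sample_dups = sorted((count, row) for row, count in sample_counts.items() if count > 1)
print(sample_dups)
```

The resulting dict can be used wherever counts is used in the code above, since only .items(), .values() and iteration over the keys are needed.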
answered 2013-10-30T01:24:09.300

To see what you really have in each list, you can reason by construction:

If you had only unique elements:

# list(...) keeps these concatenable on Python 3, where range is not a list
deduped = list(range(15368))
account_names2 = list(range(15387))
dupes2 = [ele for ele in account_names2 if ele not in deduped]  # len is 19

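However, because you have repeated copies both of elements that are missing from deduped and of elements that are still in it, you actually end up with something like: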

account_names = account_names2 + dupes2*18 + dupes2[:7] + account_names2[:1571 - 368]
dupes = [ele for ele in account_names if ele not in deduped]  # dupes will have 368 elements
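The comprehension only reports elements that are completely absent from deduped, so an extra copy of an element that still has one copy in deduped is never counted. One way to recover every removed copy is a multiset difference. The following is a sketch under assumptions: it rebuilds the constructed lists from this answer rather than reading the real files, and the helper name removed_by_dedupe is invented for illustration.

```python
from collections import Counter


def removed_by_dedupe(original, deduped):
    """Return every copy in original beyond what deduped contains, sorted."""
    # Counter subtraction keeps only positive counts, i.e. the surplus copies
    leftover = Counter(original) - Counter(deduped)
    return sorted(leftover.elements())


# Rebuild the constructed lists from this answer (not the asker's real data)
deduped = list(range(15368))
deduped_set = set(deduped)  # set lookup keeps the membership test fast
account_names2 = list(range(15387))
dupes2 = [ele for ele in account_names2 if ele not in deduped_set]
account_names = account_names2 + dupes2 * 18 + dupes2[:7] + account_names2[:1571 - 368]

removed = removed_by_dedupe(account_names, deduped)
print(len(account_names) - len(deduped))  # 1571
print(len(removed))                       # 1571
```

With the real data, the same call on the two lists read from the files should account for the full difference in length, provided the rows are normalised the same way (for example with strip()) in both lists.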
answered 2013-10-29T23:22:55.963