python - Removing duplicates in Python

Question

I have a question about removing duplicates in Python. I've read a lot of posts but haven't been able to solve it. I have the following csv file:

EDIT

Input:

ID, Source, 1.A, 1.B, 1.C, 1.D
1, ESPN, 5,7,,,M
1, NY Times,,10,12,W
1, ESPN, 10,,Q,,M

The output should be:

ID, Source, 1.A, 1.B, 1.C, 1.D, duplicate_flag
1, ESPN, 5,7,,,M, duplicate
1, NY Times,,10,12,W, duplicate
1, ESPN, 10,,Q,,M, duplicate 
1, NY Times, 5 (or 10 doesn't matter which one),7, 10, 12, W, not_duplicate

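In other words, if the ID is the same, take the values from the row whose source is "NY Times". If the "NY Times" row has a blank value and a duplicate row from the "ESPN" source has a value for that cell, take the value from the "ESPN" row. For the output, flag the original rows as duplicates and add a new merged row.

To clarify further: since I need to run this script on many different csv files with different column headers, I can't do something like this: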

    import csv

    def main():
        # input_csv and output_csv are file paths defined elsewhere.
        with open(input_csv, newline="") as infile:
            input_fields = ("ID", "Source", "1.A", "1.B", "1.C", "1.D")
            reader = csv.DictReader(infile, fieldnames=input_fields)
            with open(output_csv, "w", newline="") as outfile:
                output_fields = ("ID", "Source", "1.A", "1.B", "1.C", "1.D", "d_flag")
                writer = csv.DictWriter(outfile, fieldnames=output_fields)
                writer.writeheader()
                next(reader)  # skip the header row, since fieldnames is given explicitly
                first_row = next(reader)
                for next_row in reader:
                    pass  # stuff

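because I want the program to operate on the first two columns independently of any other columns in the table. In other words, "ID" and "Source" will be in every input file, but the remaining columns will vary from file to file.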

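Thanks a lot for any help you can provide! FYI, "Source" can only be NY Times, ESPN, or Wall Street Journal, and the priority order for duplicates is: take NY Times if present, otherwise ESPN, otherwise Wall Street Journal. This holds for every input file.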
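On the point about varying columns: `csv.DictReader` takes its field names from the first row of the file when no `fieldnames` argument is passed, so the column list does not have to be hard-coded. A minimal sketch, assuming a small in-memory sample (`SAMPLE` and the helper `read_rows` are illustrative names, not part of the question):

```python
import csv
import io

# Hypothetical sample standing in for a real input file.
SAMPLE = "ID, Source, 1.A, 1.B\n1, ESPN, 5,7\n1, NY Times,,10\n"

def read_rows(handle):
    """Return (fieldnames, rows), taking the header from the file's first line."""
    # skipinitialspace drops the blank that follows each comma in the sample.
    reader = csv.DictReader(handle, skipinitialspace=True)
    rows = list(reader)
    return reader.fieldnames, rows

fieldnames, rows = read_rows(io.StringIO(SAMPLE))

# Every column other than the two fixed ones is treated as a value column.
value_columns = [name for name in fieldnames if name not in ("ID", "Source")]
print(value_columns)
```

With a real file, the same helper could be called as `read_rows(open(path, newline=""))`, where `path` is whatever input file is being processed.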

score 2 · Accepted Answer

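The code below reads all the records into one big dictionary whose keys are the record IDs and whose values are dictionaries mapping each source name to the full data row (plus the list of original rows under "all_rows"). It then iterates over the dictionary and produces the output you asked for.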

    import csv

    header = None
    idfld = None
    sourcefld = None

    # Maps each ID to {source name: merged row for that source, "all_rows": [original rows]}.
    record_table = {}

    with open('input.csv', newline='') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            row = [x.strip() for x in row]

            if header is None:
                # The first row is the header; locate the two columns every file has.
                header = row
                idfld = header.index('ID')
                sourcefld = header.index('Source')
                continue

            key = row[idfld]
            sourcename = row[sourcefld]

            if key not in record_table:
                # Store a copy per source so merging never alters the original row.
                record_table[key] = {sourcename: list(row), "all_rows": [row]}
            else:
                if sourcename in record_table[key]:
                    # Same ID and source seen before: fill its blank cells from this row.
                    cur_row = record_table[key][sourcename]
                    for i, fld in enumerate(row):
                        if cur_row[i] == '':
                            cur_row[i] = fld
                else:
                    record_table[key][sourcename] = list(row)
                record_table[key]["all_rows"].append(row)

    print(', '.join(header) + ', duplicate_flag')

    for recordid in record_table:
        rowdict = record_table[recordid]

        final_row = [''] * len(header)

        # Number of original rows for this ID. rowdict also holds the
        # "all_rows" key, so len(rowdict) would overcount.
        rowcount = len(rowdict["all_rows"])

        # Fill each cell from the highest-priority source that has a value for it.
        for sourcetype in ['NY Times', 'ESPN', 'Wall Street Journal']:
            if sourcetype in rowdict:
                row = rowdict[sourcetype]
                for i, fld in enumerate(row):
                    if final_row[i] != '':
                        continue
                    if fld != '':
                        final_row[i] = fld

        # Only IDs that occur more than once have their original rows flagged.
        if rowcount > 1:
            for row in rowdict["all_rows"]:
                print(', '.join(row) + ', duplicate')

        print(', '.join(final_row) + ', not_duplicate')
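
The same idea can also be sketched with `csv.DictReader` and `csv.DictWriter`, so that rows are addressed by column name and the result is written back as CSV rather than printed. This is only an illustration under some assumptions: the function name `flag_and_merge` and the constant `PRIORITY` are made up here, and `SAMPLE` is a tidied version of the question's data (the rows as posted have inconsistent field counts), with a second ID added to show the non-duplicate case.

```python
import csv
import io
from collections import defaultdict

# Priority order stated in the question: earlier sources win.
PRIORITY = ["NY Times", "ESPN", "Wall Street Journal"]

def flag_and_merge(rows, fieldnames, id_col="ID", source_col="Source"):
    """Return output rows (dicts) with an added 'duplicate_flag' column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[id_col]].append(row)

    rank = {name: i for i, name in enumerate(PRIORITY)}
    output = []
    for group in groups.values():
        if len(group) == 1:
            output.append({**group[0], "duplicate_flag": "not_duplicate"})
            continue
        # Keep the original rows, flagged as duplicates.
        for row in group:
            output.append({**row, "duplicate_flag": "duplicate"})
        # Stable sort: highest-priority source first, unknown sources last.
        ordered = sorted(group, key=lambda r: rank.get(r[source_col], len(PRIORITY)))
        merged = {}
        for name in fieldnames:
            # First non-empty value in priority order, or '' if every row is blank.
            merged[name] = next((r[name] for r in ordered if r.get(name)), "")
        merged["duplicate_flag"] = "not_duplicate"
        output.append(merged)
    return output

# Assumed, cleaned-up sample data (six fields per row).
SAMPLE = """ID,Source,1.A,1.B,1.C,1.D
1,ESPN,5,7,,M
1,NY Times,,10,12,W
1,ESPN,10,,Q,M
2,ESPN,3,,,
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
result = flag_and_merge(rows, reader.fieldnames)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ["duplicate_flag"])
writer.writeheader()
writer.writerows(result)
print(out.getvalue())
```

Because the sort is stable, two rows from the same source keep their input order, so the earlier row's value wins when both have one. Swapping the `io.StringIO` objects for files opened with `newline=""` would make the sketch work on real input and output files.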
