I have a CSV file like this:

column1    column2

john       kerry
adam       stephenson
ashley     hudson
john       kerry
etc..

I want to remove the duplicates from this file, so that I end up with only:

column1    column2

john       kerry
adam       stephenson
ashley     hudson

I wrote this script, which removes duplicates based on the last name, but I need to remove duplicates based on both the first name and the last name.

import csv

reader=csv.reader(open('myfilewithduplicates.csv', 'r'), delimiter=',')
writer=csv.writer(open('myfilewithoutduplicates.csv', 'w'), delimiter=',')

lastnames = set()
for row in reader:
    if row[1] not in lastnames:
        writer.writerow(row)
        lastnames.add( row[1] )

3 Answers

You're really close. Just use both columns as the set entry:

entries = set()

for row in reader:
    key = (row[0], row[1])  # instead of just the last name

    if key not in entries:
        writer.writerow(row)
        entries.add(key)
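
For reference, a minimal self-contained Python 3 sketch of this approach (the file names and the comma delimiter are assumptions carried over from the question, not part of this answer):

import csv

# Sketch only: file names and delimiter are taken from the question.
with open('myfilewithduplicates.csv', 'r', newline='') as infile, \
     open('myfilewithoutduplicates.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter=',')
    writer = csv.writer(outfile, delimiter=',')

    entries = set()
    for row in reader:
        key = (row[0], row[1])  # first and last name together form the key
        if key not in entries:
            writer.writerow(row)
            entries.add(key)
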
Answered 2012-10-12T01:50:03.410

You can now use the .drop_duplicates method in pandas. I would do the following:

import pandas as pd
toclean = pd.read_csv('myfilewithduplicates.csv')
deduped = toclean.drop_duplicates(['column1', 'column2'])  # match on both columns, not just one
deduped.to_csv('myfilewithoutduplicates.csv', index=False)  # index=False keeps pandas from adding an index column
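
If you prefer explicit keyword arguments, recent pandas versions also accept subset= and keep= for drop_duplicates; a small sketch along the same lines, assuming the column headers from the question:

import pandas as pd

toclean = pd.read_csv('myfilewithduplicates.csv')
# keep='first' retains the first occurrence of each (column1, column2) pair
deduped = toclean.drop_duplicates(subset=['column1', 'column2'], keep='first')
deduped.to_csv('myfilewithoutduplicates.csv', index=False)

# Sanity check: no (column1, column2) pair should appear more than once afterwards
assert not deduped.duplicated(subset=['column1', 'column2']).any()
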
Answered 2013-06-13T02:29:25.773

A quick way is to create a set of unique rows, using the technique below (adapted from @CedricJulien in this post). You lose the benefit of DictWriter carrying the column names with each row, but it should work for your case:

>>> import csv
>>> with open('testcsv1.csv', 'r') as f:
...   reader = csv.reader(f)
...   uniq = [list(tup) for tup in set([tuple(row) for row in reader])]
...
>>> with open('nodupes.csv', 'w') as f:
...   writer=csv.writer(f)
...   for row in uniq:
...     writer.writerow(row)
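
Note that set() has no defined ordering, so the rows (including the header row, which is just another row as far as csv.reader is concerned) may come out in a different order than they appear in the input file.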

This uses the same technique as @CedricJulien, a nice one-liner for removing duplicate rows (defined as having the same first and last name), but with the DictReader/DictWriter classes:

>>> import csv
>>> with open('testcsv1.csv', 'r') as f:
...   reader = csv.DictReader(f)
...   rows = [row for row in reader]
...
>>> uniq = [dict(tup) for tup in set(tuple(person.items()) for person in rows)]
>>> with open('nodupes.csv', 'w') as f:
...   headers = ['column1', 'column2']
...   writer = csv.DictWriter(f, fieldnames=headers)
...   writer.writerow(dict((h, h) for h in headers))
...   for row in uniq:
...     writer.writerow(row)
...
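
As a side note, on Python 2.7 / 3.2 and later DictWriter also provides a writeheader() method, which can replace the manual writer.writerow(dict((h, h) for h in headers)) line.
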
Answered 2012-10-12T01:36:16.997