I am trying to remove duplicate entries from data that looks like this:

name    phone   email   website
Diane Grant Albrecht M.S.           
Lannister G. Cersei M.A.T., CEP 111-222-3333    cersei@got.com  www.got.com
Argle D. Bargle Ed.M.           
Sam D. Man Ed.M.    000-000-1111    dman123@gmail.com   www.daManWithThePlan.com
Sam D. Man Ed.M.            
Sam D. Man Ed.M.    111-222-333     dman123@gmail.com   www.daManWithThePlan.com
D G Bamf M.S.           
Amy Tramy Lamy Ph.D.            

so that it ends up looking like this:

name    phone   email   website
Diane Grant Albrecht M.S.           
Lannister G. Cersei M.A.T., CEP 111-222-3333    cersei@got.com  www.got.com
Argle D. Bargle Ed.M.           
Sam D. Man Ed.M.    000-000-1111, 111-222-333   dman123@gmail.com   www.daManWithThePlan.com
D G Bamf M.S.           
Amy Tramy Lamy Ph.D.    

Here is my code:

from collections import defaultdict
import csv
import re

input = open('ieca_first_col_fake_text.txt', 'rU')

# default to empty set for phone, email, website, area, degrees
extracted_data = defaultdict(lambda: [set(), set(), set()])

for row in input:
    for index, value in enumerate(row):    
        name = row[0]
        data = extracted_data[name].add(row)

for row in data: print row

I get this error:

AttributeError: 'list' object has no attribute 'add'
logout

Update:

from collections import defaultdict
import csv
import re

input = open('ieca_first_col_fake_text.txt', 'rU')
input_r = csv.reader(input, delimiter = '\t')

# default to empty set for phone, email, website, area, degrees
extracted_data = defaultdict(lambda: [set(), set(), set()])

data = []

# Index on the name and then for that name add the rest of the information. 
for row in input_r:

    data_set = extracted_data[row[0]]
    for index, value in enumerate(row[1:]):
        data_set[index].add(value)

print data_set

Output:

[set(['']), set(['']), set([''])]
logout
1 Answer

Each value in extracted_data is a list of 3 sets:

extracted_data = defaultdict(lambda: [set(), set(), set()])

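The list itself has no .add() method, which is what the AttributeError is telling you. You need to read the previous answer more carefully and pick the right set in that list to call .add() on.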
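As a minimal sketch of the difference, written in Python 3 syntax (the question uses Python 2) and using one name and a few values from the sample data; the variable names `entry` and `error` are illustrative only:

```python
from collections import defaultdict

# Each new key gets a list of three sets: phone, email, website.
extracted_data = defaultdict(lambda: [set(), set(), set()])

entry = extracted_data['Sam D. Man Ed.M.']

# The value is a list, and lists have no .add() method.
try:
    entry.add('000-000-1111')
    error = None
except AttributeError as exc:
    error = str(exc)
print(error)

# Index into the list to pick one set, then call .add() on that set.
entry[0].add('000-000-1111')       # phone set
entry[0].add('111-222-333')        # second phone number, same name
entry[1].add('dman123@gmail.com')  # email set
entry[1].add('dman123@gmail.com')  # adding a duplicate changes nothing
```

Because the values are sets, adding the same email twice leaves a single copy, which is what makes this structure suitable for merging duplicate rows.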

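The previous answer loops over the 4 elements of an input line, uses the first element to look up the list of sets, and adds each of the other 3 elements to the corresponding set: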

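# `split` is assumed to be the helper from the previous answer that
# breaks one input line into its 4 fields
for index, value in enumerate(split(entry)):
    if index == 0:
        # the first field is the name; use it to look up the list of sets
        data_set = extracted_data[value]
    elif value:
        # skip empty fields; field N goes into set N - 1
        data_set[index - 1].add(value)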

Personally, I'd use:

entry = entry.rstrip('\n').split('\t')  # split on tabs; the names contain spaces
for value, dset in zip(entry[1:], extracted_data[entry[0]]):
    if value:  # skip empty fields
        dset.add(value)

to achieve the same thing.
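Putting the pieces together, a rough end-to-end sketch might look like the following. It is written in Python 3 (the question uses Python 2), assumes the input is tab-separated as in the question's update, and reads a few sample rows from an in-memory string instead of the real file. The names `SAMPLE` and `merge_rows` are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the question; a real script would open the file instead.
SAMPLE = (
    "name\tphone\temail\twebsite\n"
    "Diane Grant Albrecht M.S.\t\t\t\n"
    "Sam D. Man Ed.M.\t000-000-1111\tdman123@gmail.com\twww.daManWithThePlan.com\n"
    "Sam D. Man Ed.M.\t\t\t\n"
    "Sam D. Man Ed.M.\t111-222-333\tdman123@gmail.com\twww.daManWithThePlan.com\n"
    "Amy Tramy Lamy Ph.D.\t\t\t\n"
)


def merge_rows(lines):
    """Merge rows that share a name.

    Returns (header, merged), where merged maps each name to a list of
    three sets: phones, emails, websites.
    """
    reader = csv.reader(lines, delimiter='\t')
    header = next(reader)  # keep the header row out of the data
    merged = defaultdict(lambda: [set(), set(), set()])
    for row in reader:
        if not row:
            continue  # ignore blank lines
        # Looking up the name creates its entry even if every field is empty.
        for value, dset in zip(row[1:], merged[row[0]]):
            if value:  # skip empty fields so '' never ends up in a set
                dset.add(value)
    return header, merged


header, merged = merge_rows(io.StringIO(SAMPLE))

# Print one tab-separated line per name, joining multiple values with ', '.
print('\t'.join(header))
for name, sets in merged.items():
    print('\t'.join([name] + [', '.join(sorted(s)) for s in sets]))
```

Two details in the question's update explain its output: `print data_set` sits after the loop, so it only shows the sets for the last row read, and without an `if value` check the empty fields of that row are added as `''`.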

answered 2013-07-08T19:52:45.283