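How can I remove duplicate rows from a CSV file based on two columns, where one of the columns uses a regular expression to decide what counts as a match, with the rows grouped by the first field (IPAddress)? I also want to add a count field to each row that records how many duplicates it had: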

The .csv file:

IPAddress, Value1, Value2, Value3
127.0.0.1, Test1ABC, 10, 20
127.0.0.1, Test2ABC, 20, 30
127.0.0.1, Test1ABA, 30, 40
127.0.0.1, Value1BBA, 40, 50
127.0.0.1, Value1BBA, 40, 50
127.0.0.2, Test1ABC, 10, 20
127.0.0.2, Value1AAB, 20, 30
127.0.0.2, Value2ABA, 30, 40
127.0.0.2, Value1BBA, 40, 50

I want to match on IPAddress and Value1 (Value1 counts as a match if its first 5 characters match).

This would give me:

IPAddress, Value1, Value2, Value3, Count
127.0.0.1, Test1ABC, 10, 20, 2
127.0.0.1, Test2ABC, 20, 30, 1
**127.0.0.1, Test1ABA, 30, 40** (Line would be removed but counted)
127.0.0.1, Value1BBA, 40, 50, 2
**127.0.0.1, Value1BBA, 40, 50** (Line would be removed but counted)
127.0.0.2, Test1ABC, 10, 20, 1
127.0.0.2, Value1AAB, 20, 30, 2
127.0.0.2, Value2ABA, 30, 40, 1
**127.0.0.2, Value1BBA, 40, 50** (Line would be removed but counted)

New output:

IPAddress, Value1, Value2, Value3, Count
127.0.0.1, Test1ABC, 10, 20, 2
127.0.0.1, Test2ABC, 20, 30, 1
127.0.0.1, Value1BBA, 40, 50, 2
127.0.0.2, Test1ABC, 10, 20, 1
127.0.0.2, Value1AAB, 20, 30, 2
127.0.0.2, Value2ABA, 30, 40, 1

I tried using a set, but apparently a set cannot be indexed.

entries = set()
writer = csv.writer(open('myfilewithoutduplicates.csv', 'w'), delimiter=',')
for row in list:
    key = (row[0], row[1])
    if re.match(r"(Test1)", key[1]) not in entries:
        entries.add(key)

Pseudocode?:

# I want to iterate through rows of a csv file and
if row[0] and row[1][:5] match a previous entry:
    remove row
    add count
else:
    add row

Any help or guidance is much appreciated.

2 Answers

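You need a dictionary to track the matches. You don't need a regular expression; you only need to track the first 5 characters. Store each row under a "key" made up of the first column and the first 5 characters of the second column, and add a count to it. You have to count first, and only then write out the collected rows together with their counts.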

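If ordering matters, you can replace the dictionary with collections.OrderedDict() (on Python 3.7 and later a plain dict already preserves insertion order); otherwise the code is the same: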

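import csv

rows = {}

# inputfilename is the path to the source CSV file
with open(inputfilename, newline='') as inputfile:
    # skipinitialspace drops the blank that follows each comma in the sample data
    reader = csv.reader(inputfile, skipinitialspace=True)
    headers = next(reader)  # collect first row as headers for the output
    for row in reader:
        key = (row[0], row[1][:5])  # IP address plus first 5 characters of Value1
        if key not in rows:
            rows[key] = row + [0]  # keep the first row seen, with a count slot
        rows[key][-1] += 1  # count

with open('myfilewithoutduplicates.csv', 'w', newline='') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(headers + ['Count'])
    writer.writerows(rows.values())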
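
To see what the dictionary approach produces on the sample data from the question, here is a minimal in-memory sketch. The names dedupe_with_counts, prefix_len and SAMPLE are illustrative assumptions, not part of the original answer. Note that with the data as posted, a 5-character prefix also merges Value1... and Value2... (both start with "Value"), while a 6-character prefix reproduces the counts shown in the question's expected output.

```python
import csv
import io

# Sample data copied from the question (note the space after each comma).
SAMPLE = """\
IPAddress, Value1, Value2, Value3
127.0.0.1, Test1ABC, 10, 20
127.0.0.1, Test2ABC, 20, 30
127.0.0.1, Test1ABA, 30, 40
127.0.0.1, Value1BBA, 40, 50
127.0.0.1, Value1BBA, 40, 50
127.0.0.2, Test1ABC, 10, 20
127.0.0.2, Value1AAB, 20, 30
127.0.0.2, Value2ABA, 30, 40
127.0.0.2, Value1BBA, 40, 50
"""


def dedupe_with_counts(text, prefix_len=5):
    """Return (headers, rows): the first row of each group plus the group size."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    headers = next(reader) + ['Count']
    groups = {}  # plain dicts keep insertion order on Python 3.7+
    for row in reader:
        key = (row[0], row[1][:prefix_len])
        if key not in groups:
            groups[key] = row + [0]  # first row of the group, count starts at 0
        groups[key][-1] += 1
    return headers, list(groups.values())


headers, rows5 = dedupe_with_counts(SAMPLE, prefix_len=5)
_, rows6 = dedupe_with_counts(SAMPLE, prefix_len=6)

print(', '.join(headers))
for row in rows6:
    print(', '.join(str(field) for field in row))
```

If the rule really is "first 5 characters", rows5 is the result to use; it has five rows because the three Value... rows for 127.0.0.2 collapse into one.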
answered 2013-08-08T12:40:09.070

You can use numpy:

import numpy as np

# import data from file (assume file called a.csv), store as record array;
# autostrip removes the blank after each comma so the slice below starts at the value
a = np.genfromtxt('a.csv', delimiter=',', skip_header=1, dtype=None,
                  encoding='utf-8', autostrip=True)

# join the first column with the first 5 chars of the 2nd column, store in list p
p = [x + y[:5] for x, y in zip(a['f0'], a['f1'])]

# compare elements in p, get indexes of the first occurrence of each unique entry (m)
k, m = np.unique(p, return_index=True)

# use indexes to create a new list without dupes; sorted() restores file order
newlist = [a[v] for v in sorted(m)]

# the number of removed rows is the difference in lengths of the arrays
count = len(a) - len(newlist)
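
The difference in lengths gives a single total, whereas the question asks for a count on every kept row. A possible way to get that with numpy is np.unique with return_counts=True, sketched below on the question's data held in memory. The names rows, PREFIX_LEN, first_idx and sizes are assumptions made for this illustration, and PREFIX_LEN is set to 6 because that is what reproduces the counts in the question's expected output (5 would merge Value1... and Value2...).

```python
import numpy as np

# (IPAddress, Value1) pairs copied from the question's sample data.
rows = [
    ('127.0.0.1', 'Test1ABC'), ('127.0.0.1', 'Test2ABC'),
    ('127.0.0.1', 'Test1ABA'), ('127.0.0.1', 'Value1BBA'),
    ('127.0.0.1', 'Value1BBA'), ('127.0.0.2', 'Test1ABC'),
    ('127.0.0.2', 'Value1AAB'), ('127.0.0.2', 'Value2ABA'),
    ('127.0.0.2', 'Value1BBA'),
]

PREFIX_LEN = 6  # assumption: matches the expected output shown in the question

# One string key per row: IP address plus the Value1 prefix.
keys = np.array([ip + '|' + value[:PREFIX_LEN] for ip, value in rows])

# first_idx: index of the first row of each group; sizes: rows in each group.
uniq, first_idx, sizes = np.unique(keys, return_index=True, return_counts=True)

# np.unique orders groups by key, so reorder them by first appearance.
order = np.argsort(first_idx)
first_idx, sizes = first_idx[order], sizes[order]

for i, n in zip(first_idx, sizes):
    print(rows[i][0], rows[i][1], n)
```

Here first_idx selects the rows to keep and sizes holds the matching Count column, so the two arrays can be zipped together when writing the output file.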
answered 2013-08-08T16:10:02.587