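How can I remove duplicate rows from a CSV file based on two columns, where one of the columns uses a regular expression to determine a match and the rows are grouped by the first field (IPAddress)? Finally, I want to add a count field to each row that counts the duplicate rows:
.csv file: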
IPAddress, Value1, Value2, Value3
127.0.0.1, Test1ABC, 10, 20
127.0.0.1, Test2ABC, 20, 30
127.0.0.1, Test1ABA, 30, 40
127.0.0.1, Value1BBA, 40, 50
127.0.0.1, Value1BBA, 40, 50
127.0.0.2, Test1ABC, 10, 20
127.0.0.2, Value1AAB, 20, 30
127.0.0.2, Value2ABA, 30, 40
127.0.0.2, Value1BBA, 40, 50
I want to match on IPAddress and Value1 (Value1 counts as a match if the first 5 characters match).
This would give me:
IPAddress, Value1, Value2, Value3, Count
127.0.0.1, Test1ABC, 10, 20, 2
127.0.0.1, Test2ABC, 20, 30, 1
**127.0.0.1, Test1ABA, 30, 40** (Line would be removed but counted)
127.0.0.1, Value1BBA, 40, 50, 2
**127.0.0.1, Value1BBA, 40, 50** (Line would be removed but counted)
127.0.0.2, Test1ABC, 10, 20, 1
127.0.0.2, Value1AAB, 20, 30, 2
127.0.0.2, Value2ABA, 30, 40, 1
**127.0.0.2, Value1BBA, 40, 50** (Line would be removed but counted)
New output:
IPAddress, Value1, Value2, Value3, Count
127.0.0.1, Test1ABC, 10, 20, 2
127.0.0.1, Test2ABC, 20, 30, 1
127.0.0.1, Value1BBA, 40, 50, 2
127.0.0.2, Test1ABC, 10, 20, 1
127.0.0.2, Value1AAB, 20, 30, 2
127.0.0.2, Value2ABA, 30, 40, 1
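One detail in the sample data is worth checking: a literal five-character prefix would put Value1AAB and Value2ABA in the same group (both start with "Value"), while the expected output keeps them apart. The sketch below compares two possible grouping keys. The function names `prefix_key` and `pattern_key` and the regular expression `[A-Za-z]+\d` are assumptions made for illustration; they are not part of the original attempt.

```python
import re

# Hypothetical pattern: leading letters followed by one digit, e.g. "Test1" or "Value2".
NAME_PATTERN = re.compile(r"[A-Za-z]+\d")

def prefix_key(row):
    """Group by IP address and the first five characters of Value1."""
    return (row[0], row[1][:5])

def pattern_key(row):
    """Group by IP address and the regex match at the start of Value1."""
    match = NAME_PATTERN.match(row[1])
    # Fall back to the whole field when the pattern does not match.
    return (row[0], match.group(0) if match else row[1])

rows = [
    ["127.0.0.2", "Value1AAB", "20", "30"],
    ["127.0.0.2", "Value2ABA", "30", "40"],
    ["127.0.0.2", "Value1BBA", "40", "50"],
]
print([prefix_key(r) for r in rows])   # all three rows share the key ("127.0.0.2", "Value")
print([pattern_key(r) for r in rows])  # "Value1" and "Value2" end up in separate groups
```

With the regex-based key, the grouping appears to line up with the counts in the expected output; with the plain slice, the three Value rows for 127.0.0.2 would collapse into one.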
I tried using a set, but apparently a set cannot be indexed.
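import csv
import re

entries = set()
# newline='' lets the csv module control line endings
writer = csv.writer(open('myfilewithoutduplicates.csv', 'w', newline=''), delimiter=',')
for row in rows:  # rows: the list of rows already read from the input CSV
    key = (row[0], row[1])
    # Problem: re.match returns a match object (or None), which is never a member of a set of tuples
    if re.match(r"(Test1)", key[1]) not in entries:
        entries.add(key)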
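A set can only answer whether a key has been seen; it cannot hand back the stored row so that its count can be updated. One possible alternative, sketched here on the assumption that each row is a list of strings, is a dict keyed by the (IPAddress, prefix) tuple. The variable name `seen` is illustrative.

```python
seen = {}  # maps (ip, prefix) -> first row seen for that key, with a count appended

for row in [
    ["127.0.0.1", "Test1ABC", "10", "20"],
    ["127.0.0.1", "Test1ABA", "30", "40"],
]:
    key = (row[0], row[1][:5])
    if key in seen:
        seen[key][-1] += 1     # duplicate: bump the count on the kept row
    else:
        seen[key] = row + [1]  # first occurrence: keep it with a count of 1

print(seen[("127.0.0.1", "Test1")])  # ['127.0.0.1', 'Test1ABC', '10', '20', 2]
```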
Pseudocode?:
# I want to iterate through rows of a csv file and
if row[0] and row[1][:5] match a previous entry:
    remove row
    add count
else:
    add row
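The pseudocode above could be turned into something like the following minimal sketch. It assumes the input has a header row and a space after each comma, as in the sample; the function name `dedupe_with_counts` and its `key_func` parameter are hypothetical. The default key uses the literal `row[1][:5]` slice from the pseudocode, and a different key function can be passed in if the matching rule needs a regex instead.

```python
import csv
import io

def dedupe_with_counts(infile, outfile, key_func=lambda row: (row[0], row[1][:5])):
    """Copy CSV rows, keeping the first row per key and appending a Count column."""
    # skipinitialspace drops the space that follows each comma in the sample file
    reader = csv.reader(infile, skipinitialspace=True)
    header = next(reader)
    kept = {}  # dicts keep insertion order (Python 3.7+), so first-seen order is preserved
    for row in reader:
        if not row:
            continue  # skip blank lines
        key = key_func(row)
        if key in kept:
            kept[key][-1] += 1     # duplicate: drop the row but count it
        else:
            kept[key] = row + [1]  # first occurrence: keep it with a count of 1
    writer = csv.writer(outfile, lineterminator="\n")
    writer.writerow(header + ["Count"])
    writer.writerows(kept.values())

# In-memory stand-in for the first rows of the sample file
sample = io.StringIO(
    "IPAddress, Value1, Value2, Value3\n"
    "127.0.0.1, Test1ABC, 10, 20\n"
    "127.0.0.1, Test2ABC, 20, 30\n"
    "127.0.0.1, Test1ABA, 30, 40\n"
)
result = io.StringIO()
dedupe_with_counts(sample, result)
print(result.getvalue())
```

With real files, the same function can be called on handles opened with `newline=''`, which is what the `csv` module documentation recommends. Note that the writer emits fields without a space after each comma.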
Any help or guidance would be much appreciated.