0

In test.txt:

1   a
2   b
3   c
4   a
5   d
6   c

I want to drop every line whose letter appears more than once and save the rest in test2.txt:

2   b
5   d

I tried starting with the code below.

file1 = open('../test.txt').read().split('\n')
#file2 = open('../test2.txt', "w")
word = set()
for line in file1:
    if line:
        sline = line.split('\t')
        if sline[1] not in word:
            print sline[0], sline[1]              
            word.add(sline[1])
#file2.close()

The code prints:

1   a
2   b
3   c
5   d

Any suggestions?

4 Answers

3

You can use collections.OrderedDict here:

from collections import OrderedDict

with open('abc') as f:
    dic = OrderedDict()  # remembers the order in which letters first appear
    for line in f:
        v, k = line.split()
        # group the line numbers under their letter
        dic.setdefault(k, []).append(v)

Now dic looks like:

OrderedDict([('a', ['1', '4']), ('b', ['2']), ('c', ['3', '6']), ('d', ['5'])])

Now we only need the keys whose list contains exactly one item.

for k, v in dic.iteritems():
    if len(v) == 1:
        print v[0], k

Output:

2 b
5 d
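
The snippet above only prints the result, while the question asks for it to be saved in test2.txt. Here is a rough sketch of the same grouping idea that also writes a file. It assumes Python 3.7+ (where a plain dict keeps insertion order, so OrderedDict is not needed) and tab-separated output; the function names unique_lines and write_unique are made up for illustration.

```python
def unique_lines(lines):
    """Return (number, letter) pairs whose letter occurs exactly once."""
    groups = {}  # letter -> list of line numbers, in first-seen order
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        number, letter = parts
        groups.setdefault(letter, []).append(number)
    return [(nums[0], letter) for letter, nums in groups.items() if len(nums) == 1]


def write_unique(src, dst):
    """Copy the non-repeated lines of src to dst (hypothetical helper)."""
    with open(src) as fin:
        kept = unique_lines(fin)
    with open(dst, "w") as fout:
        for number, letter in kept:
            fout.write(f"{number}\t{letter}\n")  # tab separator is an assumption


sample = ["1\ta", "2\tb", "3\tc", "4\ta", "5\td", "6\tc"]
print(unique_lines(sample))  # [('2', 'b'), ('5', 'd')]
```

With the files from the question, the call would be something like write_unique('../test.txt', '../test2.txt').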
answered 2013-07-10T13:36:17.627
1

What your code does is make sure each item (letter) is printed only once, which is not what you said you want.

You have to split your code into two halves: first read the file and collect a count for each letter, then print only the letters with count == 1.

Converting your original code (I just made it a bit simpler):

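words = {}
with open('../test.txt') as file1:
    for line in file1:
        line = line.strip()  # drop the trailing newline
        if line:  # skip blank lines
            line_num, letter = line.split('\t')
            if letter not in words:
                # first sighting: remember the count and the line number
                words[letter] = [1, line_num]
            else:
                words[letter][0] += 1

# note: a plain dict in Python 2 is unordered, so the output order may vary
for letter, (count, line_num) in words.iteritems():
    if count == 1:
        print line_num, letter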
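
The same count-then-filter idea can also be sketched with collections.Counter. This is only an illustration, assuming Python 3 and that every non-blank line has exactly two whitespace-separated fields; the function name lines_with_unique_letter is made up. Filtering the original rows in a second pass keeps the file order.

```python
from collections import Counter


def lines_with_unique_letter(lines):
    """Keep the lines whose second field appears on exactly one line."""
    rows = [line.split() for line in lines if line.strip()]
    # Pass 1: count how often each letter occurs.
    counts = Counter(letter for _, letter in rows)
    # Pass 2: keep rows whose letter was counted once, in file order.
    return [(num, letter) for num, letter in rows if counts[letter] == 1]


sample = "1\ta\n2\tb\n3\tc\n4\ta\n5\td\n6\tc\n".splitlines()
for num, letter in lines_with_unique_letter(sample):
    print(num, letter)
# prints:
# 2 b
# 5 d
```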
answered 2013-07-10T13:51:00.793
1

I tried to keep it as close to your style as possible:

file1 = open('../test.txt').read().split('\n')

word = set()       # letters seen so far
test = []          # every non-empty line as [number, letter]
sin_duple = set()  # letters that occur more than once
for line in file1:
    if line:
        sline = line.split('   ')
        test.append([sline[0], sline[1]])
        if sline[1] not in word:
            word.add(sline[1])
        else:
            sin_duple.add(sline[1])

# write only the lines whose letter was never repeated
# (filtering here avoids removing items from test while looping over it)
file2 = open("../test2.txt", 'w')
for num, letter in test:
    if letter not in sin_duple:
        file2.write("%s   %s\n" % (num, letter))
file2.close()
answered 2013-07-10T17:31:09.010
0

How about some pandas:

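import pandas as pd

a = pd.read_csv("test_remove_dupl.txt", sep=",")

# keep=False removes every row whose value in column "a" is repeated,
# including the first occurrence (the default keeps the first one)
b = a.drop_duplicates(subset="a", keep=False)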
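
For a version that runs on the data from the question, here is a small self-contained sketch. It assumes a recent pandas (where drop_duplicates takes subset and keep) and tab-separated input with no header row; the column names "num" and "letter" are made up for illustration.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained.
text = "1\ta\n2\tb\n3\tc\n4\ta\n5\td\n6\tc\n"

# The data has no header row, so column names are supplied here
# ("num" and "letter" are hypothetical names).
df = pd.read_csv(io.StringIO(text), sep="\t", header=None, names=["num", "letter"])

# keep=False drops every row whose letter is repeated, including the first.
unique = df.drop_duplicates(subset="letter", keep=False)

# Serialise without header or index to mimic the input layout.
result = unique.to_csv(sep="\t", header=False, index=False)
print(result)
```

Passing a path such as "../test2.txt" as the first argument of to_csv would write the file instead of returning a string.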
answered 2013-07-10T13:53:15.640