python - Remove duplicates and strip certain letters from a line if found

Question

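I'm new to Python.

I want to remove duplicate entries from the lines of a file, along with certain characters.

For example, I have the following file: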

A   786 65534 65534 786 786 786 786 10026/AS4637 19151 19151 19151 19151 19151 19151 10796/AS13706
B   786 65534 65534 786 786 786 3257 3257 3257 1257 1257 1257 1257 1257 1257 1257 49272

The output I want is:

A   786 10026 4637 19151 10796 13706
B   786 3257 1257 49272

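Two things are happening here. First, any number greater than 65000 needs to be removed. Second, sometimes two values are separated by a "/" and carry letters I don't want, such as the AS prefix.

I have the following code: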

import os

p = './testing/test.txt'
fin = open(p, 'r')
uniq = set()
for line in fin.readlines():
    word = line.rstrip().split(' ')[3:]
    if not word in uniq:
        uniq.add(word)
        print word
fin.close()

I get:

TypeError: unhashable type: 'list'

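As you can see, I can't even check whether a word is greater than 65000, because I can't get past removing the duplicates with set().

Please help me solve this.

I could really use some help here.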

score 0 · Accepted Answer

The problem is:

word = line.rstrip().split(' ')[3:]

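The split function returns a list of words. A list is not hashable, so you cannot test whether it is in a set or add it to one. You need to iterate over the strings in the split list and check each word individually.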
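To make the point concrete, here is a minimal sketch (Python 3 syntax, with a shortened sample line made up for illustration) showing that adding the list fails while adding its individual strings works. The tuple at the end is only an aside, in case whole lines need to be compared rather than single words.

```python
# A set can only hold hashable values; a list is mutable and therefore unhashable.
line = "B   786 65534 65534 786 786 786 3257 3257"
fields = line.rstrip().split()[1:]   # a list of strings, label "B" dropped

seen = set()
try:
    seen.add(fields)                 # raises TypeError (unhashable type: 'list')
except TypeError as exc:
    error_message = str(exc)

# Adding the individual strings works, because str is hashable.
for field in fields:
    seen.add(field)

# If whole lines should be compared instead, a tuple is a hashable stand-in.
seen_lines = {tuple(fields)}
```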

score 0 · Accepted Answer

As a start, this might help:

with open('./testing/test.txt') as fin:
    for line in fin:
        words = line.split()    # list of words
        new_words = []
        unique_words = set()    # words already seen on this line
        for word in words:
            # keep a word only if it is new and is not a number above 65000
            if (word not in unique_words and
                    (not word.isdigit() or int(word) <= 65000)):
                new_words.append(word)
                unique_words.add(word)
        new_line = ' '.join(new_words)
        print(new_line)

It turns this:

A   786 65534 65534 786 786 786 786 10026/AS4637 19151 19151 19151 19151 19151 19151 10796/AS13706

into this:

A 786 10026/AS4637 19151 10796/AS13706

Obviously this is not quite what you want yet, but try to work out the rest yourself. :) The str.replace() method might help you get rid of those /AS.
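For readers who want to see where that hint leads, here is one possible sketch in Python 3 syntax. The helper name clean_line and its limit parameter are made up for illustration. Replacing "/AS" with a space (rather than an empty string) is an assumption, chosen so that the two numbers stay separate, as in the desired output from the question.

```python
def clean_line(line, limit=65000):
    """Hypothetical helper: drop '/AS' markers, numbers above limit, and repeats."""
    # Turn "10026/AS4637" into two separate tokens, "10026" and "4637".
    words = line.replace('/AS', ' ').split()
    seen = set()    # words already kept on this line
    kept = []       # kept words, in their original order
    for word in words:
        if word in seen:
            continue
        if word.isdigit() and int(word) > limit:
            continue
        seen.add(word)
        kept.append(word)
    return ' '.join(kept)

sample = ("A   786 65534 65534 786 786 786 786 10026/AS4637 "
          "19151 19151 19151 19151 19151 19151 10796/AS13706")
print(clean_line(sample))   # A 786 10026 4637 19151 10796 13706
```

The result is joined with single spaces, so the wider gap after the leading label in the original file is not preserved.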
