4

I want to remove duplicate words from a text file.

I have some text files which contain content like the following:

None_None

ConfigHandler_56663624
ConfigHandler_56663624
ConfigHandler_56663624
ConfigHandler_56663624

None_None

ColumnConverter_56963312
ColumnConverter_56963312

PredicatesFactory_56963424
PredicatesFactory_56963424

PredicateConverter_56963648
PredicateConverter_56963648

ConfigHandler_80134888
ConfigHandler_80134888
ConfigHandler_80134888
ConfigHandler_80134888

The output needs to be:

None_None

ConfigHandler_56663624

ColumnConverter_56963312

PredicatesFactory_56963424

PredicateConverter_56963648

ConfigHandler_80134888

I only tried this command: en=set(open('file.txt') but it doesn't work.

Can anyone help me with how to extract only the unique entries from the file?

Thanks


7 Answers

8

Here is a simple solution that uses a set to remove duplicates from a text file.

# Read every line, deduplicate via a set (order is not preserved),
# then overwrite the same file with the unique lines.
with open('workfile.txt', 'r') as f:
    lines = f.readlines()

lines_set = set(lines)

with open('workfile.txt', 'w') as out:
    for line in lines_set:
        out.write(line)
Answered 2013-04-05T09:40:33.633
5

Here is an option that preserves order (unlike a set) but still has the same behaviour (note that the EOL characters are deliberately stripped and blank lines are ignored)...

from collections import OrderedDict

with open('/home/jon/testdata.txt') as fin:
    lines = (line.rstrip() for line in fin)  # strip the EOL characters
    # fromkeys keeps only the first occurrence of each non-blank line,
    # in the order it was first seen
    unique_lines = OrderedDict.fromkeys(line for line in lines if line)

print unique_lines.keys()
# ['None_None', 'ConfigHandler_56663624', 'ColumnConverter_56963312', 'PredicatesFactory_56963424', 'PredicateConverter_56963648', 'ConfigHandler_80134888']

Then you just need to write the above to your output file.
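For example, a minimal sketch of that last step (the output path is just a placeholder), restoring the newline that rstrip() removed:

with open('/home/jon/output.txt', 'w') as fout:  # placeholder output path
    for line in unique_lines:                    # iterating an OrderedDict yields its keys
        fout.write(line + '\n')                  # restore the stripped newline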

Answered 2013-04-05T09:32:17.237
2

Here is how to do it using a set (unordered result):

from pprint import pprint

with open('input.txt', 'r') as f:
    pprint(set(f.readlines()))  # pprint prints the set itself and returns None

Also, you will probably want to get rid of the newline characters.
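For example, a sketch of the same idea with the newlines stripped before the lines go into the set:

with open('input.txt', 'r') as f:
    # rstrip('\n') drops the EOL, so 'foo\n' and 'foo' collapse into one entry
    unique = set(line.rstrip('\n') for line in f)

pprint(unique)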

Answered 2013-04-05T09:30:20.750
1
def remove_duplicates(infile):
    storehouse = set()  # every line seen so far
    with open('outfile.txt', 'w') as out, open(infile) as fin:
        for line in fin:
            if line not in storehouse:  # write only the first occurrence
                out.write(line)
                storehouse.add(line)

remove_duplicates('infile.txt')
Answered 2016-07-29T12:06:37.997
0

If you just want the output without duplicates, you can use uniq and sort:

hvn@lappy: /tmp () $ sort -nr dup | uniq
PredicatesFactory_56963424
PredicateConverter_56963648
None_None
ConfigHandler_80134888
ConfigHandler_56663624
ColumnConverter_56963312
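As an aside, sort -u dup would give the same unique lines (in ascending order) in a single step.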

In Python:

In [2]: with open("dup", 'rt') as f:
   ...:     lines = f.readlines()
   ...:

In [3]: lines
Out[3]: 
['None_None\n',
 '\n',
 'ConfigHandler_56663624\n',
 'ConfigHandler_56663624\n',
 'ConfigHandler_56663624\n',
 'ConfigHandler_56663624\n',
 '\n',
 'None_None\n',
 '\n',
 'ColumnConverter_56963312\n',
 'ColumnConverter_56963312\n',
 '\n',
 'PredicatesFactory_56963424\n',
 'PredicatesFactory_56963424\n',
 '\n',
 'PredicateConverter_56963648\n',
 'PredicateConverter_56963648\n',
 '\n',
 'ConfigHandler_80134888\n',
 'ConfigHandler_80134888\n',
 'ConfigHandler_80134888\n',
 'ConfigHandler_80134888\n']

In [4]: set(lines)
Out[4]: 
set(['ColumnConverter_56963312\n',
     '\n',
     'PredicatesFactory_56963424\n',
     'ConfigHandler_56663624\n',
     'PredicateConverter_56963648\n',
     'ConfigHandler_80134888\n',
     'None_None\n'])
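
To finish the job, write the set back out, for example (a sketch continuing the same session; the output name "dedup" is just an example, and the blank lines survive as the lone '\n' entry):

In [5]: with open('dedup', 'wt') as out:
   ...:     out.writelines(set(lines))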
Answered 2013-04-05T09:32:39.403
0
uniq = set()
with open('yourfile', 'r') as f:  # the input is plain text, one entry per line
    for p in f:
        p = p.rstrip()
        if p in uniq:
            print "duplicate : " + p
        else:
            uniq.add(p)
print uniq
Answered 2013-04-05T10:38:07.837
0

This writes the deduplicated result back into the same file it was read from:

import os
import uuid

def _remove_duplicates(filePath):
    with open(filePath, 'r') as f:
        lines_set = set(f.readlines())
    # Write the unique lines to a temporary file first, then swap it
    # over the original, so the input is never left half-written.
    tmp_file = str(uuid.uuid4())
    with open(tmp_file, 'w') as out:
        for line in lines_set:
            out.write(line)
    os.rename(tmp_file, filePath)
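Call it with the path you want deduplicated in place, e.g. _remove_duplicates('workfile.txt'); the rename replaces the original only after all the unique lines have been written out.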
Answered 2016-07-28T14:31:44.247