-1

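I need your help.

I have a text file containing many lines, each of which is a list of items. I need to extract every item that occurs two or more times across the file and write those items to another file. Here is an example.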

['COLG-CAD-406', 'CSAL-CAD-030', 'COLG-CAD-533', 'COLG-CAD-188']

['COLG-CAD-188']

['CSAL-CAD-030']

['EPHAG-JAE-004']

['COLG-CAD-188', 'CEM-SEV-004']

['COL-CAD-188', 'COLG-CAD-406']

The output should be

['COLG-CAD-406'], 2

['CSAL-CAD-030'], 2

['COLG-CAD-188'], 3

and so on, until the end of the file.

Thank you in advance for your help.

4 Answers

2

How about:

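import ast

count = {}  # one tally for the whole file, not one per line
for x in f:
    if not x.strip():
        continue  # skip blank lines, which literal_eval cannot parse
    for w in ast.literal_eval(x):
        count[w] = count.get(w, 0) + 1
for word, freq in count.items():
    if freq >= 2:
        print("%s %d" % (word, freq))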

where f is your open file.
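The question also asks for the result to be written to another file. The following is a minimal sketch of one way to do that, assuming Python 3 and the same one-list-per-line input. The function name `write_frequent_items` and its parameters are illustrative and not part of the answer; `lines` can be any iterable of strings (an open file works) and `out` any writable text stream.

```python
import ast
from collections import Counter


def write_frequent_items(lines, out, min_count=2):
    """Count items across all lines and write those seen at least min_count times.

    `lines` and `out` are hypothetical parameter names chosen for this sketch.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between the lists
        counts.update(ast.literal_eval(line))
    for item, freq in counts.items():
        if freq >= min_count:
            # Same "['ITEM'], N" layout as the example in the question.
            out.write("['%s'], %d\n" % (item, freq))
    return counts
```

With two files opened as `src` (for reading) and `dst` (for writing), the call would be `write_frequent_items(src, dst)`. The order of the output lines follows the order in which items are first seen, which may differ from the order shown in the question.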

answered 2012-04-27T18:20:38.293
0

Input:

['COLG-CAD-406', 'CSAL-CAD-030', 'COLG-CAD-533', 'COLG-CAD-188']

['COLG-CAD-188']

['CSAL-CAD-030']

['EPHAG-JAE-004']

['COLG-CAD-188', 'CEM-SEV-004']

['COL-CAD-188', 'COLG-CAD-406']

Output

>>> from collections import Counter
>>> from ast import literal_eval
>>> with open('input.txt') as f:
        c = Counter(word for line in f if line.strip() for word in literal_eval(line))


>>> print '\n'.join('{0}, {1}'.format([word],freq) for word,freq in c.iteritems() if freq >= 2)
['CSAL-CAD-030'], 2
['COLG-CAD-406'], 2
['COLG-CAD-188'], 3
answered 2012-04-28T14:03:46.777
0

If you are using Python 2.7 or later, then with this input (saved as list1.txt):

['COLG-CAD-406', 'CSAL-CAD-030', 'COLG-CAD-533', 'COLG-CAD-188']
['COLG-CAD-188']
['CSAL-CAD-030']
['EPHAG-JAE-004']
['COLG-CAD-188', 'CEM-SEV-004']
['COLG-CAD-188', 'COLG-CAD-406']

and this Python program:

from collections import Counter
import ast

cnt = Counter()

with open("list1.txt") as lfile:
    for line in lfile:
        # eval() could lead to python code injection so use literal_eval
        # the result is a list that we can directly use to update cnt keys
        cnt.update(ast.literal_eval(line))

for k, v in cnt.items():
    if v >= 2:
        print("%s: %d" % (k, v))

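you get what you want (COLG-CAD-188 is counted 4 times here rather than 3 because the last line of list1.txt contains COLG-CAD-188 where the question's sample has COL-CAD-188):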

CSAL-CAD-030: 2
COLG-CAD-406: 2
COLG-CAD-188: 4
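If the results should be ordered by frequency, `Counter.most_common()` returns the pairs sorted from most to least common. Below is a small sketch along those lines, assuming Python 3; the function name `frequent_items` and the in-memory `sample` list (a copy of list1.txt) are only for illustration.

```python
import ast
from collections import Counter


def frequent_items(lines, min_count=2):
    """Return (item, count) pairs with count >= min_count, most common first."""
    cnt = Counter()
    for line in lines:
        if line.strip():  # ignore blank lines
            cnt.update(ast.literal_eval(line))
    # most_common() sorts by count in descending order.
    return [(item, n) for item, n in cnt.most_common() if n >= min_count]


# Hypothetical in-memory copy of list1.txt, so the sketch runs without a file.
sample = [
    "['COLG-CAD-406', 'CSAL-CAD-030', 'COLG-CAD-533', 'COLG-CAD-188']",
    "['COLG-CAD-188']",
    "['CSAL-CAD-030']",
    "['EPHAG-JAE-004']",
    "['COLG-CAD-188', 'CEM-SEV-004']",
    "['COLG-CAD-188', 'COLG-CAD-406']",
]
for item, n in frequent_items(sample):
    print("%s: %d" % (item, n))
```

The first line printed is `COLG-CAD-188: 4`, followed by the two items that occur twice.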
answered 2012-04-27T18:50:48.583
0

Here is a complete script that does exactly what you asked, using regular expressions:

from collections import defaultdict
import re

myarch = 'C:/code/test5.txt'   # path to your input file
mydict = defaultdict(int)

with open(myarch) as f:
    for line in f:
        # raw string so that \S reaches the regex engine unchanged
        codes = re.findall(r"'(\S*)'", line)
        for key in codes:
            mydict[key] += 1

out = []
# items() works on both Python 2 and 3; iteritems() exists only on Python 2
for key, value in mydict.items():
    if value > 1:
        text = "['%s'], %s" % (key, value)
        out.append(text)

#save to a file
with open('C:/code/fileout.txt', 'w') as fo:
    fo.write('\n'.join(out))

This can be simplified to:

from collections import defaultdict
import re

myarch = 'C:/code/test5.txt'
mydict = defaultdict(int)

with open(myarch) as f:
    for line in f:
        for key in re.findall(r"'(\S*)'", line):
            mydict[key] += 1

out = ["['%s'], %s" % (key, value) for key, value in mydict.items() if value > 1]

#save to a file
with open('C:/code/fileout.txt', 'w') as fo:
    fo.write('\n'.join(out))
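One caveat about the pattern: `\S*` relies on the space after each comma to separate items. The sketch below, assuming Python 3, compares it with an alternative pattern `'([^']*)'` that stops at the next quote instead. The names `LOOSE`, `STRICT`, `spaced` and `packed` are made up for this illustration, and the line without spaces is a hypothetical input that does not appear in the question.

```python
import re

LOOSE = re.compile(r"'(\S*)'")      # pattern used in the script above
STRICT = re.compile(r"'([^']*)'")   # alternative: anything except a quote

spaced = "['COLG-CAD-188', 'CEM-SEV-004']"
packed = "['COLG-CAD-188','CEM-SEV-004']"  # hypothetical line without spaces

print(LOOSE.findall(spaced))   # ['COLG-CAD-188', 'CEM-SEV-004']
print(LOOSE.findall(packed))   # ["COLG-CAD-188','CEM-SEV-004"]
print(STRICT.findall(packed))  # ['COLG-CAD-188', 'CEM-SEV-004']
```

For input formatted like the question's sample, both patterns give the same result, so the stricter pattern only matters if the file may contain lists written without spaces.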
answered 2012-04-27T19:13:58.697