python - How to search for, count, and save words?

Question

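I am trying to identify a specific word and then count it. I need to save the count for each identifier.

For example,

risk risk risk free interest rate

asterisk risk risk

market risk risk [risk

A file contains the text above, and I need to count "risk" but not "asterisk". I also need to treat "[risk" as "risk". This is what I have so far. However, it returns counts for "asterisk" and "[risk" as well as for "risk". I don't need to count "asterisk", only "risk", including "[risk". I tried using a regular expression but kept getting errors. Also, I'm a beginner in Python. If anyone has any ideas, please help me! ^^ Thanks.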

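from collections import defaultdict
word_dict = defaultdict(int)

# mylist holds the lines of the file
for line in mylist:
    words = line.lower().split()
    for word in words:
        word_dict[word] += 1

# prints every key containing 'risk', so 'asterisk' and '[risk' show up too
for word in word_dict:
    if 'risk' in word:
        print(word, word_dict[word])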

score 2 · Accepted Answer

Give regex another try. Match the string 'risk' surrounded by word boundaries:

import re
re.findall(r'\brisk\b', 'risk risk') ## 2 matches
re.findall(r'\brisk\b', 'risk risk riskrisk') ## 2 matches
re.findall(r'\brisk\b', 'risk risk riskrisk [risk') ## 3 matches
re.findall(r'\brisk\b', 'risk risk riskrisk [risk asterisk') ## 3 matches
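
The pattern above could be applied line by line to keep one count per line, which is roughly what the question asks for. This is a minimal sketch under assumptions: the `lines` list stands in for the file contents, and `RISK_RE` and `count_risk` are illustrative names that do not appear in the original answer.

```python
import re

# Compile once; \b keeps 'asterisk' and 'riskrisk' from matching.
RISK_RE = re.compile(r'\brisk\b', re.IGNORECASE)

def count_risk(line):
    """Return how many times the standalone word 'risk' occurs in line."""
    return len(RISK_RE.findall(line))

# Sample text from the question (assumed: one string per line of the file).
lines = [
    "risk risk risk free interest rate",
    "asterisk risk risk",
    "market risk risk [risk",
]

# Map line number -> count, then add the per-line counts up.
counts = {i: count_risk(line) for i, line in enumerate(lines, start=1)}
total = sum(counts.values())
print(counts)
print(total)
```

Note that `\b` also treats a hyphen as a boundary, so a token such as "risk-free" would be counted as well.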

score 1 · Accepted Answer

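Take a pipeline approach. By that I mean: before adding words to the dictionary, apply whatever transformations the text needs so that the counts come out right.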

from collections import defaultdict

word_dict = defaultdict(int) # missing keys start at 0, so += 1 works

for line in mylist:
    words = line.strip().lower().split() # the strip gets rid of new lines
    for word in words:
        # the strip here will strip away any surrounding punctuation.
        # add any other symbols to the string that you need
        # the key insight here, is you get rid of extra stuff BEFORE inserting
        # into the dictionary
        word_dict[word.strip('[/@#$%')] += 1

for word in word_dict:
    print(word, word_dict[word])

# to just see the count for risk:
print(word_dict['risk'])

The fact that it also counts the word "asterisk" is fine, as long as your word "risk" gets counted.
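
One possible way to generalise the cleanup step is to strip every ASCII punctuation character rather than a hand-picked set. The sketch below assumes that `string.punctuation` covers the symbols that occur in the file. The `normalize` and `count_words` helpers and the `sample` list are hypothetical names introduced here for illustration.

```python
import string
from collections import Counter

def normalize(word):
    """Lowercase a token and strip ASCII punctuation from both ends."""
    return word.lower().strip(string.punctuation)

def count_words(lines):
    """Count cleaned tokens across all lines."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            cleaned = normalize(token)
            if cleaned:  # skip tokens that consisted only of punctuation
                counts[cleaned] += 1
    return counts

# Sample text from the question (assumed: one string per line of the file).
sample = [
    "risk risk risk free interest rate",
    "asterisk risk risk",
    "market risk risk [risk",
]

word_counts = count_words(sample)
print(word_counts["risk"])
print(word_counts["asterisk"])
```

Because "asterisk" is stored under its own key, it never inflates the count for "risk".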

score 0 · Accepted Answer

You can try this snippet:

import shlex

words = shlex.split("risk risk risk free interest rate")
word_count = len([word for word in words if word == "risk" or word =="[risk"])
print(word_count)

score 0 · Accepted Answer

I think you need to define more rigorously which criteria count as "risk" and which don't. However, I would use a Counter:

from collections import Counter
c = Counter()
with open(yourfile) as f:
    for line in f:
        c += Counter(line.split())

Now you need to create a function that determines whether a word should count as "risk":

def is_risk(word):
    w = word.lower()
    return 'risk' in w and w!='asterisk'

Now just sum the elements corresponding to those keys:

sum( c[k] for k in c if is_risk(k) )
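
Put together, the pieces above might look like the following end-to-end sketch. It assumes the question's sample text and writes it to a temporary file that stands in for `yourfile`. The `sample`, `path`, and `risk_total` names are illustrative and not part of the original answer.

```python
import os
import tempfile
from collections import Counter

def is_risk(word):
    w = word.lower()
    return 'risk' in w and w != 'asterisk'

# Write the question's sample text to a temporary file (stand-in for yourfile).
sample = (
    "risk risk risk free interest rate\n"
    "asterisk risk risk\n"
    "market risk risk [risk\n"
)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

c = Counter()
try:
    with open(path) as f:
        for line in f:
            c.update(line.split())  # same effect as c += Counter(line.split())
finally:
    os.remove(path)  # clean up the temporary file

# Add up the counts of every key that is_risk accepts ('risk' and '[risk' here).
risk_total = sum(n for word, n in c.items() if is_risk(word))
print(risk_total)
```

Because `is_risk` is a substring test, it would also accept tokens such as "risky" or "riskrisk", which is one reason the criteria need to be pinned down first.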

score -2 · Accepted Answer


So you count

'\n' + risk + '\n'
'\n' + risk + ' '
' ' + risk + '\n'
' ' + risk + ' '
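
The four cases listed above amount to "risk delimited by whitespace on both sides". One way to express that in a single pattern is with lookarounds, shown here as a sketch rather than as the answerer's own method. The `WS_RISK` name and the `text` sample are assumptions made for illustration.

```python
import re

# 'risk' with no non-whitespace character directly before or after it.
# Unlike a literal ' risk ' search, this also matches at the very start
# or end of the text.
WS_RISK = re.compile(r'(?<!\S)risk(?!\S)')

# Sample text from the question, assumed to be the whole file as one string.
text = (
    "risk risk risk free interest rate\n"
    "asterisk risk risk\n"
    "market risk risk [risk\n"
)

matches = WS_RISK.findall(text)
print(len(matches))
```

Under this rule "[risk" is not counted, because "[" sits directly in front of the word. To count it as the question asks, the bracket would have to be stripped first or the pattern loosened.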

Answered 2012-08-31T13:39:20.650
