I have a .txt file (example):
A professional is a person who is engaged in a certain activity, or occupation, for gain or compensation as means of livelihood; such as a permanent career, not as an amateur or pastime. Due to the personal and confidential nature of many professional services, and thus the necessity to place a great deal of trust in them, most professionals are subject to strict codes of conduct enshrining rigorous ethical and moral obligations.
How do I count the occurrences of the word "professional"? (Using NLTK - is that the best choice?)
text_file = open("text.txt", "r+b")
It can be solved in one line (plus the import):
>>> from collections import Counter
>>> Counter(w.lower() for w in open("text.txt").read().split())['professional']
2
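Note that split() leaves punctuation stuck to the neighbouring word, so an occurrence such as "professional," would not be counted. If that matters for your file, a regex-based variant is one option (just a sketch, not part of the answer above):
>>> import re
>>> from collections import Counter
>>> Counter(re.findall(r"\w+", open("text.txt").read().lower()))['professional']
2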
I changed my answer to better reflect your wishes:
from nltk import word_tokenize

with open('file_path') as f:
    content = f.read()

# we will use your text example instead:
content = "A professional is a person who is engaged in a certain activity, or occupation, for gain or compensation as means of livelihood; such as a permanent career, not as an amateur or pastime. Due to the personal and confidential nature of many professional services, and thus the necessity to place a great deal of trust in them, most professionals are subject to strict codes of conduct enshrining rigorous ethical and moral obligations."

def Count_Word(word, data):
    c = 0
    tokens = word_tokenize(data)
    for token in tokens:
        token = token.lower()
        # this plural check is dangerous if you are trying to find a word that ends with an 's'
        token = token[:-1] if token[-1] == 's' else token
        if token == word:
            c += 1
    return c

print(Count_Word('professional', content))
# Output: 3
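As the comment above warns, the naive plural check also mangles search words that themselves end in 's'. A purely illustrative call against the same Count_Word and content:
# "services" itself ends in 's', so every "services" token is trimmed to
# "service" before the comparison and the count comes out as 0.
print(Count_Word('services', content))  # 0
print(Count_Word('service', content))   # 1 - the trimmed form matches instead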
Here is a modified version of the method:
def Count_Word(word, data, leading=[], trailing=["'s", "s"]):
    c = 0
    tokens = word_tokenize(data)
    for token in tokens:
        token = token.lower()
        for lead in leading:
            if token.startswith(lead):
                token = token.partition(lead)[2]
        for trail in trailing:
            if token.endswith(trail):
                token = token.rpartition(trail)[0]
        if token == word:
            c += 1
    return c
I have added optional parameters: lists of leading or trailing parts of the word that you want trimmed off in order to find it. Currently I have only put in the defaults of 's and s, but if you find that others suit you, you can always add them. If the lists start getting long, you can make them constants.
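For example, a purely illustrative call (the extra "ly" suffix is just an assumption for the demonstration, not one of the defaults):
# Adding "ly" to the trailing list makes the adverb form count as well, so
# "professional", "professionals" and "professionally" would all be tallied.
print(Count_Word('professional', content, trailing=["'s", "s", "ly"]))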
You could simply tokenize the string and then search through all the tokens... but that is just one way. There are many others...
import nltk

s = text_file.read()
tokens = nltk.word_tokenize(s)
counter = 0
for token in tokens:
    toke = token
    if token[-1] == "s":
        toke = token[0:-1]
    if toke.lower() == "professional":
        counter += 1
print(counter)
from collections import Counter

def stem(word):
    if word[-1] == 's':
        word = word[:-1]
    return word.lower()

print(Counter(map(stem, open(filename).read().split())))
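If you only need the figure for one word, you can index the resulting Counter directly (a small sketch reusing the stem helper and the filename variable from the snippet above):
counts = Counter(map(stem, open(filename).read().split()))
# stem() folds the plural into the singular, so this also includes "professionals"
print(counts['professional'])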
The answer to your question depends on exactly what you want to count and how much effort you want to put into normalization. I see at least three ways to approach it, depending on your goal.
In the code below, I define three functions which return a dictionary of counts for all the words occurring in an input text.
import nltk
from collections import defaultdict

text = "This is my sample text."
lower = text.lower()
tokenized = nltk.word_tokenize(lower)

ps = nltk.stem.PorterStemmer()
wnlem = nltk.stem.WordNetLemmatizer()

# The Porter stemming algorithm tries to remove all suffixes from a word.
# There are better stemming algorithms out there, some of which may be in NLTK.
def StemCount(token_list):
    countdict = defaultdict(int)
    for token in token_list:
        stem = ps.stem(token)
        countdict[stem] += 1
    return countdict

# Lemmatizing is a little less brutal than stemming--it doesn't try to relate
# words across parts of speech so much. You do, however, need to part-of-speech tag
# the text before you can use this approach.
def LemmaCount(token_list):
    # Where mytagger is a part-of-speech tagger you've trained
    # (perhaps per http://nltk.sourceforge.net/doc/en/ch03.html)
    # using a simple tagset compatible with WordNet (i.e. all nouns become 'n', etc.)
    token_pos_tuples = mytagger.tag(token_list)
    countdict = defaultdict(int)
    for token_pos in token_pos_tuples:
        lemma = wnlem.lemmatize(token_pos[0], token_pos[1])
        countdict[lemma] += 1
    return countdict

# Doesn't do anything fancy. Just counts the number of occurrences for each unique
# string in the input.
def SimpleCount(token_list):
    countdict = defaultdict(int)
    for token in token_list:
        countdict[token] += 1
    return countdict
To exemplify the differences between the PorterStemmer and the WordNetLemmatizer, consider the following:
>>> wnlem.lemmatize('professionals','n')
'professional'
>>> ps.stem('professionals')
'profession'
using the wnlem and ps defined in the code snippet above.
Depending on your application, something like SimpleCount(token_list) might work just fine.
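For instance, a brief sketch using the tokenized list and the functions defined above (note the short sample text does not actually contain the word, so point text at your own file first):
counts = SimpleCount(tokenized)
print(counts['professional'])

# With the stemming approach, stem the query the same way the corpus was stemmed:
stem_counts = StemCount(tokenized)
print(stem_counts[ps.stem('professional')])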