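I'm trying to write a program that runs through a list of mental health terms, looks at research abstracts, and counts how many times each word or phrase appears. I can get this working for single words, but I'm struggling with multi-word terms. I also tried NLTK ngrams, but because the terms in the mental health list vary in length (i.e., not every term is a bigram or trigram), I couldn't get that to work either.
To be clear, I know that splitting the abstract into individual words means only single words get counted; I'm just stuck on how to handle terms with a varying number of words when counting them in the abstracts.
Thanks!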
from collections import Counter

abstracts = ['This is a mental health abstract about anxiety and bipolar '
             'disorder as well as other things.',
             'While this abstract is not about ptsd or any trauma-related '
             'illnesses, it does have a mental health focus.']

for x2 in abstracts:
    mh_terms = ['bipolar disorder', 'anxiety', 'substance abuse disorder',
                'ptsd', 'schizophrenia', 'mental health']
    c = Counter(s.lower().replace('.', '') for s in x2.split())
    for term in mh_terms:
        term = term.replace(',', '')
        term = term.replace('.', '')
        xx = (term, c.get(term, 0))
    mh_total_occur = sum(c.get(v, 0) for v in mh_terms)
    print(mh_total_occur)
In my example, both abstracts get a count of 1, but I want a count of 2.
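For reference, here is a minimal sketch of the kind of phrase-level counting I'm after. It assumes that searching the whole abstract text (rather than its split tokens) with a word-boundary regular expression is acceptable. The helper name `count_terms` is hypothetical, made up for illustration, and this is not necessarily the best approach. Note that on the sample data above, counting this way gives 3 for the first abstract ('mental health', 'anxiety', 'bipolar disorder') and 2 for the second ('ptsd', 'mental health').

```python
import re

def count_terms(text, terms):
    """Return {term: occurrences} using whole-word, case-insensitive matching.

    Hypothetical helper: it searches the full text, so a term may contain
    any number of words.
    """
    counts = {}
    for term in terms:
        # \b anchors keep a term from matching inside a longer word;
        # re.escape treats any punctuation in the term literally.
        pattern = r'\b' + re.escape(term) + r'\b'
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

abstracts = ['This is a mental health abstract about anxiety and bipolar '
             'disorder as well as other things.',
             'While this abstract is not about ptsd or any trauma-related '
             'illnesses, it does have a mental health focus.']
mh_terms = ['bipolar disorder', 'anxiety', 'substance abuse disorder',
            'ptsd', 'schizophrenia', 'mental health']

for abstract in abstracts:
    counts = count_terms(abstract, mh_terms)
    # Total occurrences across all terms, followed by the per-term breakdown.
    print(sum(counts.values()), counts)
```

Because the pattern is built from the whole term, single words and multi-word phrases go through the same code path, so no n-gram size has to be chosen up front.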