python - How do I create a function in Python that scores ngrams before unigrams?

Question

Suppose I want to score text using a dictionary called dictionary:

import pandas as pd

text = "I would like to reduce carbon emissions"

dictionary = pd.DataFrame({'text': ["like", "reduce", "carbon", "emissions", "reduce carbon emissions"], 'score': [1, -1, -1, -1, 1]})

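I want to write a function that adds up the score of every term in dictionary that appears in text. However, the rule needs a nuance: ngrams take priority over unigrams.

Specifically, if I sum the unigrams from dictionary that occur in text, I get 1+(-1)+(-1)+(-1) = -2, because like = 1, reduce = -1, carbon = -1, emissions = -1. This is not what I want. The function must account for the following: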

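First consider the ngrams (reduce carbon emissions in the example). If the set of matching ngrams is not empty, assign them their corresponding values; otherwise, if the set of ngrams is empty, consider the unigrams;
If the set of ngrams is not empty, ignore the single words (unigrams) that are part of the selected ngrams (e.g. ignore "reduce", "carbon" and "emissions", which are already covered by "reduce carbon emissions").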

Such a function should give me this output: +2, because like = 1 plus reduce carbon emissions = 1.
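For reference, the unigram-only sum described above can be reproduced with a short sketch. It assumes a plain whitespace split is good enough for tokenizing; the names scores, tokens and naive_total are only illustrative.

```python
import pandas as pd

text = "I would like to reduce carbon emissions"
dictionary = pd.DataFrame({
    'text': ["like", "reduce", "carbon", "emissions", "reduce carbon emissions"],
    'score': [1, -1, -1, -1, 1],
})

# Naive approach: split on whitespace and add up every single-word entry found.
# The trigram can never match here because each token is a single word.
scores = dict(zip(dictionary['text'], dictionary['score']))
tokens = text.split()
naive_total = sum(scores.get(token, 0) for token in tokens)

print(naive_total)  # -2, not the desired +2
```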

I am quite new to Python and I am stuck. Can anyone help me with this?

Thanks!

score 1 · Accepted Answer

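I would sort the keywords by length in descending order. Since re tries the alternatives of a pattern from left to right, this guarantees that ngrams are matched before unigrams: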

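import re

# Longest keywords first, so the alternation tries ngrams before unigrams
pat = '|'.join(map(re.escape, sorted(dictionary['text'], key=len, reverse=True)))

# \b restricts matches to whole words
found = re.findall(fr'\b({pat})\b', text)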

Output:

['like', 'reduce carbon emissions']
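To see why the sort matters, here is a small comparison of the same pattern built with and without it. This is only a sketch using a plain list (keywords) in place of the DataFrame column; the variable names are made up for the illustration.

```python
import re

text = "I would like to reduce carbon emissions"
keywords = ["like", "reduce", "carbon", "emissions", "reduce carbon emissions"]

# Unsorted: "reduce" appears earlier in the alternation, so it wins at that position
unsorted_pat = '|'.join(map(re.escape, keywords))
unsorted_found = re.findall(fr'\b({unsorted_pat})\b', text)

# Longest first: the trigram is tried before any of its component words
sorted_pat = '|'.join(map(re.escape, sorted(keywords, key=len, reverse=True)))
sorted_found = re.findall(fr'\b({sorted_pat})\b', text)

print(unsorted_found)  # ['like', 'reduce', 'carbon', 'emissions']
print(sorted_found)    # ['like', 'reduce carbon emissions']
```

Once the trigram has matched, the scan resumes after it, so the words inside it are not counted a second time.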

To get the expected output:

scores = dictionary.set_index('text')['score']

scores.reindex(found).sum()
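Since the question asks for a function, the steps can be wrapped up roughly like this. The name score_text is hypothetical, and the sketch assumes a non-empty dictionary whose columns are called 'text' and 'score', with no duplicate terms; matching is case-sensitive.

```python
import re
import pandas as pd

def score_text(text, dictionary):
    """Sum the scores of dictionary terms found in text, longest terms first."""
    # Hypothetical helper: assumes 'text' and 'score' columns with unique terms
    keywords = sorted(dictionary['text'], key=len, reverse=True)
    pat = '|'.join(map(re.escape, keywords))
    found = re.findall(fr'\b({pat})\b', text)
    scores = dictionary.set_index('text')['score']
    # reindex looks up each matched term; a term matched twice is counted twice
    return int(scores.reindex(found).sum())

text = "I would like to reduce carbon emissions"
dictionary = pd.DataFrame({
    'text': ["like", "reduce", "carbon", "emissions", "reduce carbon emissions"],
    'score': [1, -1, -1, -1, 1],
})

print(score_text(text, dictionary))  # 2
```

If a text contains none of the ngrams, the same function falls back to the unigrams on its own, because the shorter alternatives are still part of the pattern.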
