python - How to find a specific bigram using nltk in Python?

Question

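I am currently working with nltk.book in Python and would like to find the frequency of a specific bigram. I know there is the bigrams() function, which returns every pair of adjacent words in a text, as in this code: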

    >>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
    [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
    >>>

But what if I am only searching for a specific one, such as "wish for"? So far I could not find anything about this in the nltk documentation.

score 0 · Accepted Answer

If you can get the bigrams back as a list of tuples, you can use `in`:

    >>> bgrms = [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
    >>> ('more', 'is') in bgrms
    True
    >>> ('wish', 'for') in bgrms
    False
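If many bigrams are checked against the same text, a set may be a better fit than a list, since each `in` test then becomes a hash lookup instead of a scan over every tuple. The following is only a minimal sketch: it pairs tokens with `zip` as a stand-in for `nltk.bigrams` so that it runs without NLTK installed, and the names `bigram_set`, `has_more_is` and `has_wish_for` are made up for illustration.

```python
tokens = ['more', 'is', 'said', 'than', 'done']

# Pair every token with its successor (stand-in for nltk.bigrams),
# then keep the distinct pairs in a set for fast membership tests.
bigram_set = set(zip(tokens, tokens[1:]))

has_more_is = ('more', 'is') in bigram_set
has_wish_for = ('wish', 'for') in bigram_set

print(has_more_is, has_wish_for)
```

A set only records whether a bigram occurs at all, so it answers the membership question but not the frequency question.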

Then, if you are looking for the frequency of a specific bigram, it may help to build a Counter:

    from collections import Counter

    from nltk import bigrams

    bgrms = list(bigrams(['more', 'is', 'said', 'than', 'wish', 'for', 'wish', 'for']))

    # Count how often each bigram occurs.
    bgrm_counter = Counter(bgrms)

    # Look up the count of one specific bigram.
    # Indexing a Counter returns 0 for a bigram that never occurs.
    print(bgrm_counter[('wish', 'for')])

Output:

2

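Finally, if you want this frequency relative to the total number of bigrams in the text, divide the count by the length of the bigram list: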

    # Relative frequency: the bigram's count divided by the number of bigrams in `bgrms`.
    print(bgrm_counter[('wish', 'for')] / len(bgrms))

Output:

0.2857142857142857
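The two steps above could be wrapped in one small helper. This is a sketch rather than anything NLTK provides: `bigram_frequency` is a hypothetical name, and `zip` is used in place of `nltk.bigrams` so the example runs with the standard library alone.

```python
from collections import Counter


def bigram_frequency(tokens, target):
    """Return (count, relative frequency) of `target` among the bigrams of `tokens`."""
    # Pair each token with its successor; stand-in for list(nltk.bigrams(tokens)).
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        # Fewer than two tokens means there are no bigrams at all.
        return 0, 0.0
    # Indexing a Counter yields 0 for a bigram that never occurs.
    count = Counter(pairs)[target]
    return count, count / len(pairs)


tokens = ['more', 'is', 'said', 'than', 'wish', 'for', 'wish', 'for']

count, rel = bigram_frequency(tokens, ('wish', 'for'))
print(count, rel)

absent_count, absent_rel = bigram_frequency(tokens, ('said', 'done'))
print(absent_count, absent_rel)
```

Because the count defaults to 0 for a bigram that is absent, the division never fails on a missing key, and the empty-input check avoids dividing by zero.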
