python - How to tabulate a conditional frequency distribution of collocations across texts

Question

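I have several texts in which I have found collocations, and now I want to create a table showing how many times each collocation occurs in each text of the corpus.

When I generate a table or a plot from the ConditionalFreqDist, it shows only 1 match for each collocation in each text.

I am new to Python and am obviously doing something wrong... please help.

This is how I get the collocations: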

>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> eng_corpus_root = r'D:\Corpus\EN'
>>> eng_corpus = PlaintextCorpusReader(eng_corpus_root, '.*')
>>> # Below: this is the script that imports corpora for 4 languages from a local folder
>>> from Import4Corpuses3 import *
>>> # Below: tengc_low is the variable for English corpus (60 texts) as text objects, all letters changed to lowercase
>>> tengc_low.collocation_list()
['hong kong', 'united states', 'getty images', 'european union', 'prime minister', 'northern ireland', 'boris johnson', 'cape dorset', 'extinction rebellion', 'extradition bill', 'cease fire', 'islamic state', 'recep tayyip', 'turkish backed', 'vice president', 'mike pence', 'tayyip erdogan', 'twitter com', 'pic twitter', 'anthony kwan']

And this is how I try to build a ConditionalFreqDist of collocations and text names:

>>> cfd = nltk.ConditionalFreqDist(
...     (textname, collocation)
...     for textname in eng_corpus.fileids()
...     for collocation in Text(eng_corpus.words()).collocation_list(num=100))

Then, as mentioned above, I get "1" for every collocation in every text.

How can I get the correct distribution?

Any advice would be much appreciated.
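
One possible reading of the result: the generator passed to ConditionalFreqDist yields each collocation string exactly once per file ID, and eng_corpus.words() is called without a file ID, so the collocations come from the whole corpus rather than from the individual text. Every (textname, collocation) pair is therefore seen once, which would explain the uniform count of 1. A minimal sketch of one way to count real occurrences follows. It assumes each collocation is a pair of adjacent tokens joined by a space, and it uses a toy dict in place of the corpus. The names collocation_pairs, texts, and collocations are hypothetical and not part of NLTK.

```python
from collections import Counter


def collocation_pairs(texts, collocations):
    """Yield (textname, collocation) once per occurrence in that text.

    texts: dict mapping a text name to its list of lowercase tokens
           (hypothetical stand-in for eng_corpus.words(fileid)).
    collocations: iterable of 'w1 w2' strings, e.g. the strings shown
                  in the collocation_list() output above.
    """
    wanted = set(collocations)
    for textname, tokens in texts.items():
        # Walk over adjacent token pairs and keep those that are known collocations.
        for w1, w2 in zip(tokens, tokens[1:]):
            bigram = f"{w1} {w2}"
            if bigram in wanted:
                yield textname, bigram


# Toy data (assumption): two tiny "texts" instead of the 60-file corpus.
texts = {
    "a.txt": "hong kong police said hong kong protesters left".split(),
    "b.txt": "the prime minister visited hong kong".split(),
}
collocations = ["hong kong", "prime minister"]

counts = Counter(collocation_pairs(texts, collocations))
print(counts[("a.txt", "hong kong")])       # 2
print(counts[("b.txt", "prime minister")])  # 1
```

Because the generator yields one pair per occurrence, the same pairs could be passed to nltk.ConditionalFreqDist(...) instead of Counter, after which cfd.tabulate() should show per-text counts. With the real corpus, texts could be built from eng_corpus.words(fileid) for each file ID, lowercased to match the collocation strings.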
