
I want to count the frequency of the three words before and after a specific word in a text file that has already been converted to tokens.

import nltk
from collections import Counter

with open('dracula.txt', 'r', encoding="ISO-8859-1") as textfile:
    text_data = textfile.read().replace('\n', ' ').lower()

tokens = nltk.word_tokenize(text_data)
text = nltk.Text(tokens)
grams = nltk.ngrams(tokens, 4)
freq = Counter(grams)
freq.most_common(20)

I don't know how to search for the string 'dracula' as a filter word. I have also tried:

text.collocations(num=100)
text.concordance('dracula')

The desired output would look like counts of this sort: three words before 'dracula', with sorted counts

(('and', 'he', 'saw', 'dracula'), 4),
(('one', 'cannot', 'see', 'dracula'), 2)

three words after 'dracula', with sorted counts

(('dracula', 'and', 'he', 'saw'), 4),
(('dracula', 'one', 'cannot', 'see'), 2)

trigams containing 'dracula' in the middle, with sorted counts

(('count', 'dracula', 'saw'), 4),
(('count', 'dracula', 'cannot'), 2)

Thanks in advance for any help.


1 Answer


Once you have the frequency information in tuple form, as you do, you can simply filter out the word you are looking for with an if statement. Here it is using Python's list comprehension syntax:

import nltk
from collections import Counter

with open('dracula.txt', 'r', encoding="ISO-8859-1") as textfile:
    text_data = textfile.read().replace('\n', ' ').lower()
    # pulled text from here: https://archive.org/details/draculabr00stokuoft/page/n6

tokens = nltk.word_tokenize(text_data)
text = nltk.Text(tokens)
grams = nltk.ngrams(tokens, 4)
freq = Counter(grams)

dracula_last = [item for item in freq.most_common() if item[0][3] == 'dracula']
dracula_first = [item for item in freq.most_common() if item[0][0] == 'dracula']
dracula_second = [item for item in freq.most_common() if item[0][1] == 'dracula']
# etc.

This produces lists with 'dracula' in different positions. Here is what dracula_last looks like:

[(('the', 'castle', 'of', 'dracula'), 3),
 (("'s", 'journal', '243', 'dracula'), 1),
 (('carpathian', 'moun-', '2', 'dracula'), 1),
 (('of', 'the', 'castle', 'dracula'), 1),
 (('named', 'by', 'count', 'dracula'), 1),
 (('disease', '.', 'count', 'dracula'), 1),
 ...]
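The same position-based filter covers the trigram case you asked about: build trigrams instead of 4-grams and keep only those where 'dracula' is the middle element. Here is a minimal, self-contained sketch on a toy token list (hypothetical data standing in for the tokenized novel), with a stdlib sliding window in place of nltk.ngrams so it runs without the text file:

```python
from collections import Counter

def ngrams(tokens, n):
    # Sliding window of length n over the token list
    # (stdlib-only stand-in for nltk.ngrams).
    return zip(*(tokens[i:] for i in range(n)))

# Toy tokens, standing in for nltk.word_tokenize(text_data).
tokens = ['count', 'dracula', 'saw', 'her', 'and',
          'count', 'dracula', 'saw', 'him', 'then',
          'count', 'dracula', 'cannot', 'see']

freq = Counter(ngrams(tokens, 3))

# Trigrams with 'dracula' in the middle position, most frequent first.
dracula_middle = [item for item in freq.most_common()
                  if item[0][1] == 'dracula']
print(dracula_middle)
# → [(('count', 'dracula', 'saw'), 2), (('count', 'dracula', 'cannot'), 1)]
```

On the real corpus you would keep nltk.ngrams(tokens, 3) and the same comprehension; only the index (item[0][1]) changes relative to the 4-gram filters above.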
answered 2019-02-01T16:27:30.717