python - NLTK: Finding contexts of size 2k for a word

Question

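I have a corpus and I have a word. For each occurrence of the word in the corpus, I want to get a list containing the k words before the word and the k words after it. I can do this algorithmically well enough (see below), but I wonder whether NLTK provides some functionality for this that I have missed?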

def sized_context(word_index, window_radius, corpus):
    """ Returns a list containing up to window_radius words to the left
    and to the right of word_index, not including the word at word_index.
    """

    max_length = len(corpus)

    # Clamp both borders so the window never runs off either end of the corpus.
    left_border = max(0, word_index - window_radius)
    right_border = min(max_length, word_index + 1 + window_radius)

    return corpus[left_border:word_index] + corpus[word_index+1: right_border]
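
As a rough sketch of the "for each occurrence" part described above, the same slicing can be wrapped in a loop over the corpus. The name all_contexts and the toy corpus are made up for illustration; this is plain Python, not an NLTK function.

```python
def all_contexts(word, k, corpus):
    """Return one context list per occurrence of `word` in `corpus`.

    Each context holds up to k tokens before and up to k tokens after
    the occurrence, excluding the occurrence itself.
    """
    contexts = []
    for i, token in enumerate(corpus):
        if token == word:
            left = corpus[max(0, i - k):i]    # clamp at the start of the corpus
            right = corpus[i + 1:i + 1 + k]   # slicing clamps at the end
            contexts.append(left + right)
    return contexts


corpus = "the cat sat on the mat near the door".split()
print(all_contexts("the", 2, corpus))
```

Occurrences near either end of the corpus simply get a shorter context, which matches the behaviour of sized_context.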

score 6 · Accepted Answer

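If you want to use nltk's functionality, you can use nltk's ConcordanceIndex. To base the width of the display on the number of words instead of the number of characters (the latter being the default for ConcordanceIndex.print_concordance), you can simply create a subclass of ConcordanceIndex with something like this: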

from nltk import ConcordanceIndex

class ConcordanceIndex2(ConcordanceIndex):
    def create_concordance(self, word, token_width=13):
        "Returns a list of contexts for @word with a context <= @token_width"
        half_width = token_width // 2
        contexts = []
        for i, token in enumerate(self._tokens):
            if token == word:
                start = i - half_width if i >= half_width else 0
                context = self._tokens[start:i + half_width + 1]
                contexts.append(context)
        return contexts

You can then obtain results like this:

>>> from nltk.tokenize import wordpunct_tokenize
>>> my_corpus = 'The gerenuk fled frantically across the vast valley, whereas the giraffe merely turned indignantly and clumsily loped away from the valley into the nearby ravine.'  # my corpus
>>> tokens = wordpunct_tokenize(my_corpus)
>>> c = ConcordanceIndex2(tokens)
>>> c.create_concordance('valley')  # returns a list of lists, since words may occur more than once in a corpus
[['gerenuk', 'fled', 'frantically', 'across', 'the', 'vast', 'valley', ',', 'whereas', 'the', 'giraffe', 'merely', 'turned'], ['and', 'clumsily', 'loped', 'away', 'from', 'the', 'valley', 'into', 'the', 'nearby', 'ravine', '.']]

The create_concordance method I created above is based on nltk's ConcordanceIndex.print_concordance method, which works like this:

>>> c = ConcordanceIndex(tokens)
>>> c.print_concordance('valley')
Displaying 2 of 2 matches:
                                  valley , whereas the giraffe merely turn
 and clumsily loped away from the valley into the nearby ravine .
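
If the goal is exactly k tokens on each side with the word itself left out, as the question asks, a variant that looks up positions via ConcordanceIndex.offsets() may be enough. This is only a sketch: token_contexts is a hypothetical helper name, and the toy token list stands in for a real corpus.

```python
from nltk.text import ConcordanceIndex


def token_contexts(tokens, word, k):
    """Return up to k tokens before and after each occurrence of `word`."""
    index = ConcordanceIndex(tokens)
    contexts = []
    # offsets() returns the token positions at which `word` occurs.
    for i in index.offsets(word):
        left = tokens[max(0, i - k):i]
        right = tokens[i + 1:i + 1 + k]
        contexts.append(left + right)
    return contexts


tokens = "the cat sat on the mat near the door".split()
print(token_contexts(tokens, "mat", 2))
```

Building the index once and reusing it for several words avoids rescanning the token list for every query.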

score 3 · Accepted Answer

The simplest, nltk-ish way to do this is with nltk.ngrams().

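import nltk

words = nltk.corpus.brown.words()
k = 5
# Recent NLTK releases take left_pad_symbol/right_pad_symbol rather than pad_symbol.
for ngram in nltk.ngrams(words, 2*k+1, pad_left=True, pad_right=True,
                         left_pad_symbol=" ", right_pad_symbol=" "):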
    if ngram[k].lower() == "settle":  # index k is the centre of a 2k+1 window
        print(" ".join(ngram))

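pad_left and pad_right ensure that every word gets looked at, including those fewer than k tokens from either end. This matters if you don't let your concordances span across sentences (hence: lots of boundary cases).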
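
To see what the padding does at a boundary, here is a small sketch on a three-token list. It assumes an NLTK 3.x release in which ngrams accepts left_pad_symbol and right_pad_symbol; the token list is invented for illustration.

```python
from nltk.util import ngrams

tokens = ["we", "settle", "here"]
k = 2

# With padding, a word close to either edge still appears at the centre
# (index k) of one window of length 2k+1.
windows = list(ngrams(tokens, 2 * k + 1,
                      pad_left=True, pad_right=True,
                      left_pad_symbol=" ", right_pad_symbol=" "))

hits = [w for w in windows if w[k].lower() == "settle"]
print(hits)
```

Without pad_left and pad_right, a three-token input yields no 5-grams at all, so "settle" would never be found.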

If you want to ignore punctuation in the window size, you can strip it before scanning:

import re

words = (w for w in nltk.corpus.brown.words() if re.search(r"\w", w))
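
For a quick check of what that filter keeps, here is a tiny sketch on a made-up token list: tokens with at least one word character survive, pure punctuation is dropped.

```python
import re

tokens = ["the", "valley", ",", "whereas", "U.S.", "."]

# Keep any token containing at least one word character (letter, digit, underscore).
words = [w for w in tokens if re.search(r"\w", w)]
print(words)
```

Note that tokens mixing letters and punctuation, such as "U.S.", are kept as they are.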
