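If you want to stay within nltk, you can use its ConcordanceIndex class. To make the width of the displayed context depend on the number of words rather than the number of characters (the latter is the default for ConcordanceIndex.print_concordance), you can simply create a subclass of ConcordanceIndex, something like this: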
from nltk import ConcordanceIndex

class ConcordanceIndex2(ConcordanceIndex):
    def create_concordance(self, word, token_width=13):
        "Returns a list of contexts for @word with a context <= @token_width"
        half_width = token_width // 2
        contexts = []
        for i, token in enumerate(self._tokens):
            if token == word:
                # Clamp the window start at 0 for matches near the beginning
                start = i - half_width if i >= half_width else 0
                # Slicing past the end is safe; it just yields a shorter context
                context = self._tokens[start:i + half_width + 1]
                contexts.append(context)
        return contexts
Then you can obtain results like this:
>>> from nltk.tokenize import wordpunct_tokenize
>>> my_corpus = 'The gerenuk fled frantically across the vast valley, whereas the giraffe merely turned indignantly and clumsily loped away from the valley into the nearby ravine.' # my corpus
>>> tokens = wordpunct_tokenize(my_corpus)
>>> c = ConcordanceIndex2(tokens)
>>> c.create_concordance('valley') # returns a list of lists, since words may occur more than once in a corpus
[['gerenuk', 'fled', 'frantically', 'across', 'the', 'vast', 'valley', ',', 'whereas', 'the', 'giraffe', 'merely', 'turned'], ['and', 'clumsily', 'loped', 'away', 'from', 'the', 'valley', 'into', 'the', 'nearby', 'ravine', '.']]
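If you would rather see aligned lines than nested lists, the token contexts can be joined back into strings. The following is only a minimal sketch: format_contexts is a hypothetical helper (not part of nltk), and it assumes the keyword's first occurrence in each context is the one to align on.

```python
def format_contexts(contexts, word):
    """Return one display line per context, with the keyword aligned.

    Hypothetical helper, not part of nltk. Assumes the first occurrence
    of `word` in each context is the match the window was built around.
    """
    split = []
    for context in contexts:
        i = context.index(word)  # position of the keyword in this window
        split.append((" ".join(context[:i]), " ".join(context[i:])))
    # Right-justify every left-hand side to the longest one
    pad = max((len(left) for left, _ in split), default=0)
    return [f"{left.rjust(pad)} {right}" for left, right in split]


# Shortened contexts in the same shape that create_concordance returns
contexts = [
    ["the", "vast", "valley", ",", "whereas"],
    ["loped", "away", "from", "the", "valley", "into"],
]
lines = format_contexts(contexts, "valley")
for line in lines:
    print(line)
```

Because the padding is computed from whole tokens, no word is ever cut in half, which is the point of counting words instead of characters.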
The create_concordance method I wrote above is based on nltk's ConcordanceIndex.print_concordance method, which works like this:
>>> c = ConcordanceIndex(tokens)
>>> c.print_concordance('valley')
Displaying 2 of 2 matches:
valley , whereas the giraffe merely turn
and clumsily loped away from the valley into the nearby ravine .
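Note how the first line above ends in "turn": a character-based window can stop in the middle of a word. The sketch below contrasts the two kinds of window in plain Python. It is only an illustration of the idea, not nltk's actual implementation, and both function names are made up for this example.

```python
def char_window(tokens, index, width=27):
    """Context around tokens[index], limited by characters (illustrative only)."""
    half = max((width - len(tokens[index])) // 2, 0)
    left = " ".join(tokens[:index])
    right = " ".join(tokens[index + 1:])
    # Keep the last `half` characters on the left, the first `half` on the right
    left = left[-half:] if half else ""
    return left, tokens[index], right[:half]


def token_window(tokens, index, token_width=3):
    """Context around tokens[index], limited by whole tokens."""
    half = token_width // 2
    start = max(index - half, 0)
    return tokens[start:index + half + 1]


tokens = "the giraffe merely turned indignantly".split()
by_chars = char_window(tokens, 1)    # right side is cut inside "turned"
by_tokens = token_window(tokens, 1)  # only whole tokens are returned
print(by_chars)
print(by_tokens)
```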