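You're right, it's hard to find documentation for the book.py module, so we have to get our hands dirty and look at the code (see here). Looking at book.py, to do the concordance and all the fancy stuff with the book module:
First, you have to put your raw texts into NLTK's corpus class; see "Creating a new corpus with NLTK" for details.
Second, you read the corpus words into NLTK's Text class. Then you can use the functions you see in http://nltk.org/book/ch01.html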
import os

from nltk.corpus import PlaintextCorpusReader
from nltk.text import Text

# For example, I create two example text files
text1 = '''
This is a story about a foo bar. Foo likes to go to the bar and his last name is also bar. At home, he kept a lot of gold chocolate bars.
'''
text2 = '''
One day, foo went to the bar in his neighborhood and was shot down by a sheep, a blah blah black sheep.
'''

# Creating the corpus (the directory must exist before writing to it)
corpusdir = './mycorpus/'
os.makedirs(corpusdir, exist_ok=True)
with open(corpusdir + 'text1.txt', 'w') as fout:
    fout.write(text1)
with open(corpusdir + 'text2.txt', 'w') as fout:
    fout.write(text2)

# Read the example corpus into NLTK's corpus class.
mycorpus = PlaintextCorpusReader(corpusdir, '.*')

# Read NLTK's corpus into NLTK's Text class,
# where the book-like concordance search is available
mytext = Text(mycorpus.words())
mytext.concordance('foo')
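Once you have a Text object, the other chapter 1 style functions work the same way. Here is a minimal sketch, assuming a recent NLTK 3.x; the token list and the name demo_text are made up for illustration, and the Text is built straight from a list of tokens so the snippet runs without any files on disk:

```python
from nltk.text import Text

# Hypothetical toy token list, used only for illustration
tokens = "foo went to the bar and foo likes the bar".split()
demo_text = Text(tokens)

# How often a word occurs in the text
foo_count = demo_text.count('foo')

# Position of the first occurrence of a word
first_bar = demo_text.index('bar')

# Frequency distribution over all tokens
bar_freq = demo_text.vocab()['bar']

# Prints every occurrence of the word with its surrounding context
demo_text.concordance('bar')
```

Functions that depend on extra NLTK data (collocations, for instance, needs the stopwords corpus) are left out on purpose.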
Note: you can use other NLTK CorpusReaders and even specify custom paragraph/sentence/word tokenizers and an encoding, but for now we'll stick with the defaults.
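If you do want to move away from the defaults, something along these lines should work. It is only a sketch: the temporary directory, the file name sample.txt, and the choice of RegexpTokenizer are my own assumptions for the example, while word_tokenizer and encoding are keyword arguments of PlaintextCorpusReader.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Hypothetical corpus location: a throwaway temp directory with one file
corpus_root = tempfile.mkdtemp()
with open(os.path.join(corpus_root, 'sample.txt'), 'w', encoding='utf-8') as fout:
    fout.write("Foo likes to go to the bar, and his last name is also bar.")

# Custom word tokenizer: keep runs of word characters, drop punctuation
word_only = RegexpTokenizer(r'\w+')

custom_corpus = PlaintextCorpusReader(
    corpus_root,
    r'.*\.txt',                # only pick up .txt files
    word_tokenizer=word_only,  # replaces the default word tokenizer
    encoding='utf-8',          # encoding used to read the files
)

custom_words = list(custom_corpus.words())
print(custom_words)
```

With this tokenizer the comma and the full stop no longer show up as separate tokens, which is the main visible difference from the default setup.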