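I'm analyzing some classic texts with NLTK, and I'm running into trouble tokenizing the text by sentence. For example, here's what I get for a snippet from Moby Dick: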
import nltk
sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')
'''
(Chapter 16)
A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but
that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"
'''
sample = 'A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'
print("\n-----\n".join(sent_tokenize.tokenize(sample)))
'''
OUTPUT
"A clam for supper?
-----
a cold clam; is THAT what you mean, Mrs.
-----
Hussey?
-----
" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs.
-----
Hussey?
-----
"
'''
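Given that Melville's syntax is a bit dated, I wouldn't expect perfection here, but NLTK ought to be able to handle terminal double quotes and titles like "Mrs." Since the tokenizer is the result of an unsupervised training algorithm, however, I can't figure out how to tinker with it.
Does anyone have recommendations for a better sentence tokenizer? I'd prefer a simple heuristic that I can hack rather than having to train my own parser.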
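To make the request concrete, here is a rough sketch of the kind of hackable heuristic I have in mind. It is a plain regex splitter, not Punkt and not anything NLTK ships. The function name `split_sentences`, the `ABBREVIATIONS` list, and the three rules are all my own assumptions: never break after a known title, keep closing quotes with the sentence they end, and never break when the next word starts in lowercase.

```python
import re

# Hypothetical abbreviation list (an assumption, not from NLTK); extend as needed.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "st", "capt"}

# Candidate boundary: terminal punctuation, optional closing quotes or
# brackets, then whitespace. The closing quote stays with the sentence it ends.
BOUNDARY = re.compile(r"""[.?!]+["')\]]*\s+""")


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences with a few hand-written rules."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        words = text[start:match.start()].split()
        last_word = words[-1].strip("\"'([") if words else ""
        punctuation = match.group().strip()
        # Rule 1: a lone period after a known title ("Mrs.") is not a boundary.
        if punctuation == "." and last_word.lower() in abbreviations:
            continue
        # Rule 2: if the next word starts in lowercase, the sentence continues
        # (covers 'supper? a cold clam' and 'Hussey?" says I').
        if end < len(text) and text[end].islower():
            continue
        sentences.append(text[start:end].strip())
        start = end
    # Whatever follows the last accepted boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

With these rules the Melville sample above comes back as a single sentence, which seems defensible for that passage. The lowercase rule is a deliberate trade-off: it will wrongly merge sentences that really do begin with a lowercase word.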