nltk.tokenize.sent_tokenize aggressively splits sentences at every period, but not every period marks the end of a sentence.

Here is a made-up sentence that gets incorrectly split into several sentences:

(see e.g. [5]), real-time i.e. reasoning etc. should be mentioned in ABC et al. for

>>> ['(see e.g.', '[5]), real-time i.e.', 'reasoning etc.', 'should be mentioned in ABC et al.', 'for']

What I need is to stop the tokenizer from splitting after certain abbreviations (for example e.g., i.e., etc., et al.). Is there a way to handle this in nltk?

Update: adding the abbreviations above to the PunktSentenceTokenizer abbreviations did not help at all. I still get the same result.

Here is the code snippet I tried:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
abbreviation = ['et al.', 'i.e.', 'e.g.', 'etc.']
punkt_param.abbrev_types = set(abbreviation)
tokenizer = PunktSentenceTokenizer(punkt_param)
tokenizer.tokenize('(see e.g. [5]), real-time i.e. reasoning etc. should be mentioned in ABC et al. for ')

Result:
['(see e.g.', '[5]), real-time i.e.', 'reasoning etc.', 'should be mentioned in ABC et al.', 'for']
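
One likely reason the snippet above has no effect: as far as I can tell from the Punkt source, each period-final token is lowercased and stripped of its final period before being looked up in abbrev_types, so entries such as 'e.g.' never match. The sketch below is a variant of the same snippet with the entries written in that form. It also assumes that 'et al.' is seen as two tokens, so only 'al' is listed. This is a sketch under those assumptions, not a confirmed fix, and behaviour may differ between nltk versions.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

punkt_param = PunktParameters()
# Assumption: entries must be lowercase and must NOT include the final period.
# "et al." is tokenized as "et" and "al.", so only "al" is registered here.
punkt_param.abbrev_types = {"e.g", "i.e", "etc", "al"}

# Passing a PunktParameters object instead of training text reuses it as-is.
tokenizer = PunktSentenceTokenizer(punkt_param)

text = "(see e.g. [5]), real-time i.e. reasoning etc. should be mentioned in ABC et al. for "
sentences = tokenizer.tokenize(text)
print(sentences)
```

With the abbreviations registered this way, the tokenizer should no longer treat those periods as sentence boundaries. Note that a tokenizer built from an empty PunktParameters knows no other abbreviations, so anything else (Dr., Mr., and so on) would have to be added to the same set.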