nlp - Extracting sentences from a text document

Question

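I have a text document from which I want to extract noun phrases. As a first step I extract the sentences, then run part-of-speech (POS) tagging on each sentence, and then chunk using the POS tags. I am using StanfordNLP for these tasks, and this is the code that extracts the sentences.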

Reader reader = new StringReader(text);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);

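I think DocumentPreprocessor runs POS tagging under the hood in order to extract sentences. However, I also run POS tagging in the second stage to extract noun phrases. In other words, POS tagging is performed twice, and because it is computationally expensive, I am looking for a way to perform it only once. Is there a way to run POS tagging just once and get both the sentences and the noun phrases?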

score 0 · Accepted Answer

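No, DocumentPreprocessor does not run a tagger when it loads text. (Note that it does have the ability to parse pre-tagged text, i.e. to read tags from a file in a form such as dog_NN.)

In short: you are not doing any extra work, so I suppose that's good news!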
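To illustrate the pre-tagged `word_TAG` format mentioned above, here is a minimal Python sketch. It is not Stanford's implementation: the function name `parse_tagged`, the `sep` parameter, and the sample sentence are hypothetical and chosen only for illustration.

```python
def parse_tagged(line, sep="_"):
    """Split whitespace-separated word_TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST separator, so a word that itself
        # contains "_" keeps its full text and only the final piece is the tag.
        word, found, tag = token.rpartition(sep)
        if not found:
            raise ValueError(f"token has no tag: {token!r}")
        pairs.append((word, tag))
    return pairs


# Hypothetical pre-tagged sentence in word_TAG form.
tagged = parse_tagged("The_DT dog_NN barks_VBZ ._.")
print(tagged)
```

Text that already carries tags like this can be read without running a tagger at all, which is the point of the note above.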

score 0 · Accepted Answer

I'm not sure. Try using nltk (a Python package):

import nltk
from nltk import word_tokenize

text = word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))
# [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
#  ('completely', 'RB'), ('different', 'JJ')]
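Once a sentence has been tagged, the same (word, tag) pairs can be reused for chunking, so tagging only needs to run once per sentence. The sketch below is a deliberately simplified, pure-Python chunker, not NLTK's or Stanford's: it treats a run of determiners and adjectives followed by one or more nouns as a noun phrase. The function name `noun_phrases` is hypothetical. For real use, NLTK's `nltk.RegexpParser` offers grammar-based chunking over tagged tokens.

```python
def noun_phrases(tagged):
    """Group (word, tag) pairs into simple noun phrases.

    A phrase is a run of DT/JJ tokens followed by one or more NN* tokens.
    This is a simplification for illustration, not a full chunk grammar.
    """
    phrases, current, has_noun = [], [], False
    for word, tag in tagged:
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag in ("DT", "JJ") and not has_noun:
            # Modifiers may only precede the first noun of a phrase.
            current.append(word)
        else:
            if has_noun:
                phrases.append(" ".join(current))
            # A DT/JJ seen after a noun starts a new candidate phrase.
            current = [word] if tag in ("DT", "JJ") else []
            has_noun = False
    if has_noun:
        phrases.append(" ".join(current))
    return phrases


# The tagged output from the answer above, reused without re-tagging.
sample = [("And", "CC"), ("now", "RB"), ("for", "IN"), ("something", "NN"),
          ("completely", "RB"), ("different", "JJ")]
print(noun_phrases(sample))
```

Because the chunker consumes tags that already exist, the expensive tagging step is not repeated.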
