regex - How to detect sentence boundaries with OpenNLP and stringi?

Question

I want to split a string into sentences:

library(NLP) # NLP_0.1-7  
string <- as.String("Mr. Brown comes. He says hello. i give him coffee.")

I want to show two different ways. The first comes from the package openNLP:

library(openNLP) # openNLP_0.2-5  

sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = "en")  
boundaries_sentences <- annotate(string, sentence_token_annotator)
string[boundaries_sentences]  

[1] "Mr. Brown comes."   "He says hello."     "i give him coffee."

The second comes from the package stringi:

library(stringi) # stringi_0.5-5  

stri_split_boundaries(string, opts_brkiter = stri_opts_brkiter('sentence'))

[[1]]  
 [1] "Mr. "                              "Brown comes. "                    
 [3] "He says hello. i give him coffee."

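After the second approach I have to post-process the result, either to remove the extra spaces or to split the new strings into sentences again. Can I tune the stringi function to improve the quality of the result?

On large data, openNLP is (very much) slower than stringi.
Is there a way to combine stringi (-> fast) and openNLP (-> quality)?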

score 9 · Accepted Answer

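Text boundary analysis (here: sentence boundary analysis) in ICU, and therefore in stringi, is governed by the rules described in Unicode UAX #29; see also the ICU User Guide on the topic. There we read:

[The Unicode rules] cannot detect cases such as "...Mr. Jones..."; more sophisticated tailoring would be required to detect such cases.

In other words, this cannot be done without a custom dictionary of non-stop words, which is in fact what openNLP implements. A few possible scenarios for bringing stringi into this task therefore include: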

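Use stri_split_boundaries and then write a function that decides which incorrectly split tokens should be joined back together.
Manually insert non-breaking spaces into the text, for example after the dot in etc., Mr., i.e. and so on (note that this is in fact required when preparing documents in LaTeX, otherwise the spaces between the words come out too large).
Incorporate a custom list of non-stop words into a regular expression and apply stri_split_regex.

And so on.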
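The first scenario (split, then re-join the false breaks) might be sketched roughly as follows. The abbreviation list and the helper name split_sentences are made up for illustration and are not part of stringi; treat this as a starting point under those assumptions, not a tested solution.

```r
library(stringi)

# Hypothetical abbreviation list (an assumption); extend it for your own corpus.
abbreviations <- c("Mr.", "Mrs.", "Ms.", "Dr.", "Prof.", "etc.", "i.e.", "e.g.")

# Hypothetical helper: rule-based split first, then glue false breaks together.
split_sentences <- function(text, abbrev = abbreviations) {
  # Step 1: UAX #29 split; each piece keeps its trailing whitespace.
  pieces <- stri_split_boundaries(
    text, opts_brkiter = stri_opts_brkiter(type = "sentence")
  )[[1]]
  out <- character(0)
  buffer <- ""
  for (piece in pieces) {
    buffer <- stri_c(buffer, piece)
    # Step 2: look at the last whitespace-delimited token collected so far.
    last_token <- stri_extract_last_regex(buffer, "\\S+")
    # If it is a known abbreviation, the break was false: keep accumulating.
    if (is.na(last_token) || !(last_token %in% abbrev)) {
      out <- c(out, stri_trim_both(buffer))
      buffer <- ""
    }
  }
  # Flush whatever is left (text that ended on an abbreviation).
  if (nzchar(buffer)) out <- c(out, stri_trim_both(buffer))
  out[nzchar(out)]
}

split_sentences("Mr. Brown comes. He says hello. i give him coffee.")
```

Note that this only repairs breaks that should not have happened. A break that the ICU rules miss, such as the one before the lowercase "i" in the example, is still missing afterwards and would need a second pass.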

score 5 · Accepted Answer

This may be a workable regex solution:

string <- "Mr. Brown comes. He says hello. i give him coffee."
stringi::stri_split_regex(string, "(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?|\\!)\\s")

## [[1]]
## [1] "Mr. Brown comes."   "He says hello."     "i give him coffee."
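For readers who find the pattern hard to parse, here is the same regex assembled piece by piece. The comments are my reading of each part, not something stated in the answer.

```r
library(stringi)

string <- "Mr. Brown comes. He says hello. i give him coffee."

pattern <- stri_c(
  "(?<!\\w\\.\\w.)",     # not right after word char, dot, word char, any char ("i.e.", "p.m.")
  "(?<![A-Z][a-z]\\.)",  # not right after capital, lowercase letter, dot ("Mr.", "Dr.")
  "(?<=\\.|\\?|\\!)",    # must come right after ".", "?" or "!"
  "\\s"                  # the single whitespace character that is consumed by the split
)

result <- stri_split_regex(string, pattern)[[1]]
result
```

The second lookbehind only inspects the three characters before the whitespace, so it covers two-letter titles such as "Mr." but not "Mrs." or "Prof.". It also suppresses a genuine break after any sentence that happens to end in a capitalised two-letter word.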

The regex does not perform well on:

string <- "Mr. Brown comes! He says hello. i give him coffee.  i will got at 5 p. m. eastern time.  Or somewhere in between"
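To see where it goes wrong, and one possible way to patch it, the sketch below runs the regex on that string and then adds one negative lookbehind per entry of a small abbreviation list. The list (abbrev) and the variable names are assumptions chosen for this example only.

```r
library(stringi)

string <- "Mr. Brown comes! He says hello. i give him coffee.  i will got at 5 p. m. eastern time.  Or somewhere in between"
pattern <- "(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?|\\!)\\s"

# The regex from the answer, unchanged: "p. m." is cut apart and the
# double spaces leave leading blanks on some pieces.
naive <- stri_split_regex(string, pattern)[[1]]

# Hypothetical list of tokens after which a break is never allowed.
abbrev <- c("p.", "m.")
# One fixed-length negative lookbehind per abbreviation, with the dots escaped.
guards <- stri_c(
  "(?<!\\b", stri_replace_all_fixed(abbrev, ".", "\\."), ")",
  collapse = ""
)
# Appending "+" turns the final "\\s" into "\\s+", so a run of spaces is
# consumed as a whole.
patched_pattern <- stri_c(guards, pattern, "+")
patched <- stri_split_regex(string, patched_pattern)[[1]]

naive
patched
```

The trade-off is that a sentence which really ends in "p. m." is no longer split there, so the list has to be chosen with the corpus in mind.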
