I want to split a string into sentences:
library(NLP) # NLP_0.1-7
string <- as.String("Mr. Brown comes. He says hello. i give him coffee.")
I'd like to show two different approaches. The first comes from the package openNLP:
library(openNLP) # openNLP_0.2-5
sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = "en")
boundaries_sentences <- annotate(string, sentence_token_annotator)
string[boundaries_sentences]
[1] "Mr. Brown comes." "He says hello." "i give him coffee."
The second comes from the package stringi:
library(stringi) # stringi_0.5-5
stri_split_boundaries(string, opts_brkiter = stri_opts_brkiter('sentence'))
[[1]]
[1] "Mr. " "Brown comes. "
[3] "He says hello. i give him coffee."
With the second approach, I then need to post-process the sentences to remove extra whitespace, or to split the resulting strings into sentences again. Can I tune the stringi function to improve the quality of the result?
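For the whitespace part, this is the kind of cleanup I have in mind (a minimal sketch; it only trims the trailing spaces that stringi leaves on each piece, it does not fix the wrong split at "Mr."):

```r
library(stringi)  # stringi_0.5-5

s <- "Mr. Brown comes. He says hello. i give him coffee."

# Split on ICU sentence boundaries, then trim trailing whitespace
# that the boundary analysis keeps attached to each sentence.
parts   <- stri_split_boundaries(s, opts_brkiter = stri_opts_brkiter('sentence'))[[1]]
cleaned <- stri_trim_right(parts)
```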
When it comes to big data, openNLP is (very) slow compared to stringi.
Is there a way to combine stringi (-> fast) and openNLP (-> good quality)?