regex - Searching strings with a pattern that mixes grammar (POS) and regular expressions

Question

I would like to use R to search text for patterns expressed as a mix of POS tags and literal strings. (I have seen this feature in a Python library here: http://www.clips.ua.ac.be/pages/pattern-search.)

For example, a search pattern could be 'NOUNPHRASE be|is|was ADJECTIVE than NOUNPHRASE', and it should return all strings that contain a structure such as "a cat is faster than a dog".

I know that packages like openNLP and qdap offer convenient POS tagging. Has anyone used their output for this kind of pattern matching?

score 2 · Accepted Answer

As a starting point, using koRpus and TreeTagger:

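library(koRpus)  # treetag(), taggedText()
library(tm)      # stopwords()

mytxt <- c("This is my house.", "A house is better than no house.", "A cat is faster than a dog.")
# Regex over each sentence's tag descriptions: noun ... comparative adjective ... noun
pattern <- "Noun, singular or mass.*?Adjective, comparative.*?Noun, singular or mass"

# POS-tag the text with TreeTagger (the path points to the local TreeTagger installation)
tagged.results <- treetag(file = mytxt, treetagger="C:/TreeTagger/bin/tag-english.bat", lang="en", format="obj", stopwords=stopwords("en"))
# Remove stopwords from the tagged text
tagged.results <- kRp.filter.wclass(tagged.results, "stopword")
# Sentence id: the counter increases after every sentence-ending punctuation token
taggedText(tagged.results)$id <- factor(head(cumsum(c(0, taggedText(tagged.results)$desc == "Sentence ending punctuation")) + 1, -1))

# Collapse the tag descriptions of each sentence into one string and match the pattern
tag.strings <- aggregate(desc~id, taggedText(tagged.results), FUN = function(x) paste(x, collapse = " "))$desc
setNames(mytxt, grepl(pattern, tag.strings))
#               FALSE                               TRUE                               TRUE 
# "This is my house." "A house is better than no house."      "A cat is faster than a dog."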
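
The code above matches on tag descriptions only. To get closer to the mixed pattern from the question, one possible approach is to write every sentence as a string of token/TAG pairs and translate the search pattern into a regular expression over that string. The following is only a minimal sketch under stated assumptions: the data frame is tagged by hand with Penn Treebank tags so that it runs without TreeTagger, and `placeholders` and `compile_pattern` are hypothetical names for illustration, not part of koRpus or any other package.

```r
# Hand-tagged tokens (assumed values). In practice the token and tag columns
# would come from a tagger, e.g. from taggedText(tagged.results).
tagged <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
  token = c("This", "is", "my", "house",
            "A", "cat", "is", "faster", "than", "a", "dog"),
  tag   = c("DT", "VBZ", "PRP$", "NN",
            "DT", "NN", "VBZ", "JJR", "IN", "DT", "NN"),
  stringsAsFactors = FALSE
)

# One "token/TAG token/TAG ..." string per sentence.
sentences <- tapply(paste(tagged$token, tagged$tag, sep = "/"), tagged$id,
                    paste, collapse = " ")

# Hypothetical placeholder definitions, written as regexes over token/TAG pairs.
placeholders <- c(
  NOUNPHRASE = "(?:\\S+/DT )?(?:\\S+/JJ )*\\S+/NNS?",  # optional determiner, adjectives, noun
  ADJECTIVE  = "\\S+/JJ[RS]?"                          # plain, comparative or superlative
)

# Placeholders expand to their tag regex; every other element is treated as a
# literal word (alternatives such as be|is|was are allowed) carrying any tag.
compile_pattern <- function(pattern) {
  parts <- strsplit(pattern, " ", fixed = TRUE)[[1]]
  regex <- ifelse(parts %in% names(placeholders),
                  placeholders[parts],
                  paste0("(?:", parts, ")/\\S+"))
  paste(regex, collapse = " ")
}

rx   <- compile_pattern("NOUNPHRASE be|is|was ADJECTIVE than NOUNPHRASE")
hits <- grepl(rx, sentences, perl = TRUE, ignore.case = TRUE)
sentences[hits]
```

Note that this sketch keeps stopwords, because literal words such as "is" and "than" are part of the pattern. The noun phrase definition is deliberately simple and would need to be extended for real text.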
