r - Extracting collocates from text/sentences

Question

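I have a large number of sentences, each of which contains at least one occurrence of "well". I would like to get a list of the two words immediately to the left of "well" and the two words immediately to its right. For example, for the sentence

"very well they all three get on well together"

the result should be left: "NA" "very" "get" "on"

right: "they" "all" "together" "NA"

I suspect that sub() and a regular expression would be useful here, but I don't know exactly how to assemble the query. How can this be done?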

score 0 · Accepted Answer

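A combination of quanteda and tidyr will get you there. Apart from magrittr, which supplies the pipe, I left out the library calls and used package prefixes instead, so you can see which function comes from which package.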

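text <- "very well they all three get on well together"

library(magrittr)  # provides the %>% pipe

text %>% 
  # recent quanteda versions expect a tokens object as input to kwic()
  quanteda::tokens() %>% 
  # keyword in context: two tokens on either side of each "well"
  quanteda::kwic("well", window = 2) %>% 
  data.frame() %>% 
  # split the context strings into one column per word, padding with NA
  tidyr::separate(pre, into = c("pre1", "pre2"), fill = "left") %>% 
  tidyr::separate(post, into = c("post1", "post2"), fill = "right")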

  docname from to pre1 pre2 keyword    post1 post2
1   text1    2  2 <NA> very    well     they   all
2   text1    8  8  get   on    well together  <NA>
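
Since the question asks for two plain vectors rather than a data frame, here is a minimal base R sketch that produces them without extra packages. It is not part of the approach above. It assumes the words are separated by whitespace and that "well" appears as a standalone lower-case token; the helper name word_at is made up for illustration.

```r
text <- "very well they all three get on well together"

# Split on whitespace (assumption: no punctuation attached to the words)
words <- strsplit(text, "\\s+")[[1]]

# Positions of the keyword
idx <- which(words == "well")

# Hypothetical helper: the word at position i, or NA if i is out of range
word_at <- function(i) {
  if (i >= 1 && i <= length(words)) words[i] else NA_character_
}

# Interleave the two neighbours of each hit, in reading order
left  <- as.vector(rbind(sapply(idx - 2, word_at), sapply(idx - 1, word_at)))
right <- as.vector(rbind(sapply(idx + 1, word_at), sapply(idx + 2, word_at)))

left   # NA     "very" "get"  "on"
right  # "they" "all"  "together" NA
```

For many sentences, the same steps could be wrapped in a function and applied with lapply(); punctuation and capitalised "Well" would need extra handling, which is where a tokenizer such as the one in quanteda helps.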
