
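I have a large number of sentences, each containing at least one occurrence of "well". I would like to obtain a list of the two words immediately to the left of "well" and the two words immediately to its right. For example, in the sentence

"very well they all three get on well together"

the result should be left: "NA" "very" "get" "on"

right: "they" "all" "together" "NA"

I suspect that sub() and a regular expression would be useful, but I don't know exactly how to assemble the query. How can this be done?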


1 Answer


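A combination of quanteda and tidyr will get you there. I left out the library calls for those two packages and used the package:: prefix instead, so you can see which statement comes from which package; magrittr is loaded only for the %>% pipe.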

text <- "very well they all three get on well together"

library(magrittr)

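text %>% 
  # tokenize first: newer quanteda releases expect a tokens object in kwic()
  quanteda::tokens() %>% 
  # keyword in context: up to two words on either side of each "well"
  quanteda::kwic("well", window = 2) %>% 
  data.frame() %>% 
  # split the context strings into one column per word, padding with NA
  tidyr::separate(pre, into = c("pre1", "pre2"), fill = "left") %>% 
  tidyr::separate(post, into = c("post1", "post2"), fill = "right")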

  docname from to pre1 pre2 keyword    post1 post2
1   text1    2  2 <NA> very    well     they   all
2   text1    8  8  get   on    well together  <NA>
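
Since the question asks about a regex route, here is a minimal base-R sketch for comparison. It is not part of the quanteda approach above: the helper name `context_words` and its arguments are made up for illustration, and it assumes that splitting on anything other than letters, digits, and apostrophes is an acceptable way to find words.

```r
# Hypothetical helper (illustration only): words around each occurrence of a keyword.
context_words <- function(sentence, keyword = "well", window = 2) {
  # Split into words on runs of non-alphanumeric characters (apostrophes are kept)
  words <- strsplit(sentence, "[^[:alnum:]']+")[[1]]
  words <- words[nzchar(words)]

  # Positions of the keyword, ignoring case
  hits <- which(tolower(words) == tolower(keyword))

  # Pad both ends with NA so positions past the sentence edges come back as NA;
  # word i of `words` sits at index i + window of `padded`
  pad <- rep(NA_character_, window)
  padded <- c(pad, words, pad)

  # One row per occurrence, one column per context position
  left  <- do.call(rbind, lapply(hits, function(i) padded[i + seq_len(window) - 1]))
  right <- do.call(rbind, lapply(hits, function(i) padded[i + window + seq_len(window)]))

  list(left = left, right = right)
}

text <- "very well they all three get on well together"
res <- context_words(text)
res$left
res$right
```

For the example sentence, `res$left` has the rows NA, "very" and "get", "on", and `res$right` has the rows "they", "all" and "together", NA. To apply it to many sentences, `lapply(sentences, context_words)` should work, where `sentences` stands for your own character vector.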
answered 2018-04-25T16:15:39.930