Here's how I'd do it:
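# same sentence as in the benchmark further down
str <- "The cow jumped over the moon with a silver plate in its mouth"
keyword <- "moon"
lookaround <- 2
# up to `lookaround` words before the keyword, and up to `lookaround` after it
pattern <- paste0("([[:alpha:]]+ ){0,", lookaround, "}", keyword,
                  "( [[:alpha:]]+){0,", lookaround, "}")
regmatches(str, regexpr(pattern, str))[[1]]
# [1] "over the moon with a"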
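The idea: match a run of alphabetic characters followed by a space, repeated between 0 and "lookaround" (here 2) times, then the "keyword" (here "moon"), then a space followed by a run of alphabetic characters, again repeated between 0 and "lookaround" times. The regexpr() function returns the start position and length of the match. regmatches() wraps that result and extracts the substring at that position.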
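To make the two steps concrete, here is a minimal sketch of what regexpr() returns and how the same substring could be pulled out by hand with substr(). The names m, start, len and manual are mine, chosen for illustration. The sentence is the one used in the benchmark in this answer.

```r
str <- "The cow jumped over the moon with a silver plate in its mouth"
pattern <- "([[:alpha:]]+ ){0,2}moon( [[:alpha:]]+){0,2}"

m <- regexpr(pattern, str)

start <- as.integer(m)           # 1-based position where the match begins
len <- attr(m, "match.length")   # number of characters in the match

# Roughly what regmatches() does for us: cut the substring out by position
manual <- substr(str, start, start + len - 1)

identical(manual, regmatches(str, m))
```

If the keyword is not found, regexpr() returns -1, and regmatches() simply returns nothing for that string.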
Note: if you want to find more than one occurrence of the pattern in the same string, replace regexpr() with gregexpr().
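As a rough illustration of the gregexpr() variant, the sketch below uses a made-up sentence (str2 is hypothetical, not from the question) in which the keyword appears twice. gregexpr() returns a list with one element per input string, so [[1]] picks out the matches for the first string.

```r
# Hypothetical sentence containing the keyword twice
str2 <- "The moon rose and the cow jumped over the moon with a silver plate"

keyword <- "moon"
lookaround <- 2
pattern <- paste0("([[:alpha:]]+ ){0,", lookaround, "}", keyword,
                  "( [[:alpha:]]+){0,", lookaround, "}")

# One list element per input string; each holds every match in that string
all_matches <- regmatches(str2, gregexpr(pattern, str2))[[1]]
all_matches
```

Matches do not overlap, so when two occurrences of the keyword sit within "lookaround" words of each other, the first match can consume words that would otherwise be context for the second.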
Here's a benchmark on larger data comparing Hong's answer with this one:
str <- "The cow jumped over the moon with a silver plate in its mouth"
ll <- rep(str, 1e5)
hong <- function(str) {
  str <- strsplit(str, " ")
  sapply(str, function(y) {
    i <- which(y=="moon")
    paste(y[seq(max(1, (i-2)), min((i+2), length(y)))], collapse= " ")
  })
}
arun <- function(str) {
  keyword <- "moon"
  lookaround <- 2
  pattern <- paste0("([[:alpha:]]+ ){0,", lookaround, "}", keyword,
                    "( [[:alpha:]]+){0,", lookaround, "}")
  regmatches(str, regexpr(pattern, str))
}
library(microbenchmark)
microbenchmark(t1 <- hong(ll), t2 <- arun(ll), times=10)
# Unit: seconds
#            expr      min       lq   median       uq      max neval
#  t1 <- hong(ll) 6.172986 6.384981 6.478317 6.654690 7.193329    10
#  t2 <- arun(ll) 1.175950 1.192455 1.200674 1.227279 1.326755    10
identical(t1, t2) # [1] TRUE