regex - Regular expressions: replace a matched pattern with its length

Question

Suppose I have a string like this:

> x <- c("16^TG40")

I am trying to get c(16 2 40), where 2 is length(^TG)-1. I can find this pattern, for example, with:

> gsub("(\\^[ACGT]+)", " \\1 ", x)
[1] "16 ^TG 40"

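However, I cannot replace the matched string directly with its length minus 1. Is there an easier way to replace a matched pattern with its length?

After quite a bit of searching (here on SO and on Google), I ended up with the stringr package, which I think is great. However, it all boils down to finding the location of the pattern (using str_locate_all) and then replacing the substring with whatever value is wanted (using str_sub). I have more than 100,000 strings, and this is very time-consuming (since the pattern can also occur several times in a string).

I am currently running it in parallel to compensate for the slowness, but I would be glad to know whether this can be done directly (or quickly).

Any ideas?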

score 8 · Accepted Answer

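(1) gsubfn. The gsubfn statement replaces the ^... portion with its length surrounded by spaces, and strapply extracts the numbers from that string and converts them to numeric. Omit the strapply call if character output is sufficient.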

> library(gsubfn)
> xx <- gsubfn("\\^[ACGT]*", ~ sprintf(" %s ", nchar(x) - 1), x)
> strapply(xx, "\\d+", as.numeric)
[[1]]
[1] 16  2 40
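If adding a package is not an option, the number-extraction step can also be done in base R. This is a minimal sketch and not part of the original answer: `extract_numbers` is a hypothetical helper name, and it assumes the ^... parts have already been replaced by their lengths.

```r
# Hypothetical helper: pull every run of digits out of each string
# and convert it to numeric, using base R only.
extract_numbers <- function(s) {
  m <- gregexpr("[0-9]+", s)
  lapply(regmatches(s, m), as.numeric)
}

xx <- "16 2 40"
extract_numbers(xx)
```

Like strapply, this returns a list with one numeric vector per input string.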

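(2) Loop over a set of lengths

This assumes that the number of characters in each ACGT sequence is between mn and mx, and it simply uses gsub in a loop to replace each ACGT sequence that is i characters long with i. If there are only a few possible lengths, there are only a few iterations, so it will be fast; if the sequences can have many different lengths, it will be slow because more loop iterations are needed. Below we assume the ACGT sequences have length 2, 4 or 6, but this may need to be adjusted. A possible drawback of this solution is that it has to assume a set of possible sequence lengths.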

x <- "4^CG5^CAGT656"

mn <- 2
mx <- 6
y <- x
for(i in seq(mn, mx, 2)) {
   pat <- sprintf("\\^[ACGT]{%d}(\\d)", i)
   replacement <- sprintf(" %d \\1", i)
   y <- gsub(pat, replacement, y)
}
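The loop above can be wrapped in a function so that it runs over a whole character vector at once. The sketch below is an illustration only: `replace_by_lengths` and its `lengths` argument are hypothetical names, and it swaps the trailing `(\\d)` capture for a negative lookahead (which needs `perl = TRUE`) so that a sequence at the very end of a string is also replaced.

```r
# Hypothetical wrapper around the length loop. The lookahead (?![ACGT])
# stops a short length from matching the start of a longer sequence.
replace_by_lengths <- function(x, lengths) {
  y <- x
  for (i in as.integer(lengths)) {
    pat <- sprintf("\\^[ACGT]{%d}(?![ACGT])", i)
    y <- gsub(pat, sprintf(" %d ", i), y, perl = TRUE)
  }
  y
}

replace_by_lengths(c("4^CG5^CAGT656", "16^TG"), lengths = c(2, 4, 6))
## [1] "4 2 5 4 656" "16 2 "
```

Because gsub is vectorized, the number of loop iterations depends on how many candidate lengths there are, not on how many strings there are.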

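(3) Loop over the ACGT sequences

This loops over the ACGT sequences, replacing one at a time with its length until none are left. If there are only a few ACGT sequences it may be fast, because few iterations occur; if there are many ACGT sequences it will be slow because of the larger number of iterations.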

x <- "4^CG5^CAGT656"
y <- x
while(regexpr("^", y, fixed = TRUE) > 0) {
    y <- sprintf("%s %d %s", sub("\\^.*", "", y),
        nchar(sub("^[0-9 ]+\\^([ACGT]+).*", "\\1", y)),
        sub("^[0-9 ]+\\^[ACGT]+", "", y))
}
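The while loop above handles a single string. One way to apply it to several strings is to wrap it in a function and call it once per element. This is a sketch under assumptions: `replace_one` is a hypothetical name, and every string is assumed to start with digits before its first "^" (otherwise the loop would not terminate).

```r
# Hypothetical wrapper: the loop from (3), applied to one string.
replace_one <- function(s) {
  y <- s
  while (regexpr("^", y, fixed = TRUE) > 0) {
    before <- sub("\\^.*", "", y)                           # text before the first ^
    len <- nchar(sub("^[0-9 ]+\\^([ACGT]+).*", "\\1", y))   # length of the first ACGT run
    after <- sub("^[0-9 ]+\\^[ACGT]+", "", y)               # text after that run
    y <- sprintf("%s %d %s", before, len, after)
  }
  y
}

x <- c("4^CG5^CAGT656", "16^TG40")
vapply(x, replace_one, character(1), USE.NAMES = FALSE)
## [1] "4 2 5 4 656" "16 2 40"
```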

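Benchmark

Here is a benchmark. Note that in some of the solutions above I converted the strings to numbers (which of course takes extra time), but to make the benchmark comparable I compared the speed of creating the strings without any numeric conversion.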

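x <- "4^CGT5^CCA656"
library(rbenchmark)
## modFun is the helper defined in Josh's regmatches answer below
modFun <- function(ss) paste0(" ", nchar(ss) - 1, " ")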
benchmark(order = "relative", replications = 10000,
   columns = c("test", "replications", "relative", "elapsed"),
   regmatch = {
      pat <- "(\\^[ACGT]+)"
      x2 <- x
      m <- gregexpr(pat, x2)
      regmatches(x2, m) <- sapply(regmatches(x2, m), modFun)
      x2
   },
   gsubfn = gsubfn("\\^[ACGT]*", ~ sprintf(" %s ", nchar(x) - 1), x),
   loop.on.len = {
    mn <- 2
    mx <- 6
    y <- x
    for(i in seq(mn, mx, 2)) {
       pat <- sprintf("\\^[ACGT]{%d}(\\d)", i)
       replacement <- sprintf(" %d \\1", i)
       y <- gsub(pat, replacement, y)
    }
   },
   loop.on.seq = {
    y <- x
    while(regexpr("^", y, fixed = TRUE) > 0) {
        y <- sprintf("%s %d %s", sub("\\^.*", "", y),
            nchar(sub("^[0-9 ]+\\^([ACGT]+).*", "\\1", y)),
            sub("^[0-9 ]+\\^[ACGT]+", "", y))
    }
  }
)

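The results are shown below. The two loop solutions are the fastest on the input shown, but their performance will vary with how many iterations are needed, so real data may behave differently. The loop.on.len solution has the drawback that the ACGT lengths must be in the assumed set. The regmatch solution from Josh involves no loop and is fast. The gsubfn solution has the advantage that it is only one line of code and is particularly direct.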

         test replications relative elapsed
4 loop.on.seq        10000    1.000    1.93
3 loop.on.len        10000    1.140    2.20
1    regmatch        10000    1.803    3.48
2      gsubfn        10000    7.145   13.79
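The benchmark above repeats each solution on a single string, whereas the question involves more than 100,000 strings. A rough way to see how one vectorized call behaves at that size is to time it with base R's system.time. The sketch below uses made-up random input and times only the regmatches approach (with lapply in place of sapply); absolute timings depend on the machine, so none are shown.

```r
set.seed(1)  # made-up data, for illustration only
n <- 100000
x <- paste0(sample(1:99, n, replace = TRUE), "^",
            sample(c("TG", "CAGT", "GATTACA"), n, replace = TRUE),
            sample(1:99, n, replace = TRUE))

timing <- system.time({
  out <- x
  m <- gregexpr("\\^[ACGT]+", out)
  regmatches(out, m) <- lapply(regmatches(out, m),
                               function(ss) paste0(" ", nchar(ss) - 1, " "))
})

timing["elapsed"]   # seconds for one pass over all n strings
head(out, 3)
```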

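Update: Added two loop solutions and removed the solutions from the earlier post that do not handle multiple ACGT sequences (based on a comment clarifying the question). Also redid the benchmark to include only the solutions that handle multiple ACGT sequences.

Update: Removed a solution that does not work with multiple ^... sequences. It had already been removed from the benchmark, but the code had not been removed. Improved the explanation in (1).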

score 8 · Accepted Answer

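Here is a base-R approach.

The syntax is far from intuitive, but by closely following this template you can perform all manner of manipulations and replacements of matched substrings. (See ?gregexpr for some more complicated examples.)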

x2 <- x <- c("16^TG40", "16^TGCT40", "16^TG40^GATTACA40")

pat <- "(\\^[ACGT]+)"              ## A pattern matching substrings of interest
modFun <- function(ss) {           ## A function to modify them
    paste0(" ", nchar(ss) - 1, " ")
}

## Use regmatches() <- regmatches(gregexpr()) to search, modify, and replace.
m <- gregexpr(pat, x2)
regmatches(x2, m) <- sapply(regmatches(x2, m), modFun)
x2
## [1] "16 2 40"      "16 4 40"      "16 2 40 7 40"
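One caveat with the template above: sapply() simplifies its result to a matrix when every string has the same number of matches and that number is greater than one, so the replacement values may no longer line up one entry per string. A hedged variant that keeps the result a list is sketched below; `replace_with_length` is a hypothetical name, not part of the original answer.

```r
# Hypothetical wrapper: same search/modify/replace template,
# but with lapply() so the replacements stay a list, one entry per string.
replace_with_length <- function(x, pat = "\\^[ACGT]+") {
  m <- gregexpr(pat, x)
  regmatches(x, m) <- lapply(regmatches(x, m),
                             function(ss) paste0(" ", nchar(ss) - 1, " "))
  x
}

# Both strings contain exactly two matches, the case where sapply() simplifies.
replace_with_length(c("1^AC2^G3", "4^T5^GGG6"))
## [1] "1 2 2 1 3" "4 1 5 3 6"
```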

score 2 · Accepted Answer

I voted for the incredibly slick gsubfn answer, but since I already had this clunky code:

mod <- gsub("(\\^[ACGT]+)", " \\1 ", x)
locs <- gregexpr(" ", mod , fixed=TRUE)[[1]]
paste( substr( x, 1, locs[1]-1), 
       diff(locs)-2, 
       substr(mod, locs[2]+1, nchar(mod) ) , sep=" ")
#[1] "16 2 40"
