regex - Regex matching with gregexpr in R

Question

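I am trying to count the instances of 3 consecutive "a" characters, "aaa".

The string consists of lowercase letters, e.g. "abaaaababaaa".

I tried the following code, but the behavior is not what I want.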

x <- "abaaaababaaa"
gregexpr("aaa", x)

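I want the match to return 3 instances of "aaa", not 2.

Assuming indexing starts at 1:

The first occurrence of "aaa" is at index 3.
The second occurrence of "aaa" is at index 4. (This one is not captured by gregexpr.)
The third occurrence of "aaa" is at index 10.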

score 6 · Accepted Answer

To capture overlapping matches, you can use a lookahead like this:

gregexpr("a(?=aa)", x, perl=TRUE)

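However, each match is now just a single "a", which may complicate further processing of the matches, especially if you are not always looking for fixed-length patterns.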
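One way around the single-character matches is to treat the lookahead hits purely as start positions and cut the full substrings out afterwards. The following is a minimal sketch, assuming the fixed-length pattern "aaa" from the question; the names starts, n and found are illustrative and not part of the answer above.

```r
x <- "abaaaababaaa"

# Each hit is a single "a" that is immediately followed by "aa"
m <- gregexpr("a(?=aa)", x, perl = TRUE)[[1]]

# Drop the attributes to keep only the start positions.
# gregexpr reports -1 when nothing matches; this sketch only guards the count.
starts <- as.integer(m)
n <- sum(starts > 0)

# Rebuild the full 3-character matches from the start positions
found <- substring(x, starts, starts + 2)

print(starts)
print(n)
print(found)
```

For patterns of varying length this no longer works directly, because the end of each match is not known from the start position alone.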

score 1 · Answer

I know I'm late, but I'd like to share this solution:

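your.string <- "abaaaababaaa"
# number of 3-character windows in the string
nc <- nchar(your.string) - 2
# split the string into single characters
x <- unlist(strsplit(your.string, NULL))
# build every window of 3 consecutive characters, one per start index
x2 <- character(nc)
for (i in 1:nc)
  x2[i] <- paste(x[i], x[i+1], x[i+2], sep="")
# indices of the windows that contain "aaa"
hits <- grep("aaa", x2)
cat("occurrences of <aaa> in <your.string> is,",
    length(hits), "and they are at index", hits)
# occurrences of <aaa> in <your.string> is, 3 and they are at index 3 4 10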

Largely inspired by this answer from Fran on R-help.
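The same sliding-window idea can also be written without an explicit loop, since substring is vectorized over its start and end positions. This is a sketch of that variant rather than the code from the answer; the names k, starts, windows and hits are illustrative, and it assumes the string has at least k characters.

```r
your.string <- "abaaaababaaa"
k <- 3  # window length, equal to the length of the pattern "aaa"

# One start index per possible window
starts <- seq_len(nchar(your.string) - k + 1)

# All k-character windows in a single vectorized call
windows <- substring(your.string, starts, starts + k - 1)

# Start indices of the windows that are exactly "aaa"
hits <- which(windows == "aaa")

print(length(hits))
print(hits)
```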

score 0 · Answer

Here is a way to extract all overlapping matches of varying length using gregexpr.

x<-"abaaaababaaa"
# nest in lookahead + capture group
# to get all instances of the pattern "(ab)|b"
matches<-gregexpr('(?=((ab)|b))', x, perl=TRUE)
# regmatches will reference the match.length attr. to extract the strings
# so move match length data from 'capture.length' to 'match.length' attr
attr(matches[[1]], 'match.length') <- as.vector(attr(matches[[1]], 'capture.length')[,1])
# extract substrings
regmatches(x, matches)
# [[1]]
# [1] "ab" "b"  "ab" "b"  "ab" "b"

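The trick is to wrap the pattern in a capture group, and to wrap that capture group in a lookahead assertion. gregexpr returns a list containing the start positions, with an attribute capture.length: a matrix whose first column holds the match lengths of the first capture group. If you convert that column to a vector and move it into the match.length attribute (which is all zeros, because the entire pattern sits inside the lookahead assertion), you can pass the result to regmatches to extract the strings.

As the type of the final result hints, with a few modifications this can be vectorized for the case where x is a list of strings.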
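To make the attributes involved more concrete, the sketch below inspects them for the same pattern and then extracts the substrings with substring instead of regmatches. It relies on the capture.start attribute that gregexpr also attaches when perl=TRUE; the names m, cap_start, cap_len and pieces are illustrative only.

```r
x <- "abaaaababaaa"
m <- gregexpr("(?=((ab)|b))", x, perl = TRUE)[[1]]

# The overall match is zero-length, since the whole pattern is a lookahead
zero_len <- attr(m, "match.length")

# Column 1 of each matrix describes the outer capture group
cap_start <- as.vector(attr(m, "capture.start")[, 1])
cap_len   <- as.vector(attr(m, "capture.length")[, 1])

# Cut out each captured substring from its start and length
pieces <- substring(x, cap_start, cap_start + cap_len - 1)

print(zero_len)
print(cap_start)
print(cap_len)
print(pieces)
```

This variant leaves the match object untouched, at the cost of not going through regmatches.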

x<-list(s1="abaaaababaaa", s2="ab")
matches<-gregexpr('(?=((ab)|b))', x, perl=TRUE)
# make a function that replaces match.length attr with capture.length
set.match.length<-
function(x) structure(x, match.length=as.vector(attr(x, 'capture.length')[,1]))
# set match.length to capture.length for each match object
matches<-lapply(matches, set.match.length)
# extract substrings
mapply(regmatches, x, lapply(matches, list))
# $s1
# [1] "ab" "b"  "ab" "b"  "ab" "b" 
# 
# $s2
# [1] "ab" "b"
