regex - Regular expression matching inside parentheses

Question

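I am trying to get some regular expressions that I wrote for Python to also work in R.

Here is what I have in Python (using the excellent re module), along with the 3 matches I expect: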

import re
line = 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
re.findall('"(.*?)"', line)
# ['First [T]', 'Second [L]', 'Third [1/T]']

Now in R, this is my best attempt:

line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
m <- gregexpr('"(.*?)"', line)
regmatches(line, m)[[1]]
# [1] "\"First [T]\""   "\"Second [L]\""  "\"Third [1/T]\""

Why does R return the whole pattern match rather than just the part inside the parentheses? I was expecting:

[1] "First [T]"   "Second [L]"  "Third [1/T]"

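Furthermore, perl=TRUE makes no difference. Is it safe to assume that R's regex functions never return only the parenthesized group, or is there some trick I am missing?

Solution summary: thanks to @flodel. The approach also appears to work for other patterns, so it seems to be a good general solution. A compact form of the solution, using an input string line and a regular expression pattern pat, is: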

pat <- '"(.*?)"'
sub(pat, "\\1", regmatches(line, gregexpr(pat, line))[[1]])

Also, perl=TRUE should be added to the gregexpr call if pat uses PCRE features.
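The compact form above can be wrapped in a small helper so that it works on a whole character vector. This is only a sketch: the function name extract_group and its perl argument are made up for illustration, and it assumes pat contains exactly one capture group of interest (referenced as \\1).

```r
# Hypothetical helper (not part of base R): returns, for each element of x,
# the text captured by the first group of pat in every match.
extract_group <- function(x, pat, perl = FALSE) {
  # Whole matches (including any text outside the parentheses)
  matches <- regmatches(x, gregexpr(pat, x, perl = perl))
  # Re-apply the pattern to each whole match and keep only group 1
  lapply(matches, function(z) sub(pat, "\\1", z, perl = perl))
}

line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
result <- extract_group(line, '"(.*?)"')
print(result[[1]])
```

The result is a list with one character vector per input string, which mirrors what regmatches returns for a gregexpr match.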

score 3 · Accepted Answer

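If you print m, you will see that gregexpr(..., perl = TRUE) gives you the positions and lengths of the matches for a) your full pattern, including the leading and closing quotes, and b) the captured (.*?).

Unfortunately, when m is passed to regmatches, it uses the positions and lengths of the former.

I can think of two solutions.

Pass your final output through sub: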
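To make the difference between the two sets of positions concrete, here is a minimal sketch that pulls both out of m and compares them. The variable names (whole_start, capture_start, and so on) are chosen only for this illustration.

```r
line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
m <- gregexpr('"(.*?)"', line, perl = TRUE)

# Start positions and lengths of the whole match (quotes included);
# these are the values regmatches() reads.
whole_start  <- as.vector(m[[1]])
whole_length <- as.vector(attr(m[[1]], "match.length"))

# Start positions and lengths of the capture group;
# these attributes are only present when perl = TRUE.
capture_start  <- as.vector(attr(m[[1]], "capture.start"))
capture_length <- as.vector(attr(m[[1]], "capture.length"))

# Each capture starts one character after its opening quote and is
# two characters shorter than the whole match.
print(capture_start - whole_start)
print(whole_length - capture_length)
```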

line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
m <- gregexpr('"(.*?)"', line, perl = TRUE)
z <- regmatches(line, m)[[1]]
sub('"(.*?)"', "\\1", z)

Or use substring with the positions and lengths of the captured expression:

start.pos <- attr(m[[1]], "capture.start")
end.pos   <- start.pos + attr(m[[1]], "capture.length") - 1L
substring(line, start.pos, end.pos)

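To deepen your understanding, see what happens when your pattern tries to capture more than one thing. Also note that you can give your capture groups names, here "capture1" and "capture2":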

m <- gregexpr('"(?P<capture1>.*?) \\[(?P<capture2>.*?)\\]"', line, perl = TRUE)
m

start.pos <- attr(m[[1]], "capture.start")
end.pos   <- start.pos + attr(m[[1]], "capture.length") - 1L

substring(line, start.pos[, "capture1"],
                  end.pos[, "capture1"])
# [1] "First"  "Second" "Third" 

substring(line, start.pos[, "capture2"],
                  end.pos[, "capture2"])
# [1] "T"   "L"   "1/T"

score 2 · Accepted Answer

1) strapplyc in the gsubfn package behaves the way you expect:

> library(gsubfn)
> strapplyc(line, '"(.*?)"')[[1]]
[1] "First [T]"   "Second [L]"  "Third [1/T]"

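2) Although it involves digging into m's attributes, regmatches can be made to do the job by reconstructing m so that it refers to the captures rather than the whole match. This requires that m was created with gregexpr(..., perl = TRUE), because the capture attributes are only present in that case: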

at <- attributes( m[[1]] )
m2 <- list( structure( c(at$capture.start), match.length = at$capture.length ) )

regmatches( line, m2 )[[1]]

3) If we know that the strings always end in ] and are willing to modify the regular expression, then this works:

> m3 <- gregexpr('[^"]*]', line)
> regmatches( line, m3 )[[1]]
[1] "First [T]"   "Second [L]"  "Third [1/T]"
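As a further option, not mentioned in the answers above, newer versions of R (4.1.0 or later, to my knowledge) provide gregexec, whose result regmatches turns into a matrix holding the whole matches together with the captures. A minimal sketch, assuming such a version of R is available:

```r
line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'

# One matrix per input string: row 1 holds the whole matches,
# each following row holds one capture group; columns are the matches.
res <- regmatches(line, gregexec('"(.*?)"', line))[[1]]

whole    <- unname(res[1, ])  # matches including the quotes
captured <- unname(res[2, ])  # contents of the parentheses only
print(captured)
```

This avoids both the second sub pass and the manual handling of attributes, at the cost of depending on a more recent R.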
