regex - str_extract_all returns non-capturing groups

Question

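I'm trying to extract values from some text in R using str_extract_all from the stringr package, and I'd like to use Perl-style non-capturing groups (?:...) in the regex to extract and clean the relevant values in one line.

When I run this code: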

library(stringr)

## Example string.
## Not the real string, but I get the same results with this one.
x <- 'WIDTH 4\nsome text that should not be matched.\n\nWIDTH   46 some text.'

## extract values
str_extract_all(x, perl('(?:WIDTH\\s+)[0-9]+'))

I would like to get this result:

[[1]]
[1] "4"    "46"

But I get this instead:

[[1]]
[1] "WIDTH 4"    "WIDTH   46"

What am I doing wrong?

score 5 · Accepted Answer

The regex still matches WIDTH; it just doesn't put it into a capture group. Your regex is equivalent to

WIDTH\s+[0-9]+

Your code extracts the entire substring matched by the regex. Whether a group is capturing or non-capturing doesn't change that.
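To illustrate the difference between the whole match and a group, here is a minimal sketch using stringr's str_match_all, which returns the groups alongside the full match. It is one possible alternative rather than the approach the answer goes on to describe; the variable names m and widths are only for illustration.

```r
library(stringr)

x <- 'WIDTH 4\nsome text that should not be matched.\n\nWIDTH   46 some text.'

# str_match_all returns one character matrix per input string:
# column 1 holds the whole match, column 2 the first capturing group.
m <- str_match_all(x, "WIDTH\\s+([0-9]+)")[[1]]

# Keep only the captured digits.
widths <- m[, 2]
widths
```

Here the parentheses around [0-9]+ form a capturing group, so the digits can be picked out of the second column while WIDTH and the whitespace stay in the first.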

You can use a lookbehind to assert that some string occurs before the current position without including it in the matched substring:

(?<=WIDTH\s)[0-9]+

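Depending on the exact regex engine, you may not be able to use a variable-length pattern such as WIDTH\s+ inside a lookbehind. There is another form that allows it: \K, which drops everything matched so far from the reported match: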

WIDTH\s+\K[0-9]+
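The two patterns above can be compared with a short sketch. It assumes base R's gregexpr with perl = TRUE (PCRE) instead of stringr, since support for \K depends on the regex engine in use; the variable names single_space and all_widths are only for illustration.

```r
x <- 'WIDTH 4\nsome text that should not be matched.\n\nWIDTH   46 some text.'

# Fixed-length lookbehind: the digits must be preceded by WIDTH plus
# exactly one whitespace character, so "WIDTH   46" is not matched.
single_space <- regmatches(x, gregexpr("(?<=WIDTH\\s)[0-9]+", x, perl = TRUE))[[1]]

# \K form: WIDTH and any amount of whitespace are matched, then dropped
# from the reported match, leaving only the digits.
all_widths <- regmatches(x, gregexpr("WIDTH\\s+\\K[0-9]+", x, perl = TRUE))[[1]]

single_space
all_widths
```

With the example string, the lookbehind version misses the second value because three spaces separate WIDTH from 46, while the \K version picks up both.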

score 2 · Accepted Answer

The Perl zero-width regular expression is wrong.

Here is a solution that does not require Perl regular expressions:

sub("WIDTH\\s+", "", str_extract_all(x, 'WIDTH\\s+[0-9]+')[[1]])

Or, more simply:

library(gsubfn)
strapplyc(x, "WIDTH\\s+(\\d+)")

Also, if we want the result returned as numeric, we can use:

strapply(x, "WIDTH\\s+(\\d+)", as.numeric)
