13

Given a regular expression containing capture groups (parentheses) and a string, how can I obtain all the substrings matching the capture groups, i.e., the substrings usually referenced by "\1", "\2"?

Example: consider a regex capturing digits preceded by "xy":

s <- "xy1234wz98xy567"

r <- "xy(\\d+)"

Desired result:

[1] "1234" "567" 

First attempt: gregexpr:

regmatches(s,gregexpr(r,s))
#[[1]]
#[1] "xy1234" "xy567" 

Not what I want because it returns the substrings matching the entire pattern.

Second try: regexec:

regmatches(s,regexec("xy(\\d+)",s))
#[[1]]
#[1] "xy1234" "1234" 

Not what I want because it returns only the first occurrence of a match for the entire pattern and the capture group.

If there were a gregexec function, extending regexec as gregexpr extends regexpr, my problem would be solved.

So the question is: how to retrieve all substrings (or indices that can be passed to regmatches as in the examples above) matching capture groups in an arbitrary regular expression?

Note: the pattern for r given above is just a silly example, it must remain arbitrary.


3 Answers

12

For a base R solution, what about just using gsub() to finish processing the strings extracted by gregexpr() and regmatches()?

s <- "xy1234wz98xy567"
r <- "xy(\\d+)"

gsub(r, "\\1", regmatches(s,gregexpr(r,s))[[1]])
# [1] "1234" "567" 
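
As far as I know, a gregexec() function like the one wished for in the question was added to base R later (in R 4.1.0). A minimal sketch, assuming such a version is available; regmatches() then returns one matrix per input string, with the full matches in the first row and one further row per capture group:

```r
# Assumes R >= 4.1.0, where gregexec() is available in base R.
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"

# One matrix per element of s: row 1 holds the full matches,
# rows 2, 3, ... hold capture groups 1, 2, ...; one column per match.
m <- regmatches(s, gregexec(r, s))[[1]]

# Second row = first capture group: "1234" "567"
groups <- m[2, ]
print(groups)
```

Unlike the gsub() approach above, this does not re-apply the pattern to the extracted substrings, and it handles several capture groups at once.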
answered 2013-09-06T15:41:30.320
11

Not sure about doing this in base R, but here's a package for your needs:

library(stringr)

str_match_all(s, r)
#[[1]]
#     [,1]     [,2]  
#[1,] "xy1234" "1234"
#[2,] "xy567"  "567" 

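Many stringr functions also have parallels in base R, so you can achieve this without stringr as well.

For instance, here's a simplified version of how the above works, using base R: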

sapply(regmatches(s,gregexpr(r,s))[[1]], function(m) regmatches(m,regexec(r,m)))
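
The same two-step idea (find every full match, then re-match each one to split out its groups) can be wrapped in a small helper. This is only a sketch: capture_all is a made-up name, not a base R or stringr function, and it assumes the pattern still matches when re-applied to the extracted substring alone, which may not hold for patterns that rely on anchors or lookaround.

```r
# Hypothetical helper (illustrative name): returns a character matrix with
# one row per match and one column per capture group, for a single string.
capture_all <- function(pattern, text) {
  # Step 1: all substrings matching the entire pattern.
  full <- regmatches(text, gregexpr(pattern, text))[[1]]
  if (length(full) == 0) return(NULL)  # no match at all

  # Step 2: re-match each substring; regexec() keeps the capture groups.
  parts <- regmatches(full, regexec(pattern, full))

  # Drop the first element (the full match) and stack the groups row-wise.
  do.call(rbind, lapply(parts, function(p) p[-1]))
}

s <- "xy1234wz98xy567"
one_group <- capture_all("xy(\\d+)", s)         # 2 x 1 matrix
three_groups <- capture_all("(x)(y)(\\d+)", s)  # 2 x 3 matrix
print(one_group)
print(three_groups)
```

Wrapping the result in as.vector() gives the flat vector asked for in the question when there is only one capture group.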
answered 2013-09-04T18:11:08.317
8

strapplyc in the gsubfn package does that:

> library(gsubfn)
>
> strapplyc(s, r)
[[1]]
[1] "1234" "567" 

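Try ?strapplyc for more info and examples.

Related functions

1) A generalization of strapplyc is strapply in the same package. It takes a function that receives the captured portions of each match as input and returns the function's output. When the function is c, it reduces to strapplyc. For example, suppose we want the results returned as numbers: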

> strapply(s, r, as.numeric)
[[1]]
[1] 1234  567

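2) gsubfn is another related function in the same package. It is like gsub except that the replacement string can be a replacement function (or a replacement list or a replacement proto object). The replacement function receives the captured portions as input and outputs the replacement, which replaces the match in the input string. If a formula is used, as in this example, the right-hand side of the formula is taken as the function body. In this example we replace each match with XY{#}, where # is twice the matched input number.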

> gsubfn(r, ~ paste0("XY{", 2 * as.numeric(x), "}"), s)
[1] "XY{2468}wz98XY{1134}"

UPDATE: Added strapply and gsubfn examples.

answered 2013-09-05T03:24:33.640