regex - strsplit inconsistent with gregexpr

Question

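A comment on my answer to this question suggests a regex that should give the desired result with strsplit, but it does not, even though the regex appears to match only the first and last commas in the character vector. This can be demonstrated with gregexpr and regmatches.

So why does strsplit split on every comma in this example, when regmatches returns only two matches for the same regex?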

#  We would like to split on the first comma and
#  the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"

#  Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34"  "56"  "78"  "90" 


#  Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )

#  Matching positions are at
unlist(m)
#[1]  4 13

#  And extracting them...
regmatches( x , m )
#[[1]]
#[1] "," ","

Huh?! What is going on?

score 10 · Accepted Answer

@Aprillion's theory is exactly right. From the R documentation:

The algorithm applied to each input string is

repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
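
The documented loop can be emulated in a few lines of R to see why every comma ends up being a split point. This is only a rough sketch, not the actual strsplit implementation: split_like_strsplit is a hypothetical helper name, it handles a single string, and it simply stops on zero-length matches instead of reproducing how strsplit treats them.

```r
# Hypothetical helper: a sketch of the documented loop, not R's real code.
split_like_strsplit <- function(s, pattern) {
  out <- character(0)
  repeat {
    # if the string is empty, break
    if (nchar(s) == 0) break
    m <- regexpr(pattern, s, perl = TRUE)
    pos <- as.integer(m)
    len <- attr(m, "match.length")
    if (pos != -1) {
      # Assumption of this sketch: zero-length matches are not handled.
      if (len == 0) stop("zero-length match not handled in this sketch")
      # add the string to the left of the match to the output
      out <- c(out, substr(s, 1, pos - 1))
      # remove the match and all to the left of it
      s <- substr(s, pos + len, nchar(s))
    } else {
      # no match: add the remaining string and stop
      out <- c(out, s)
      break
    }
  }
  out
}

x <- "123,34,56,78,90"
split_like_strsplit(x, '^\\w+\\K,|,(?=\\w+$)')
```

Because the regex is re-applied to the shortened string on every pass, the ^ anchor is satisfied at the start of each remainder ("34,56,78,90", then "56,78,90", and so on), so the sketch should reproduce the five pieces that strsplit returned in the question.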

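In other words, at each iteration ^ matches the beginning of a new string (one without the pieces that have already been split off). In the question's regex, the ^\w+\K, alternative therefore matches the first comma of whatever remains, over and over.

A simple illustration of this behavior: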

> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""

Here you can see the consequence of this behavior when a lookahead assertion is used as the delimiter (thanks to @JoshO'Brien for the link).
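
If the goal is to split only at the positions that gregexpr reports, one possible workaround (not part of the answer above, just a sketch) is to keep the gregexpr match data and ask regmatches for the non-matched pieces with invert = TRUE. The matches are computed once against the whole string, so ^ only applies to the true start.

```r
x <- "123,34,56,78,90"

# Match once against the full string: positions 4 and 13, as in the question.
m <- gregexpr('^\\w+\\K,|,(?=\\w+$)', x, perl = TRUE)

# invert = TRUE returns the substrings between the matches.
pieces <- regmatches(x, m, invert = TRUE)[[1]]
pieces
# Expected pieces: "123", "34,56,78", "90"
```

This avoids the repeated re-anchoring that strsplit performs, at the cost of an extra step.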
