regex - Why does strsplit treat positive lookahead and lookbehind assertion matches differently?

Question

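Common sense and a sanity check using gregexpr() indicate that the lookbehind and lookahead assertions below should each match at exactly one location in testString: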

testString <- "text XX text"
BB  <- "(?<= XX )"
FF  <- "(?= XX )"

as.vector(gregexpr(BB, testString, perl=TRUE)[[1]])
# [1] 9
as.vector(gregexpr(FF, testString, perl=TRUE)[[1]][1])
# [1] 5

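strsplit(), however, uses those match locations differently. It splits testString at one location when given the lookbehind assertion, but at two locations when given the lookahead assertion, and the second of those splits looks incorrect.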

strsplit(testString, BB, perl=TRUE)
# [[1]]
# [1] "text XX " "text"    

strsplit(testString, FF, perl=TRUE)
# [[1]]
# [1] "text"    " "       "XX text"

I have two questions: (Q1) What is going on here? And (Q2) how can strsplit() be made to behave better?

Update: Theodore Lytras's excellent answer explains what is going on, and so addresses (Q1). My answer builds on his to identify a remedy, addressing (Q2).

score 29 · Accepted Answer

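I am not sure this qualifies as a bug, because I believe it is the expected behavior according to the R documentation. From ?strsplit:

The algorithm applied to each input string is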
repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
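Note that this means that if there is a match at the beginning of a (non-empty) string, the first element of the output is '""', but if there is a match at the end of the string, the output is the same as with the match removed.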

The problem is that lookahead (and lookbehind) assertions have zero length. Consider this case, for example:

FF <- "(?=funky)"
testString <- "take me to funky town"

gregexpr(FF,testString,perl=TRUE)
# [[1]]
# [1] 12
# attr(,"match.length")
# [1] 0
# attr(,"useBytes")
# [1] TRUE

strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "take me to " "f"           "unky town"

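What happens is that the lone lookahead (?=funky) matches at position 12. The first split therefore contains the string up to position 11 (everything to the left of the match), and that text is removed from the string together with the match, which, however, has zero length.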

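Now the remaining string is funky town, and the lookahead matches at position 1. But there is nothing to remove, because nothing lies to the left of the match and the match itself has zero length. Followed literally, the algorithm would be stuck in an infinite loop. Apparently R resolves this by splitting off a single character, which incidentally is the documented behavior when strsplit is used with an empty regex (that is, with argument split=""). After this, the remaining string is unky town, and since it contains no match it is returned as the last split.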
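The loop described above can be sketched in plain R. This is only an illustration, under the assumption that a single character is split off whenever a match ends before position 1. `split_sketch` is a hypothetical helper written for this explanation, not R's actual implementation.

```r
# Hypothetical helper: emulates the loop quoted from ?strsplit, plus the
# assumed one-character fallback for a zero-length match at position 1.
split_sketch <- function(x, pattern) {
  out <- character(0)
  while (nzchar(x)) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)
    if (start == -1L) {
      # No match: add the remaining string to the output and stop.
      out <- c(out, x)
      break
    }
    end <- start + attr(m, "match.length") - 1L  # last position covered
    if (end >= 1L) {
      # Add the text left of the match, then drop it and the match.
      out <- c(out, substr(x, 1L, start - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # Zero-length match at position 1: nothing could be removed,
      # so split off one character to make progress (assumption).
      out <- c(out, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  out
}

split_sketch("take me to funky town", "(?=funky)")
# [1] "take me to " "f"           "unky town"
split_sketch("text XX text", "(?<= XX )")
# [1] "text XX " "text"
```

Under these assumptions the sketch reproduces both the stray one-character piece for the lookahead and the clean split for the lookbehind.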

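Lookbehinds cause no problem, because each match is split off and removed from the remaining string together with the text the lookbehind depends on. The assertion therefore cannot match again at the start of the remainder, and the algorithm never gets stuck.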

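Admittedly, this behavior looks odd at first glance. Behaving otherwise, however, would violate the assumption that lookaheads have zero length. Given that the strsplit algorithm is documented, I believe this does not meet the definition of a bug.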

score 17 · Accepted Answer

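Based on Theodore Lytras's careful explanation of strsplit()'s behavior, a reasonably clean workaround is to prefix the lookahead assertion you want to match with a positive lookbehind assertion that matches any single character: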

testString <- "take me to funky town"
FF2 <- "(?<=.)(?=funky)"
strsplit(testString, FF2, perl=TRUE)
# [[1]]
# [1] "take me to " "funky town"
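As a quick check, the same prefix can be applied to the pattern from the question and to a string that begins with the search term. The names `FF3`, `res1` and `res2` are introduced here for illustration only. The results in the comments follow from the algorithm quoted from ?strsplit, on the assumption that the lookbehind cannot see text that has already been removed.

```r
# Pattern from the question, prefixed with the any-character lookbehind.
FF3 <- "(?<=.)(?= XX )"
res1 <- strsplit("text XX text", FF3, perl = TRUE)[[1]]
res1
# [1] "text"     " XX text"

# A string that starts with the search term: (?<=.) cannot match at
# position 1, so no one-character piece is split off.
res2 <- strsplit("funky take me to funky funky town",
                 "(?<=.)(?=funky)", perl = TRUE)[[1]]
res2
# [1] "funky take me to " "funky "            "funky town"
```

Because the combined pattern is still zero-length, no characters are lost: pasting the pieces back together gives the original string.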

score 5 · Accepted Answer

This looks like a bug to me. It does not seem to be specific to spaces. It appears to affect any lone lookahead, whether positive or negative:

FF <- "(?=funky)"
testString <- "take me to funky town"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "take me to " "f"           "unky town"  

FF <- "(?=funky)"
testString <- "funky take me to funky funky town"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "f"                "unky take me to " "f"                "unky "           
# [5] "f"                "unky town"       


FF <- "(?!y)"
testString <- "xxxyxxxxxxx"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "xxx"       "y"       "xxxxxxx"

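It seems to work fine if the pattern is given something to actually match (consume) along with the zero-width assertion, for example: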

FF <- " (?=XX )"
testString <- "text XX text"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "text"    "XX text"

FF <- "(?= XX ) "
testString <- "text XX text"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "text"    "XX text"

Perhaps something along these lines could serve as a workaround.
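One trade-off is worth noting: the character consumed by the pattern is dropped from the output. The short sketch below compares a consuming pattern with a purely zero-width one that combines a lookbehind and a lookahead. The variable names are illustrative.

```r
testString <- "text XX text"

# Consuming pattern: the space before "XX" is part of the match and is lost.
consumed <- strsplit(testString, " (?=XX )", perl = TRUE)[[1]]
paste(consumed, collapse = "")
# [1] "textXX text"

# Zero-width pattern (lookbehind plus lookahead): nothing is removed.
kept <- strsplit(testString, "(?<=.)(?=XX )", perl = TRUE)[[1]]
paste(kept, collapse = "")
# [1] "text XX text"
```

Which variant fits depends on whether the delimiter character should survive in the pieces.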
