I use regexpr with perl=T for this sort of thing (I haven't yet found a nicer way to extract the capture groups):
m <- regexpr('^Name: *(.+), *College: (.+) *$',
             vectorOfStrings, perl=T)
# m looks like this:
# [1] 1 1 1
# attr(,"match.length")
# [1] 31 46 53
# attr(,"useBytes")
# [1] TRUE
# attr(,"capture.start") # one column per capturing bracket,
# [1,] 7 24 # one row per entry in vectorOfStrings
# [2,] 7 33
# [3,] 7 34
# attr(,"capture.length")
# [1,] 6 8
# [2,] 15 14
# [3,] 16 20
# attr(,"capture.names")
# [1] "" ""
# laziness
st <- attr(m, 'capture.start')
en <- st + attr(m, 'capture.length') - 1
numCaptures <- ncol(st)
matches <- sapply(1:numCaptures, function (i) {
    return(substr(vectorOfStrings, st[, i], en[, i]))
})
# matches
# [,1] [,2]
# [1,] "Andrew" "Bradford"
# [2,] "Charlie Daniels" "Easton College"
# [3,] "Frank Gehry, III" "Highlands University"
Now massage matches into whatever form you want. I usually wrap this in a function, since I use it so often.
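As a rough sketch of what such a wrapper might look like: the function name extractCaptures and its return shape are my own illustrative choices, not anything from base R, and the sample vectorOfStrings is reconstructed from the positions and results printed above, so treat it as an assumption about the original input.

```r
# Assumed input, reconstructed from the output shown earlier.
vectorOfStrings <- c("Name: Andrew, College: Bradford",
                     "Name: Charlie Daniels, College: Easton College",
                     "Name: Frank Gehry, III, College: Highlands University")

# Hypothetical helper: returns a character matrix with one row per input
# string and one column per capture group.
extractCaptures <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  st <- attr(m, "capture.start")
  if (is.null(st)) stop("pattern contains no capture groups")
  en <- st + attr(m, "capture.length") - 1

  # st and en are matrices stored column by column, so repeat x once per
  # capture group to line each string up with its start/end positions.
  out <- matrix(substring(rep(x, times = ncol(st)), st, en),
                nrow = length(x), ncol = ncol(st))

  # Keep group names when the pattern uses (?<name>...) syntax.
  nm <- attr(m, "capture.names")
  if (any(nzchar(nm))) colnames(out) <- nm

  # Strings that did not match (or were NA) get NA rather than "".
  noMatch <- is.na(m) | as.integer(m) == -1L
  out[noMatch, ] <- NA_character_
  out
}

pattern <- "^Name: *(.+), *College: (.+) *$"
captures <- extractCaptures(pattern, vectorOfStrings)
print(captures)
```

Building the matrix explicitly, rather than relying on sapply to simplify, keeps the result a matrix even when only one string is passed in.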
You can even use Perl named capture groups, like this:
m <- regexpr('^Name: *(?<name>.+), *College: (?<college>.+) *$',
             vectorOfStrings, perl=T)
Then attr(m, 'capture.names') will be c('name', 'college'), and colnames(attr(m, 'capture.start')) and colnames(attr(m, 'capture.length')) will also be c('name', 'college').
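One way those column names could be put to use is to pull each group out by name and assemble a data frame. This is only a sketch: the sample vectorOfStrings is my reconstruction from the output shown earlier, and the variable names (len, parsed) are made up for illustration.

```r
# Assumed input, reconstructed from the output shown earlier.
vectorOfStrings <- c("Name: Andrew, College: Bradford",
                     "Name: Charlie Daniels, College: Easton College",
                     "Name: Frank Gehry, III, College: Highlands University")

m <- regexpr("^Name: *(?<name>.+), *College: (?<college>.+) *$",
             vectorOfStrings, perl = TRUE)

st  <- attr(m, "capture.start")
len <- attr(m, "capture.length")

# Index the start/length matrices by group name instead of column number.
name    <- substr(vectorOfStrings, st[, "name"],
                  st[, "name"] + len[, "name"] - 1)
college <- substr(vectorOfStrings, st[, "college"],
                  st[, "college"] + len[, "college"] - 1)

parsed <- data.frame(name = name, college = college,
                     stringsAsFactors = FALSE)
print(parsed)
```

Indexing by name means the extraction code does not need to change if the groups are later reordered in the pattern.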
Either way, the key seems to be using perl=T; otherwise regexpr does not return a set of start/end positions for each capturing bracket.
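A quick way to check this for yourself is to run the same pattern with and without perl = TRUE and see whether the capture attributes are present. The sample string is an assumption reconstructed from the output shown earlier; withPerl and withoutPerl are just illustrative names.

```r
# Assumed sample string, reconstructed from the output shown earlier.
x <- "Name: Andrew, College: Bradford"
pattern <- "^Name: *(.+), *College: (.+) *$"

withPerl    <- regexpr(pattern, x, perl = TRUE)
withoutPerl <- regexpr(pattern, x)

# Both calls find the match at position 1 ...
print(as.integer(withPerl))
print(as.integer(withoutPerl))

# ... but only the perl = TRUE result carries the capture attributes.
hasCapturesPerl    <- !is.null(attr(withPerl, "capture.start"))
hasCapturesDefault <- !is.null(attr(withoutPerl, "capture.start"))
print(c(perl = hasCapturesPerl, default = hasCapturesDefault))
```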