3

Suppose I have a vector of strings like this:

vectorOfStrings <- c("Name: Andrew, College: Bradford",
                     "Name: Charlie Daniels, College: Easton College",
                     "Name: Frank Gehry, III, College: Highlands University")

where there is an obvious repeating pattern of "Name:" and "College:" labels.

I would like to produce a list (or data.frame) like this:

listOfValues <- list(c("Andrew", "Charlie Daniels", "Frank Gehry, III"),
                     c("Bradford", "Easton College", "Highlands University"))

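What is the most straightforward way to get from vectorOfStrings to listOfValues? I am fairly familiar with the base string manipulation functions as well as stringr, but I imagine this is a relatively common situation and I am hoping there is a well-established solution.

Thanks in advance.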

4

5 Answers

4

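Here are two possible solutions:

(1) strapplyc. The mat statement creates a matrix whose first column contains the names and whose second column contains the colleges. The last statement then converts it to an unnamed list: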

library(gsubfn)

pat <- "Name: (.*), College: (.*)"
mat <- strapplyc(vectorOfStrings, pat, simplify = rbind)

unname(as.list(as.data.frame(mat, stringsAsFactors = FALSE)))

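(2) gsub/read.table. A variation that uses only base R is to use gsub with the pat above to convert each input string into a pipe-delimited string that contains the data but not the labels. Reading that with read.table gives a data frame, DF. Finally, we convert DF to an unnamed list: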

g <- gsub(pat, "\\1|\\2", vectorOfStrings)
DF <- read.table(text = g, sep = "|", as.is = TRUE)

unname(as.list(DF))
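
Since the question also allows a data.frame, here is a minimal sketch of the same gsub/read.table idea that keeps the data frame and labels its columns. The column names name and college are my own choice, and the quote = "" and comment.char = "" settings are an assumption that you want apostrophes and "#" in the values read literally. It still assumes no value contains a "|".

```r
vectorOfStrings <- c("Name: Andrew, College: Bradford",
                     "Name: Charlie Daniels, College: Easton College",
                     "Name: Frank Gehry, III, College: Highlands University")

pat <- "Name: (.*), College: (.*)"

# Rewrite each string as "name|college", dropping the labels.
g <- gsub(pat, "\\1|\\2", vectorOfStrings)

# Read the pipe-delimited text; column names are illustrative.
# quote = "" and comment.char = "" stop ' and # from being treated specially.
DF <- read.table(text = g, sep = "|", quote = "", comment.char = "",
                 col.names = c("name", "college"),
                 stringsAsFactors = FALSE)

DF$name
DF$college
```

If you need the unnamed list from the question instead, unname(as.list(DF)) works on this DF as well.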

Added: a second solution.

answered 2013-02-26T01:50:23.673
3

I like @mathematical.coffee's idea, but since I had already written this, here is another possibility:

X <- strsplit(vectorOfStrings, ",\\s*(?=College:)", perl=TRUE)
do.call(rbind, lapply(X, function(X) gsub("(Name|College):\\s*", "", X)))
#      [,1]               [,2]                  
# [1,] "Andrew"           "Bradford"            
# [2,] "Charlie Daniels"  "Easton College"      
# [3,] "Frank Gehry, III" "Highlands University"
answered 2013-02-26T01:30:46.607
2
  do.call(rbind, strsplit(unlist(
            strsplit(vectorOfStrings, "Name: ")), ", College: "))

       [,1]               [,2]                  
  [1,] "Andrew"           "Bradford"            
  [2,] "Charlie Daniels"  "Easton College"      
  [3,] "Frank Gehry, III" "Highlands University"


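There seem to be plenty of good answers already. Similar to @Josh O'Brien, I would use strsplit.

Since you are not keeping "Name" and "College", you can split on them directly. Then you just wrap the result in do.call(rbind, ___), which automatically drops the empty pieces created by the split.

answered 2013-02-26T05:18:58.450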
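
To see why the empty pieces vanish, here is a small step-by-step sketch of the same one-liner. The intermediate names pieces, parts and result are mine, added only for illustration.

```r
vectorOfStrings <- c("Name: Andrew, College: Bradford",
                     "Name: Charlie Daniels, College: Easton College",
                     "Name: Frank Gehry, III, College: Highlands University")

# Splitting on "Name: " leaves an empty string in front of every record.
pieces <- unlist(strsplit(vectorOfStrings, "Name: "))

# The second split turns each "" into character(0) and each record
# into a length-2 vector.
parts <- strsplit(pieces, ", College: ")
lengths(parts)
# [1] 0 2 0 2 0 2

# rbind() ignores zero-length vectors, so only the real rows remain.
result <- do.call(rbind, parts)
dim(result)
# [1] 3 2
```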
1

I do this sort of thing with regexpr (with perl=T); I have not yet found a nicer way to extract capture groups:

m <- regexpr('^Name: *(.+), *College: (.+) *$',
             vectorOfStrings, perl=T)
# m looks like this:
# [1] 1 1 1
# attr(,"match.length")
# [1] 31 46 53
# attr(,"useBytes")
# [1] TRUE
# attr(,"capture.start")  # one column per capturing bracket,   
# [1,] 7 24               # one row per entry in vectorOfStrings
# [2,] 7 33
# [3,] 7 34
# attr(,"capture.length")    
# [1,]  6  8
# [2,] 15 14
# [3,] 16 20
# attr(,"capture.names")
# [1] "" ""

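# Convert each capture's start and length into start/end positions.
st <- attr(m, 'capture.start')
en <- st + attr(m, 'capture.length') - 1
numCaptures <- ncol(st)

# Build one column of extracted substrings per capture group.
matches <- sapply(seq_len(numCaptures), function (i) {
    substr(vectorOfStrings, st[, i], en[, i])
})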

# matches
#     [,1]               [,2]                  
# [1,] "Andrew"           "Bradford"            
# [2,] "Charlie Daniels"  "Easton College"      
# [3,] "Frank Gehry, III" "Highlands University"

Now massage matches into whatever form you want. I usually wrap this in a function because I use it so often.

You can even use Perl named capture groups, like this:

m <- regexpr('^Name: *(?<name>.+), *College: (?<college>.+) *$',
             vectorOfStrings, perl=T)

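Then attr(m, 'capture.names') will be c('name', 'college'), and colnames(attr(m, 'capture.(start or length)')) will also be c('name', 'college').

In any case, the key seems to be using perl=T; otherwise regexpr does not return a set of start/end points for each capturing bracket.

answered 2013-02-26T01:23:51.917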
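
For what it is worth, the wrapper function mentioned above could look roughly like this. It is only a sketch: the name extractCaptures and its arguments are hypothetical, and returning NA for strings that do not match is my own assumption about what is useful.

```r
# Hypothetical helper: returns a character matrix with one row per input
# string and one column per capture group in `pattern`.
extractCaptures <- function(x, pattern) {
  m  <- regexpr(pattern, x, perl = TRUE)
  st <- attr(m, "capture.start")
  en <- st + attr(m, "capture.length") - 1

  # substring() recycles x over the start/end matrices (column by column).
  out <- matrix(substring(x, st, en), ncol = ncol(st))

  # Strings that did not match get NA rather than "".
  noMatch <- as.vector(m) == -1L
  out[noMatch, ] <- NA_character_

  # Use the named groups as column names when the pattern has any.
  nm <- attr(m, "capture.names")
  if (any(nzchar(nm))) colnames(out) <- nm
  out
}

vectorOfStrings <- c("Name: Andrew, College: Bradford",
                     "Name: Charlie Daniels, College: Easton College",
                     "Name: Frank Gehry, III, College: Highlands University")

res <- extractCaptures(vectorOfStrings,
                       "^Name: *(?<name>.+), *College: (?<college>.+) *$")
res[, "name"]
res[, "college"]
```

From there, list(res[, "name"], res[, "college"]) or as.data.frame(res, stringsAsFactors = FALSE) gives the list or data.frame asked for in the question.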
1

Using backreferences may be simpler:

> vectorOfStrings
[1] "Name: Andrew, College: Bradford"                       "Name: Charlie Daniels, College: Easton College"       
[3] "Name: Frank Gehry, III, College: Highlands University"
> list(gsub('^Name: (.*), College: (.*)$', "\\1", vectorOfStrings), gsub('^Name: (.*), College: (.*)$', "\\2", vectorOfStrings))
[[1]]
[1] "Andrew"           "Charlie Daniels"  "Frank Gehry, III"

[[2]]
[1] "Bradford"             "Easton College"       "Highlands University"
answered 2013-02-26T01:36:32.117