3

I want to edit a string of addresses, such as this example:

test = c("[Mavlyanova, Nadira G.] Uzbek Acad Sci, GA Mavlyanov Inst Seismol, Tashkent 700135, Uzbekistan; [Markovic, Slobodan B.] Univ Novi Sad, Fac Sci, Chair Phys Geog, Novi Sad 21000, Serbia; [Rowell, G.] Univ Adelaide, Sch Chem & Phys, Adelaide, SA 5005, Australia; [Katarzynski, K.] Nicholas Copernicus Univ, Torun Ctr Astron, PL-87100 Torun, Poland; [Ansari, Z.; Boettcher, M.; Manschwetus, B.; Rottke, H.; Sandner, W.] Max Born Inst, D-12489 Berlin, Germany; [Milosevic, D. B.] Univ Sarajevo, Fac Sci, Sarajevo 71000, Bosnia & Herceg")  

I only want to get the country names. This is what I have tried so far:

> testa <- gsub("\\[.*?\\] ", "", test) #remove square brackets  
> testa <- strsplit(testa, ";", fixed = TRUE) #split addresses  
> testa <- sapply(testa, function(x) gsub("^.*, ([A-Za-z ]*)$", "\\1", x)) #keep only what's after last comma  
> testa <- gsub("^ | $", "", testa) #remove spaces  
> testa  
     [,1]  
[1,] "Uzbekistan"  
[2,] "Serbia"  
[3,] "Australia"  
[4,] "Poland"  
[5,] "Germany"  
[6,] "Univ Sarajevo, Fac Sci, Sarajevo 71000, Bosnia & Herceg"  

Unfortunately, this does not work for the last address. I would like to get the following output:

> testa  
     [,1]                                                       
[1,] "Uzbekistan"  
[2,] "Serbia"  
[3,] "Australia"  
[4,] "Poland"  
[5,] "Germany"  
[6,] "Bosnia & Herceg"  

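My questions are:

  • What is the error in my sapply function that keeps it from handling the last address correctly as well?
  • How can I improve it to get the correct output?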

4 Answers

4

Why not just work backwards?

testa <- gsub("\\[.*?\\] ", "", test)
testa <- strsplit(testa, ";", fixed = TRUE)
# Remaining steps in question are unnecessary with the solution below

> sub(".+, ([A-Za-z& ]+)$","\\1",testa[[1]])
[1] "Uzbekistan"      "Serbia"          "Australia"       "Poland"          "Germany"         "Bosnia & Herceg"
answered 2012-07-02T13:28:51.060
2

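The problem with your code is that the "everything after the last comma" part of your pattern allows only [A-Za-z ] as valid characters after that comma. That set does not include &, so the pattern fails to match the last address and no substitution is performed on it. Perhaps you should use [^,] instead, meaning "anything except a comma".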
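
As a minimal sketch of that suggestion (the shortened sample string and the variable names `addresses` and `countries` are only for illustration, and `trimws()` assumes R 3.2.0 or later), the character class can be swapped for `[^,]`:

```r
# Shortened sample of the address string from the question (illustrative)
test <- paste0(
  "[Rowell, G.] Univ Adelaide, Sch Chem & Phys, Adelaide, SA 5005, Australia; ",
  "[Milosevic, D. B.] Univ Sarajevo, Fac Sci, Sarajevo 71000, Bosnia & Herceg"
)

# Same first two steps as in the question: drop "[...] " and split on ";"
addresses <- gsub("\\[.*?\\] ", "", test)
addresses <- strsplit(addresses, ";", fixed = TRUE)[[1]]

# The original class [A-Za-z ] cannot match "&", so the second address
# does not match the pattern at all and gsub would leave it unchanged
original_matches <- grepl("^.*, ([A-Za-z ]*)$", addresses)

# [^,] accepts any character except a comma after the last ", "
countries <- sub("^.*, ([^,]*)$", "\\1", addresses)
countries <- trimws(countries)  # remove the space left over from the split
print(countries)
```

Because `.*` is greedy, the `", "` in the pattern lines up with the last comma of each address, so the captured group is whatever follows it.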

answered 2012-07-02T13:33:55.880
1

There are already some better answers here, but I had already worked through the problem, so I figured I would post anyway:

y <- unlist(strsplit(test, "\\["))
y <- y[y!=""]
z <- sapply(y, function(x) strsplit(x, ","))
lens <- sapply(z, length)
a <- sapply(seq_along(z), function(i) z[[i]][lens[i]])
a <- gsub(";", "", a)
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
Trim(a)
answered 2012-07-02T13:35:07.903
1

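Here is a one-liner using strapplyc from the gsubfn package (strapply would also work, but strapplyc is faster here). First append a ";" to test, then search for a [ (regexp "\\[") followed by a run of characters other than [ (regexp "[^[]+"), followed by a comma and a space (", "), followed by a run of characters other than a comma, semicolon or [ (regexp "([^,;[]+)"), followed by a semicolon (;), and return only the part within the parentheses: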

> library(gsubfn)
> strapplyc(paste0(test, ";"), "\\[[^[]+, ([^,;[]+);", simplify = TRUE)
     [,1]             
[1,] "Uzbekistan"     
[2,] "Serbia"         
[3,] "Australia"      
[4,] "Poland"         
[5,] "Germany"        
[6,] "Bosnia & Herceg"
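
For readers without gsubfn installed, the same pattern can be approximated in base R. This is only a sketch under assumptions, not the answer's method: the shortened sample string and the names `x`, `matched` and `countries` are illustrative, and the capture group is emulated with a second `sub()` call because `gregexpr()` returns whole matches.

```r
# Shortened sample with one multi-author entry (illustrative)
test <- paste0(
  "[Ansari, Z.; Boettcher, M.] Max Born Inst, D-12489 Berlin, Germany; ",
  "[Milosevic, D. B.] Univ Sarajevo, Fac Sci, Sarajevo 71000, Bosnia & Herceg"
)

# Append ";" so that every address, including the last, ends in a semicolon
x <- paste0(test, ";")

# Each match runs from a "[" to the last ", <text>;" before the next "["
matched <- regmatches(x, gregexpr("\\[[^[]+, [^,;[]+;", x))[[1]]

# Keep only the text between the final ", " and the closing ";"
countries <- sub("^.*, ([^,;]+);$", "\\1", matched)
print(countries)
```

The greedy `[^[]+` is what skips past the semicolons inside the author list, since it extends each match as far as possible before the next `[`.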
answered 2012-07-02T15:03:32.383