
I have a vector of data in the form ‘aaa_9999_1’ where the first part is an alpha-location code, the second is the four digit year, and the final is a unique point identifier. E.g., there are multiple sil_2007_X points, each with a different last digit. I need to split this field, using the “_” character and save only the unique ID number into a new vector. I tried:

oss$point <- unlist(strsplit(oss$id, split='_', fixed=TRUE))[3]

based on a response here: R remove part of string. I get a single response of “1”. If I just run

strsplit(oss$id, split='_', fixed=TRUE)

I can generate the split list:

> head(oss$point)
[[1]]
[1] "sil"  "2007" "1"   

[[2]]
[1] "sil"  "2007" "2"   

[[3]]
[1] "sil"  "2007" "3"   

[[4]]
[1] "sil"  "2007" "4"   

[[5]]
[1] "sil"  "2007" "5"   

[[6]]
[1] "sil"  "2007" "6"  

Adding the [3] at the end just gives me the [[3]] result: “sil” “2007” “3”. What I want is a vector of the 3rd part (the unique number) of all records. I feel like I’m close to understanding this, but it is taking too much time (like most of a day) on a deadline project. Thanks for any feedback.


2 Answers


strsplit creates a list, so I would try one of the following:

lapply(strsplit(oss$id, split='_', fixed=TRUE), `[`, 3) ## Output a list
sapply(strsplit(oss$id, split='_', fixed=TRUE), `[`, 3) ## Output a vector (even though a list is also a vector)

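The `[` is R's extraction function; passing 3 to it extracts the third element of each list component. If you prefer a vector, use sapply instead of lapply.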

Here's an example:

mystring <- c("A_B_C", "D_E_F")

lapply(strsplit(mystring, "_"), `[`, 3)
# [[1]]
# [1] "C"
# 
# [[2]]
# [1] "F"
sapply(strsplit(mystring, "_"), `[`, 3)
# [1] "C" "F"
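
Applied to the data in the question, the same idea might look like the sketch below. The oss data frame here is a small stand-in built for illustration (the real one is not shown in the question), and the as.character() call only matters if the id column was read in as a factor, which is an assumption.

```r
# Hypothetical stand-in for the asker's data frame; the real `oss` is not shown.
oss <- data.frame(id = c("sil_2007_1", "sil_2007_2", "abc_2008_10"),
                  stringsAsFactors = FALSE)

# strsplit() needs a character vector; as.character() guards against factors.
parts <- strsplit(as.character(oss$id), split = "_", fixed = TRUE)

# The original attempt flattens the whole list first, so [3] picks one value.
unlist(parts)[3]
# [1] "1"

# Extract the third piece from every list element instead.
oss$point <- sapply(parts, `[`, 3)
oss$point
# [1] "1"  "2"  "10"
```

This also shows why the attempt in the question returned a single "1": unlist() merges all the pieces into one long vector, and [3] then selects only its third entry.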

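If there is an easy-to-define pattern, gsub may also be a good option, and it avoids splitting altogether. See the comments by DWin and Josh O'Brien for improved (more robust) versions.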

gsub(".*_.*_(.*)", "\\1", mystring)
# [1] "C" "F"
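
Because .* is greedy, the pattern above really captures whatever follows the last underscore, which only equals the third field when there are exactly three fields. One possible tightening is sketched below. It is an illustration written for this example, not necessarily the version proposed in the comments, and the four-field input is made up.

```r
x <- c("A_B_C", "D_E_F_G")   # second string has a hypothetical fourth field

# Greedy version: with four fields it returns the last one, not the third.
gsub(".*_.*_(.*)", "\\1", x)
# [1] "C" "G"

# Stricter sketch: every field is "anything except an underscore".
sub("^[^_]*_[^_]*_([^_]*).*$", "\\1", x)
# [1] "C" "F"
```

Note that both calls return the input unchanged when a string has fewer than two underscores, so a check on the result may still be worthwhile.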

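Finally, just for fun, you can extend the unlist approach to make it work by recycling a vector of TRUEs and FALSEs to extract every third item (since we know in advance that all the splits will have the same structure).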

unlist(strsplit(mystring, "_"), use.names = FALSE)[c(FALSE, FALSE, TRUE)]
# [1] "C" "F"
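
The recycling trick depends on every split having exactly three pieces. A small sketch of what happens otherwise, using a made-up ragged input, and of a quick check with lengths() (available in base R from version 3.2.0):

```r
ragged <- c("A_B_C", "D_E")   # hypothetical: second string has only two pieces
parts <- strsplit(ragged, "_", fixed = TRUE)

# Check that every split has three pieces before relying on recycling.
all(lengths(parts) == 3)
# [1] FALSE

# Recycling silently drops the short record...
unlist(parts, use.names = FALSE)[c(FALSE, FALSE, TRUE)]
# [1] "C"

# ...whereas per-element extraction keeps one result per input.
sapply(parts, `[`, 3)
# [1] "C" NA
```

If the result is going back into a data frame column, the per-element form is the safer choice, since its length always matches the input.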

If, instead of extracting by numeric position, you just want the last value after the delimiter, you have a few different options.

Use a greedy regular expression:

gsub(".*_(.*)", "\\1", mystring)
# [1] "C" "F"
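
To see what "greedy" does here, it may help to compare it with a lazy quantifier. This is a minimal sketch; the second input string is made up to resemble the IDs in the question, and perl = TRUE is used for the lazy form so that the Perl-style syntax is certain to be understood.

```r
x <- c("A_B_C", "sil_2007_12")

# Greedy: ".*_" swallows everything up to and including the LAST underscore.
sub(".*_", "", x)
# [1] "C"  "12"

# Lazy: ".*?_" stops at the FIRST underscore instead.
sub(".*?_", "", x, perl = TRUE)
# [1] "B_C"     "2007_12"
```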

Use the stri_extract* convenience functions from the "stringi" package:

library(stringi)
stri_extract_last_regex(mystring, "[A-Z]+")
# [1] "C" "F"
answered 2013-10-16T17:47:12.127

Is this what you need?

x = c('aaa_9999_12', 'bbb_9999_20')
ids = sapply(x, function(v){strsplit(v, '_')[[1]][3]}, USE.NAMES = FALSE)

# optional
# ids = as.numeric(ids)

This is quite inefficient, since it calls strsplit once per element; there are probably better ways.
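
One such way, sketched here as a suggestion, is to rely on strsplit already being vectorised and call it once on the whole vector. vapply is used only to state the expected result type; sapply would work too.

```r
x <- c("aaa_9999_12", "bbb_9999_20")

# strsplit() is vectorised, so a single call handles every element.
parts <- strsplit(x, "_", fixed = TRUE)

# vapply() declares that each result is exactly one character string.
ids <- vapply(parts, `[`, character(1), 3)
ids
# [1] "12" "20"

# Optional: convert to integers if the IDs are used numerically.
as.integer(ids)
# [1] 12 20
```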

answered 2013-10-16T17:48:50.090