r - In R, split a character vector by a specific character; save 3rd piece in new vector

Question

I have a vector of data in the form ‘aaa_9999_1’ where the first part is an alpha-location code, the second is the four digit year, and the final is a unique point identifier. E.g., there are multiple sil_2007_X points, each with a different last digit. I need to split this field, using the “_” character and save only the unique ID number into a new vector. I tried:

oss$point <- unlist(strsplit(oss$id, split='_', fixed=TRUE))[3]

based on a response here: R remove part of string. I get a single response of “1”. If I just run

strsplit(oss$id, split='_', fixed=TRUE)

I can generate the split list:

> head(oss$point)
[[1]]
[1] "sil"  "2007" "1"   

[[2]]
[1] "sil"  "2007" "2"   

[[3]]
[1] "sil"  "2007" "3"   

[[4]]
[1] "sil"  "2007" "4"   

[[5]]
[1] "sil"  "2007" "5"   

[[6]]
[1] "sil"  "2007" "6"

Adding the [3] at the end just gives me the [[3]] result: “sil” “2007” “3”. What I want is a vector of the 3rd part (the unique number) of all records. I feel like I’m close to understanding this, but it is taking too much time (like most of a day) on a deadline project. Thanks for any feedback.

score 16 · Accepted Answer

strsplit creates a list, so I would try the following:

lapply(strsplit(oss$id, split='_', fixed=TRUE), `[`, 3) ## Output a list
sapply(strsplit(oss$id, split='_', fixed=TRUE), `[`, 3) ## Output a vector (even though a list is also a vector)

The `[` means "extract", here the third element of each list item. If you prefer a vector, substitute lapply with sapply.

Here is an example:

mystring <- c("A_B_C", "D_E_F")

lapply(strsplit(mystring, "_"), `[`, 3)
# [[1]]
# [1] "C"
# 
# [[2]]
# [1] "F"
sapply(strsplit(mystring, "_"), `[`, 3)
# [1] "C" "F"
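As a small sketch of what the `[` argument is doing: it is an ordinary function, so `[`(x, 3) is the same call as x[3]. The ids below are made up for illustration, and the last one is deliberately malformed to show what happens when a string has fewer than three pieces. vapply is used here as a stricter alternative to sapply, since it checks that every result is a single character string.

```r
# Hypothetical ids; the last one is missing its third piece on purpose
ids <- c("sil_2007_1", "sil_2007_2", "sil_2007")

pieces <- strsplit(ids, "_", fixed = TRUE)

# `[` is a function: these two expressions are the same call
same_call <- identical(`[`(pieces[[1]], 3), pieces[[1]][3])

# vapply() passes the extra argument 3 on to `[` and enforces that
# each result is exactly one character value
third <- vapply(pieces, `[`, character(1), 3)
third
# [1] "1" "2" NA
```

Indexing past the end of a character vector returns NA rather than an error, so malformed ids show up as NA in the result.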

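If there is an easily definable pattern, gsub might also be a good option, and it avoids splitting. See the comments by DWin and Josh O'Brien for improved (more robust) versions.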

gsub(".*_.*_(.*)", "\\1", mystring)
# [1] "C" "F"
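One possible way to tighten the pattern is sketched below. This is not necessarily what the comments mentioned above suggest; it is only an illustration, and the third input string is an invented malformed value. The pattern is anchored and uses [^_] so that each piece cannot contain an underscore. Because sub and gsub return the input unchanged when nothing matches, non-matching strings are marked as NA explicitly.

```r
mystring <- c("A_B_C", "D_E_F", "no-underscores")

# Anchored pattern: exactly three pieces, none of which contains "_"
pattern <- "^[^_]*_[^_]*_([^_]*)$"

# Without this check, sub() would silently return "no-underscores" as is
third <- ifelse(grepl(pattern, mystring),
                sub(pattern, "\\1", mystring),
                NA_character_)
third
# [1] "C" "F" NA
```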

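Finally, just for fun, you can extend the unlist approach to make it work by recycling a vector of TRUEs and FALSEs to extract every third item (since we know in advance that every split produces the same structure).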

unlist(strsplit(mystring, "_"), use.names = FALSE)[c(FALSE, FALSE, TRUE)]
# [1] "C" "F"
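The recycling trick depends on every string splitting into exactly three pieces. A minimal sketch of guarding that assumption, using lengths (available in R 3.2.0 and later) and falling back to per-element extraction otherwise:

```r
mystring <- c("A_B_C", "D_E_F")
pieces <- strsplit(mystring, "_", fixed = TRUE)

if (all(lengths(pieces) == 3)) {
  # Every split has 3 pieces, so every third item of the flattened
  # vector is the piece we want
  third <- unlist(pieces, use.names = FALSE)[c(FALSE, FALSE, TRUE)]
} else {
  # Uneven splits would misalign the recycled index; extract per element
  third <- vapply(pieces, `[`, character(1), 3)
}
third
# [1] "C" "F"
```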

If you are not extracting by numeric position, but just want the last value after a delimiter, you have a few different alternatives.

Use a greedy regex:

gsub(".*_(.*)", "\\1", mystring)
# [1] "C" "F"

Use a convenience function from the stri_extract* family in the "stringi" package:

library(stringi)
stri_extract_last_regex(mystring, "[A-Z]+")
# [1] "C" "F"
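If installing a package is not an option, the same "last piece" extraction can be sketched in base R. These two variants are offered as illustrations, not as the answer's own method:

```r
mystring <- c("A_B_C", "D_E_F")

# Delete everything up to and including the last underscore
# (the greedy .* consumes as much as it can before the final "_")
last_by_sub <- sub(".*_", "", mystring)

# Or pull out the trailing run of non-underscore characters
last_by_match <- regmatches(mystring, regexpr("[^_]+$", mystring))

last_by_sub
# [1] "C" "F"
last_by_match
# [1] "C" "F"
```

Note that regmatches drops elements with no match, so its result can be shorter than the input if some strings end in an underscore or are empty.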

score 0 · Accepted Answer

Is this what you need?

x = c('aaa_9999_12', 'bbb_9999_20')
ids = sapply(x, function(v){strsplit(v, '_')[[1]][3]}, USE.NAMES = FALSE)

# optional
# ids = as.numeric(ids)

This is pretty inefficient, and there may be better ways.
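One way to avoid calling strsplit once per element is to rely on the fact that strsplit is already vectorized over its input. The sketch below splits the whole vector in one call and then stores the result in a new column; the data frame is a made-up stand-in for the asker's oss, and the conversion to integer assumes every id ends in digits.

```r
x <- c("aaa_9999_12", "bbb_9999_20")

# Split the whole vector once, then take piece 3 from each result
ids <- vapply(strsplit(x, "_", fixed = TRUE), `[`, character(1), 3)

# Hypothetical data frame standing in for `oss` from the question
oss <- data.frame(id = x, stringsAsFactors = FALSE)

# Optional: store the point identifier as an integer
oss$point <- as.integer(ids)
oss$point
# [1] 12 20
```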
