
I have a data frame sp which contains several species names, but because they come from different databases, they are written in different ways.

For example, one species can appear as both Urtica dioica and Urtica dioica L.

To correct this, I use the following code, which extracts only the first two words from a row:

paste(strsplit(sp[i,"sp"]," ")[[1]][1],strsplit(sp[i,"sp"]," ")[[1]][2],sep=" ")

For now, this code is integrated in a for loop, which works but takes ages to finish:

for (i in seq_along(sp$sp)) {
    sp[i,"sp2"] = paste(strsplit(sp[i,"sp"]," ")[[1]][1],
                        strsplit(sp[i,"sp"]," ")[[1]][2],
                        sep=" ")
}

Is there a way to improve this basic code using vectors or an apply function?


3 Answers


You can simply use a vectorized regular expression function:

library(stringr)
x <- c("Urtica dioica", "Urtica dioica L.")
str_extract(string = x, "\\w+ \\w+")
# [1] "Urtica dioica" "Urtica dioica"

I happen to find stringr convenient here, but with the right regular expression for your particular data you could use base functions such as gsub.
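As a rough illustration of the base-R route mentioned above, here is a minimal sketch using sub. The pattern is an assumption about the data: it treats a "word" as any run of non-whitespace characters, which may or may not suit every name in sp.

```r
x <- c("Urtica dioica", "Urtica dioica L.")

# Capture the first two whitespace-separated tokens and drop whatever follows.
# Strings that do not match (e.g. a single word) are returned unchanged.
first_two <- sub("^(\\S+\\s+\\S+).*$", "\\1", x, perl = TRUE)
first_two
```

Because sub is vectorized over x, no loop is needed; the same call should work on a whole column such as sp$sp.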

answered 2014-07-24T15:29:47.197

Before each extraction, you may want to check whether the string contains more than 2 words:

if((sapply(gregexpr("\\W+", i), length) + 1) > 2){
    ...
}
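
The same check can probably be done for the whole vector at once instead of inside a loop. The sketch below is a variation, not the code above: it counts whitespace-separated tokens with strsplit rather than counting `\\W+` matches, and sp_names is a hypothetical stand-in for the sp column.

```r
# Hypothetical example names; replace with sp$sp in practice.
sp_names <- c("Urtica dioica", "Urtica dioica L.", "Urtica")

# Number of whitespace-separated tokens in each string.
n_words <- lengths(strsplit(sp_names, "\\s+"))

# TRUE for entries that carry more than two words and would need trimming.
needs_trim <- n_words > 2
needs_trim
```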
answered 2014-07-24T15:16:53.327

There is a function for that.

It is also from stringr: the word function.

> choices <- c("Urtica dioica", "Urtica dioica L..") 
> library(stringr)
> word(choices, 1:2)
# [1] "Urtica" "dioica"
> word(choices, rep(1:2, 2))
# [1] "Urtica" "dioica" "Urtica" "dioica"

These return the words as separate strings. To get two strings, each containing both names:

> word(choices, 1, 2)
# [1] "Urtica dioica" "Urtica dioica"

The last line takes the first two words from each string in the vector choices.
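
Applied to the question's data frame, this might look like the sketch below. The small sp built here is a hypothetical stand-in for the real data; only the column names sp and sp2 come from the question.

```r
library(stringr)

# Hypothetical stand-in for the data frame from the question.
sp <- data.frame(
  sp = c("Urtica dioica", "Urtica dioica L.", "Bellis perennis L."),
  stringsAsFactors = FALSE
)

# word() is vectorized, so one call fills the whole column
# and replaces the for loop.
sp$sp2 <- word(sp$sp, 1, 2)
sp
```

Entries with fewer than two words are worth checking separately before relying on the result.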

answered 2014-07-24T15:59:12.747