r - In R, find the column that contains a string, for each row

Question

I must be using the wrong search terms, because I can't believe my problem is unique, yet I have found only one similar question.

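I have some rather unwieldy data from the World Bank: a flat file that represents a database. The data has one project per row, but each project has multiple characteristics, which are conveniently stored in columns with names like "SECTOR.1", with their own attributes in other columns with names like "SECTOR.1.PCT", and so on.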

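From this, I am trying to extract the data related to a particular type of SECTOR, but I still need to keep all of the other project information.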

I have been able to take a few steps in the right direction, thanks to another question I found on SO: Find the index of the column in data frame that contains the string as value

Following the example from that question, here is a minimal reproducible example:

> df <- data.frame(col1 = c(letters[1:4],"c"), 
...                  col2 = 1:5, 
...                  col3 = c("a","c","l","c","l"), 
...                  col4= letters[3:7])
> df
  col1 col2 col3 col4
1    a    1    a    c
2    b    2    c    d
3    c    3    l    e
4    d    4    c    f
5    c    5    l    g

The output I want looks like this:

1 col4
2 col3
3 col1
4 col3
5 col1

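I know I could use a nested ifelse, but that doesn't seem like a very elegant approach. Granted, since this is something I will do only once (for this project), the risk of typos is small. For example: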

> df$hasc <- ifelse(grepl("c",df$col1), "col1",
...                         ifelse(grepl("c",df$col2), "col2",
...                                ifelse(grepl("c",df$col3), "col3",
...                                       ifelse(grepl("c",df$col4), "col4",
...                                              NA))))
> df
  col1 col2 col3 col4 hasc
1    a    1    a    c col4
2    b    2    c    d col3
3    c    3    l    e col1
4    d    4    c    f col3
5    c    5    l    g col1

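I think it would be better if I had some kind of apply function that could look across the columns row by row. The approach from the earlier question doesn't work here, because I need to know which column has the "c". Besides listing the names of the columns that contain a "c", it gives me something that makes no sense to me. I don't understand the 1, 3, 4, because it doesn't correspond to row names or to counts: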

> which(apply(df, 2, function(x) any(grepl("c", x))))
col1 col3 col4
   1    3    4

And if I try it row-wise, I do see that every row contains a "c", as expected.

> which(apply(df, 1, function(x) any(grepl("c", x))))
[1] 1 2 3 4 5

ALSO -> I would like to know whether there is a way to handle the case where a row has a "c" in more than one column, for example if we have:

> df <- data.frame(col1 = c(letters[1:4],"c"), 
...                  col2 = 1:5, 
...                  col3 = c("a","c","l","c","c"), 
...                  col4= letters[3:7])
> df
  col1 col2 col3 col4
1    a    1    a    c
2    b    2    c    d
3    c    3    l    e
4    d    4    c    f
5    c    5    c    g

Then my ifelse approach fails, because it returns only "col1" for row 5.

score 5 · Accepted Answer

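Assuming each row of the dataset 'df' contains a single 'c', we can use max.col to get the index of the column whose element is 'c' in each row, and use that index to pick the matching column name.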

df$hasc <- colnames(df)[max.col(df=='c')]
df
#  col1 col2 col3 col4 hasc
#1    a    1    a    c col4
#2    b    2    c    d col3
#3    c    3    l    e col1
#4    d    4    c    f col3
#5    c    5    l    g col1
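
One caveat worth keeping in mind: max.col breaks ties at random by default, so a row with several matches (or with none at all) can come back with an arbitrary column. Below is a minimal sketch of a more defensive variant, assuming you want the first matching column and NA where nothing matches. The names hit, idx and first_hit are illustrative and not part of the answer above.

```r
# Second example data frame from the question (row 5 has "c" in col1 and col3)
df <- data.frame(col1 = c(letters[1:4], "c"),
                 col2 = 1:5,
                 col3 = c("a", "c", "l", "c", "c"),
                 col4 = letters[3:7],
                 stringsAsFactors = FALSE)

hit <- df == "c"                              # logical matrix, TRUE where a cell equals "c"
idx <- max.col(hit, ties.method = "first")    # leftmost TRUE in each row
idx[rowSums(hit) == 0] <- NA                  # assumption: rows without any match become NA
first_hit <- colnames(df)[idx]
# first_hit holds "col4", "col3", "col1", "col3", "col1" (row 5 keeps only its first match)
```

With ties.method = "first" the result is deterministic, which matters once a row can match in more than one column.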

If a row can contain more than one 'c', one option is to loop over the rows and paste the multiple column names together:

df$hasc <- apply(df=='c', 1, FUN= function(x) toString(names(x)[x]))
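
As a quick check, here is that line applied to the second example data frame from the question, where row 5 has a 'c' in both col1 and col3. This is only a sketch; unname() is used here just to drop the row names that apply attaches to its result.

```r
df <- data.frame(col1 = c(letters[1:4], "c"),
                 col2 = 1:5,
                 col3 = c("a", "c", "l", "c", "c"),
                 col4 = letters[3:7],
                 stringsAsFactors = FALSE)

# For each row, keep the names of the columns whose cell equals "c"
# and collapse them into one comma-separated string.
hasc <- unname(apply(df == "c", 1, function(x) toString(names(x)[x])))
df$hasc <- hasc
# rows 1-4: "col4", "col3", "col1", "col3"; row 5: "col1, col3"
```

A row with no match would get an empty string here, since toString() of a zero-length vector is "".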

score 2 · Accepted Answer

An alternative for the multiple-match case, which may be a bit faster than running apply:

tmp <- which(df=="c", arr.ind=TRUE)
cnt <- ave(tmp[,"row"], tmp[,"row"], FUN=seq_along)
maxnames <- paste0("max",sequence(max(cnt)))
df[maxnames] <- NA
df[maxnames][cbind(tmp[,"row"],cnt)] <- names(df)[tmp[,"col"]]

#  col1 col2 col3 col4 max1 max2
#1    a    1    a    c col4 <NA>
#2    b    2    c    d col3 <NA>
#3    c    3    l    e col1 <NA>
#4    d    4    c    f col3 <NA>
#5    c    5    c    g col1 col3
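
To make the steps above easier to follow, here is a sketch that inspects the intermediate objects for the question's second data frame. The comments describe what each object holds. The name n_new is introduced only for illustration.

```r
df <- data.frame(col1 = c(letters[1:4], "c"),
                 col2 = 1:5,
                 col3 = c("a", "c", "l", "c", "c"),
                 col4 = letters[3:7],
                 stringsAsFactors = FALSE)

# One line per matching cell, with columns "row" and "col".
# Matches are listed column by column: (3,1) (5,1) (2,3) (4,3) (5,3) (1,4)
tmp <- which(df == "c", arr.ind = TRUE)

# Number the matches within each row: 1 1 1 1 2 1
# (the 2 marks row 5's second match, the one in col3)
cnt <- ave(tmp[, "row"], tmp[, "row"], FUN = seq_along)

# The largest count is the number of new columns needed (max1, max2)
n_new <- max(cnt)
```

The pair (row, cnt) is then used as a matrix index into the new columns, so each match lands in the next free "max" slot of its row.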
