r - How to find the row number of the minimum for each factor level in R?

Question

I have a data frame, called A, that looks like this:

GroupID  Dist1   Dist2 ...
1        4       4 
1        5       4 
1        3       16 
2        0       4 
2        7       2 
2        8       0 
2        6       4 
2        7       4 
2        8       2 
3        7       4 
3        5       6
...

GroupID is a factor; Dist1 and Dist2 are integers.

I have a derived data frame, SummaryA:

GroupID  AveD1  AveD2 ...
1        4       8 
2        6       2
3        6       5
...

For each GroupID, I need to find the ROW NUMBER of the row with the minimum value, for further manipulation and to pull data into my summary set. For example, I need:

GroupID  MinRowD1  
1        3 
2        4 
3        11

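In the case of ties, it doesn't matter which row I pick, but I don't know how to get this. I can't use which(), because it doesn't operate well on factors, and I can't use ave(FUN = min), because I need the position, not the minimum value. If I match against the minimum for each group, I can get multiple matches, which messes things up.

Any suggestions on how to do this?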

score 7 · Accepted Answer

Use by with the rownames of your data:

> dat$row <- 1:nrow(dat)
>  by(dat,dat$GroupID,FUN = function(x) rownames(x)[which.min(x$Dist1)])
dat$GroupID: 1
[1] "3"
---------------------------------------------------------------------------------------- 
dat$GroupID: 2
[1] "4"
---------------------------------------------------------------------------------------- 
dat$GroupID: 3
[1] "11"

Here I assume that dat is:

dat <- read.table(text = 'GroupID  Dist1   Dist2
1        4       4 
1        5       4 
1        3       16 
2        0       4 
2        7       2 
2        8       0 
2        6       4 
2        7       4 
2        8       2 
3        7       4 
3        5       6', header = TRUE)
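
To get the result in the shape the question asks for (one row per group with a MinRowD1 column), the by() result can be collapsed into a data frame. This is only a minimal sketch under the same assumed dat; the name min_rows is illustrative and not part of the original answer.

```r
dat <- read.table(text = 'GroupID  Dist1   Dist2
1        4       4
1        5       4
1        3       16
2        0       4
2        7       2
2        8       0
2        6       4
2        7       4
2        8       2
3        7       4
3        5       6', header = TRUE)

# which.min() gives a position within each group; indexing that group's own
# row names turns it back into a row number of the full data frame.
res <- by(dat, dat$GroupID,
          FUN = function(x) as.integer(rownames(x)[which.min(x$Dist1)]))

# Collapse the "by" object into an ordinary two-column data frame.
min_rows <- data.frame(GroupID = names(res), MinRowD1 = as.vector(res))
min_rows
```

This relies on dat still having its default row names (1, 2, 3, ...); with custom row names, a separate index column is needed instead.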

Edit: another solution, using the data.table package

I think data.table provides a more elegant solution:

library(data.table)

dat$row <- 1:nrow(dat)
dtb <- as.data.table(dat)
dtb[, .SD[which.min(Dist1)], by = c('GroupID')]
   GroupID Dist1 Dist2 row
1:       1     3    16   3
2:       2     0     4   4
3:       3     5     6  11

Edit 1: get the row number without creating a row column (following @Arun's comment)

dtb[, {i = which.min(Dist1); list(Dist1=Dist1[i], 
    Dist2=Dist2[i], rowNew=.I[i])}, by=GroupID]

   GroupID Dist1 Dist2 rowNew
1:       1     3    16   3
2:       2     0     4   4
3:       3     5     6  11

score 5 · Answer

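Here is a base R solution. The basic idea is to split the data by GroupID, find the row with the minimum in each piece, and then put the results back together. Some people find the plyr functions a more intuitive way to do this; I'm sure a solution using one of them will appear shortly...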

A$row <- 1:nrow(A)
As <- split(A, A$GroupID)
sapply(As, function(Ai) {Ai$row[which.min(Ai$Dist1)]})

For large data sets, split is faster when applied to a plain vector of row indices rather than to the whole data frame, like this:

rows <- split(1:nrow(A), A$GroupID)
sapply(rows, function(rowi) {rowi[which.min(A$Dist1[rowi])]})
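
Once the row numbers are known, they can be used directly to pull the matching rows out of A, which is what the question ultimately wants for its summary. The following is a sketch, assuming A holds the example data from the question; vapply() is swapped in for sapply() only to guarantee an integer result, and min_idx is an illustrative name.

```r
A <- read.table(text = 'GroupID  Dist1   Dist2
1        4       4
1        5       4
1        3       16
2        0       4
2        7       2
2        8       0
2        6       4
2        7       4
2        8       2
3        7       4
3        5       6', header = TRUE)

# Split the row indices (not the data frame) by group.
rows <- split(seq_len(nrow(A)), A$GroupID)

# For each group, keep the index whose Dist1 is smallest (first one on ties).
min_idx <- vapply(rows, function(rowi) rowi[which.min(A$Dist1[rowi])], integer(1))

# Use the indices to extract the full rows holding each group's minimum.
A[min_idx, ]
```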

score 3 · Answer

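Assuming dat from @agstudy's answer, aggregate() is a nice base function that easily does what you want. (This answer uses which.min(), which has interesting behaviour when more than one value in the input vector takes the minimum. See the warning at the end!) For example: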

aggregate(cbind(Dist1, Dist2) ~ GroupID, data = dat, FUN = which.min)

> aggregate(cbind(Dist1, Dist2) ~ GroupID, data = dat, FUN = which.min)
  GroupID Dist1 Dist2
1       1     3     1
2       2     1     3
3       3     2     1

To get the row IDs, or the row names, we can do this (after adding some row names to the example):

rownames(dat) <- letters[seq_len(nrow(dat))] ## add rownames for effect

## function, pull out for clarity
foo <- function(x, rn) rn[which.min(x)]
## apply via aggregate
aggregate(cbind(Dist1, Dist2) ~ GroupID, data = dat, FUN = foo,
          rn = rownames(dat))

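This gives the output below. Note that aggregate() passes FUN only the values within each group, so which.min(x) is a within-group position and rn[which.min(x)] indexes the full vector of row names with it. The letters therefore encode positions within each group (c = 3rd, a = 1st, b = 2nd), not the actual row names of the minima (for example, the minimum Dist1 in group 2 is in row d).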

>     rownames(dat) <- letters[seq_len(nrow(dat))] ## add rownames for effect
> 
>     ## function, pull out for clarity
>     foo <- function(x, rn) rn[which.min(x)]
>     ## apply via aggregate
>     aggregate(cbind(Dist1, Dist2) ~ GroupID, data = dat, FUN = foo,
+               rn = rownames(dat))
  GroupID Dist1 Dist2
1       1     c     a
2       2     a     c
3       3     b     a
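
Because FUN only sees the values inside one group, the letters above are within-group positions mapped onto the first few row names. One way to recover the actual row names with aggregate() is to aggregate over a column of row indices instead. This is a sketch only; the helper column idx and the result name res are assumptions, not part of the original answer.

```r
dat <- read.table(text = 'GroupID  Dist1   Dist2
1        4       4
1        5       4
1        3       16
2        0       4
2        7       2
2        8       0
2        6       4
2        7       4
2        8       2
3        7       4
3        5       6', header = TRUE)
rownames(dat) <- letters[seq_len(nrow(dat))]  # row names a..k, as in the answer

# Helper column (hypothetical): the global row index of every row.
dat$idx <- seq_len(nrow(dat))

# FUN now receives the global indices of one group, so the selected value
# is a row number of dat rather than a within-group position.
res <- aggregate(idx ~ GroupID, data = dat,
                 FUN = function(i) i[which.min(dat$Dist1[i])])

# Translate the row numbers into row names.
res$rowname <- rownames(dat)[res$idx]
res
```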

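I find the aggregate() output nicer than that of by(), and the formula interface (whilst not the most efficient way of using it) is certainly very intuitive.

Warning

which.min() is great as long as the minimum is not duplicated. If it is, which.min() selects the first of the values that take the minimum. Alternatively, there is the which(x == min(x)) idiom, but then any solution has to handle the fact that there may be several tied minima.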

dat2 <- dat
dat2 <- rbind(dat2, data.frame(GroupID = 1, Dist1 = 3, Dist2 = 8))

aggregate(cbind(Dist1, Dist2) ~ GroupID, data = dat2, FUN = which.min)

This misses the duplicate:

> aggregate(cbind(Dist1, Dist2) ~ GroupID, data = dat2, FUN = which.min)
  GroupID Dist1 Dist2
1       1     3     1
2       2     1     3
3       3     2     1

Contrast that with the which(x == min(x)) idiom:

out <- aggregate(cbind(Dist1, Dist2) ~ GroupID, data = dat2,
          FUN = function(x) which(x == min(x)))
> (out <- aggregate(cbind(Dist1, Dist2) ~ GroupID, data = dat2,
+                   FUN = function(x) which(x == min(x))))
  GroupID Dist1 Dist2
1       1  3, 4  1, 2
2       2     1     3
3       3     2     1

While the printed output from which(x == min(x)) is appealing, the object itself is somewhat more complex: it is a data frame with lists as components:

> str(out)
'data.frame':   3 obs. of  3 variables:
 $ GroupID: num  1 2 3
 $ Dist1  :List of 3
  ..$ 0: int  3 4
  ..$ 1: int 1
  ..$ 2: int 2
 $ Dist2  :List of 3
  ..$ 0: int  1 2
  ..$ 1: int 3
  ..$ 2: int 1
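
Since the question says any one of the tied rows will do, one option is to collect all tied row numbers per group in an ordinary list and then take the first of each. The sketch below assumes dat2 is built as above from the plain dat (default row names, no extra columns); the names ties and first_min are illustrative.

```r
dat <- read.table(text = 'GroupID  Dist1   Dist2
1        4       4
1        5       4
1        3       16
2        0       4
2        7       2
2        8       0
2        6       4
2        7       4
2        8       2
3        7       4
3        5       6', header = TRUE)
dat2 <- rbind(dat, data.frame(GroupID = 1, Dist1 = 3, Dist2 = 8))

# Global row numbers of every row that ties for the minimum Dist1, per group.
ties <- lapply(split(seq_len(nrow(dat2)), dat2$GroupID),
               function(i) i[dat2$Dist1[i] == min(dat2$Dist1[i])])

# Resolve ties by keeping the first tied row in each group.
first_min <- vapply(ties, `[`, integer(1), 1)
```

Keeping ties as a plain list avoids the list-column data frame, and the tie-breaking rule is explicit and easy to change.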

score 2 · Answer

Assuming dFrame contains your data:

 install.packages('plyr')
 library('plyr')

Try this:

 dFrame$GroupID <- as.numeric(as.character(dFrame$GroupID)) ## convert factor labels (not level codes) to numeric
 dFrame$row_name <- 1:nrow(dFrame) ## record the original row numbers before sorting
 dFrame <- arrange(dFrame, Dist1) ## sort by Dist1 so each group's minimum comes first

 ## the first row_name within each group is now the row of the minimum Dist1
 minRows <- tapply(dFrame$row_name, dFrame$GroupID, FUN = function(x) x[1])
 newFrame <- data.frame(GroupID = as.numeric(names(minRows)), MinRowD1 = as.numeric(minRows))
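
The same sort-then-take-first idea can be sketched in base R without plyr, using order() and duplicated(). This assumes the example data is stored in dat as in the first answer (rather than dFrame); ord and first are illustrative names.

```r
dat <- read.table(text = 'GroupID  Dist1   Dist2
1        4       4
1        5       4
1        3       16
2        0       4
2        7       2
2        8       0
2        6       4
2        7       4
2        8       2
3        7       4
3        5       6', header = TRUE)

# Row numbers sorted by group, then by Dist1 within each group.
ord <- order(dat$GroupID, dat$Dist1)

# The first row of each group in that ordering holds the group's minimum.
first <- ord[!duplicated(dat$GroupID[ord])]

data.frame(GroupID = dat$GroupID[first], MinRowD1 = first)
```

Because order() returns original row numbers, there is no need to add and carry a separate row-number column through the sort.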

score 1 · Answer

A bit convoluted, but this should do the trick:

x <- data.frame(GroupID=rep(1:3,each=3),Dist1=rpois(9,5))
x
  GroupID Dist1
1       1    10
2       1     5
3       1     3
4       2     9
5       2     9
6       2    13
7       3    10
8       3    10
9       3     4
sapply(lapply(lapply(split(x,x$GroupID),
    function(y) y[order(y$Dist1),]),head,1),rownames)
  1   2   3 
"3" "4" "9"

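score 0 · Answer

This returns the row names associated with the first minimum in each group for both columns, and returns them as a character matrix with named columns: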

do.call(rbind, 
   by(dat,dat$GroupID,FUN = function(x) c(
                               minD1=rownames(x)[which.min(x[['Dist1']])], 
                               minD2=rownames(x)[which.min(x[['Dist2']])] ) ) )
#-------------
  minD1 minD2
1 "3"   "1"  
2 "4"   "6"  
3 "11"  "10"
