
I have a data frame made up of mostly sequential rows; "mostly" because some rows are out of sequence or missing. When the next row in sequence is present right after the current row, I'd like to apply a function to data from both rows. If it's not present, I want to skip the row and move on. I know I can do this with a loop, but it's quite slow, and I suspect a faster approach would use indexing. Here is an example of my problem with sample data, plus a loop that produces the desired result.

df <- data.frame(id=1:10, x=rnorm(10))
df <- df[c(1:3, 5:10), ]
df$z <- NA


dfLoop <- function(d)
{
  for(i in 1:(nrow(d)-1))
  {
    if(d[i+1, ]$id - d[i, ]$id == 1)
    {
      d[i, ]$z = d[i+1, ]$x - d[i, ]$x
    }
  }

  return(d)
}

dfLoop(df)

So how might I be able to get the same result without using a loop? Thanks for any help.

4 Answers


Try this:

index <- which(diff(df$id)==1) #gives the index of rows that have a row below in sequence

df$z[index] <- diff(df$x)[index]

As a function:

fun <- function(x) {
  index <- which(diff(x$id)==1)
  xdiff <- diff(x$x)
  x$z[index] <- xdiff[index]
  return(x)
}

Compare with your loop:

a <- fun(df)
b <- dfLoop(df)
identical(a, b)
[1] TRUE
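
Since the question is about speed, here is a rough timing sketch comparing the two on a larger data frame. The seed, the size n and the share of dropped rows are arbitrary choices for illustration and not part of the original example; both functions are repeated so the block runs on its own, and actual timings will vary by machine.

```r
# Both functions are repeated here so the block is self-contained.
dfLoop <- function(d) {
  for (i in 1:(nrow(d) - 1)) {
    if (d[i + 1, ]$id - d[i, ]$id == 1) {
      d[i, ]$z <- d[i + 1, ]$x - d[i, ]$x
    }
  }
  d
}

fun <- function(x) {
  index <- which(diff(x$id) == 1)   # rows whose next row is in sequence
  x$z[index] <- diff(x$x)[index]
  x
}

set.seed(1)                              # arbitrary seed, for reproducibility
n <- 2000                                # arbitrary size (assumption)
big <- data.frame(id = 1:n, x = rnorm(n))
big <- big[sort(sample(n, 0.9 * n)), ]   # keep 90% of the rows, in id order
big$z <- NA

t_loop <- system.time(res_loop <- dfLoop(big))
t_vec  <- system.time(res_vec  <- fun(big))

all.equal(res_loop$z, res_vec$z)         # the two approaches should agree
c(loop = t_loop[["elapsed"]], vectorised = t_vec[["elapsed"]])
```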
answered 2013-03-04T22:44:05.740

R is vector-based. Try this code; it does what your for loop does, but on the whole range at once:

i <- 1:(nrow(d)-1)
d[i+1, ]$id - d[i, ]$id == 1

You should see a logical vector of length nrow(d) - 1 that is TRUE wherever the condition holds. Save it:

cond <- (d[i+1, ]$id - d[i, ]$id == 1)

You can also get the positions of all the TRUE values:

(cond.pos <- which(cond))

Now you can assign values at the positions where the condition is true:

d[cond.pos, ]$z <- d[cond.pos+1, ]$x - d[cond.pos, ]$x

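There are many ways to achieve what you want, but it takes some experience to grasp the "vector-based" idea. In particular, as alexwhan pointed out, the diff function can save some typing in this particular example.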
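
The steps above use d as if it were already defined at top level. A minimal self-contained sketch, assuming a small made-up data frame (the id and x values are invented for illustration) and indexing the columns directly rather than whole rows, shows each intermediate value:

```r
# Toy data with a gap between id 3 and id 5 (values are made up).
d <- data.frame(id = c(1, 2, 3, 5, 6),
                x  = c(10, 12, 15, 20, 26),
                z  = NA_real_)

i <- seq_len(nrow(d) - 1)            # 1 2 3 4

# TRUE where the next row's id is exactly one more than this row's id
cond <- d$id[i + 1] - d$id[i] == 1   # TRUE TRUE FALSE TRUE

cond.pos <- which(cond)              # 1 2 4

# Fill z only at those positions; the other rows keep NA
d$z[cond.pos] <- d$x[cond.pos + 1] - d$x[cond.pos]
d$z                                  # 2 3 NA 6 NA
```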

answered 2013-03-04T22:58:24.143

This first computes all the "first differences" and then sets the non-consecutive rows to NA:

 df[1:(nrow(df)-1), "z"] <- df[-1, "x"] - df[-nrow(df), "x"]
 is.na(df[-nrow(df), "z"]) <- diff( df$id) !=1
 df
#
   id           x           z
1   1 -0.04493361  0.02874335
2   2 -0.01619026  0.96002647
3   3  0.94383621          NA
5   5  0.59390132  0.32507605
6   6  0.91897737 -0.13684107
7   7  0.78213630 -0.70757132
8   8  0.07456498 -2.06391668
9   9 -1.98935170  2.60917744
10 10  0.61982575          NA

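Negative indexing is useful for creating slightly shorter versions of a vector. The is.na<- function takes a logical vector on its RHS and sets to NA every entry of its LHS target for which that logical vector is TRUE.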
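
Both idioms can be tried in isolation. The following is a small sketch on made-up vectors (w and v are hypothetical names, not from the answer):

```r
# Negative indexing: drop the first or the last element.
w <- c(1, 4, 9, 16)
w[-1]                          # 4 9 16
w[-length(w)]                  # 1 4 9
w[-1] - w[-length(w)]          # 3 5 7, the same values as diff(w)

# is.na<- : entries where the right-hand side is TRUE become NA.
v <- c(10, 20, 30, 40)
is.na(v) <- c(FALSE, TRUE, FALSE, TRUE)
v                              # 10 NA 30 NA
```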

answered 2013-03-05T01:36:04.933

Not the prettiest, but it runs without a loop:

> df <- data.frame(id=1:10, x=rnorm(10))
> df <- df[c(1:3, 5:10), ]
> df$z <- NA
> df
   id           x  z
1   1 -1.91564886 NA
2   2  0.27260879 NA
3   3 -1.08563119 NA
5   5 -0.13747215 NA
6   6 -0.38367874 NA
7   7 -1.17825737 NA
8   8 -0.08521386 NA
9   9 -0.44392382 NA
10 10 -0.97192253 NA
> 
> temp = c(df$id,1:10)
> temp
 [1]  1  2  3  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10
> 
> idx = which(table(temp)<2)
> idx 
4 
4 
> 
> newdf = df[-idx,]
> newdf
   id           x  z
1   1 -1.91564886 NA
2   2  0.27260879 NA
3   3 -1.08563119 NA
6   6 -0.38367874 NA
7   7 -1.17825737 NA
8   8 -0.08521386 NA
9   9 -0.44392382 NA
10 10 -0.97192253 NA
> 
> newdf$z = newdf$x[2:nrow(df)] - newdf$x[1:(nrow(df)-1)]
> newdf
   id           x          z
1   1 -1.91564886  2.1882577
2   2  0.27260879 -1.3582400
3   3 -1.08563119  0.7019524
6   6 -0.38367874 -0.7945786
7   7 -1.17825737  1.0930435
8   8 -0.08521386 -0.3587100
9   9 -0.44392382 -0.5279987
10 10 -0.97192253         NA
> 
> newdf = rbind(newdf,df[idx,])
> newdf
   id           x          z
1   1 -1.91564886  2.1882577
2   2  0.27260879 -1.3582400
3   3 -1.08563119  0.7019524
6   6 -0.38367874 -0.7945786
7   7 -1.17825737  1.0930435
8   8 -0.08521386 -0.3587100
9   9 -0.44392382 -0.5279987
10 10 -0.97192253         NA
5   5 -0.13747215         NA
> 
> newdf = newdf[order(newdf$id),]
> newdf
   id           x          z
1   1 -1.91564886  2.1882577
2   2  0.27260879 -1.3582400
3   3 -1.08563119  0.7019524
5   5 -0.13747215         NA
6   6 -0.38367874 -0.7945786
7   7 -1.17825737  1.0930435
8   8 -0.08521386 -0.3587100
9   9 -0.44392382 -0.5279987
10 10 -0.97192253         NA
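
One thing worth checking in the session above: df[-idx, ] drops the row at position 4 (the one with id 5), not a row with id 4, so the z values for ids 3 and 5 differ from what dfLoop returns. As a sketch of a different technique, match() can look up each row's successor by id directly. This assumes ids are unique, and it finds the successor anywhere in the data frame, which is a different reading from dfLoop's adjacent-row check. The data and the name nxt are made up for illustration.

```r
# Toy data: id 4 is missing and ids 2 and 3 are out of order (made-up values).
df <- data.frame(id = c(1, 3, 2, 5, 6),
                 x  = c(10, 15, 12, 20, 26))

# Position of the row holding id + 1, or NA when no such row exists
nxt <- match(df$id + 1, df$id)   # 3 NA 2 5 NA

# Indexing with NA yields NA, so rows without a successor get z = NA
df$z <- df$x[nxt] - df$x
df$z                             # 2 NA 3 6 NA
```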
answered 2013-03-04T22:52:02.100