I have a CSV file with more than 2000 rows and 8 columns. The layout of the CSV is as follows.

col0   col1  col2 col3......
1.77   9.1   9.2  8.8
2.34   6.3   0.9  0.44
5.34   6.3   0.9  0.44
9.34   6.3   0.9  0.44........
.
.
.
2000rows with data as above

I am trying to aggregate specific sets of rows (set 1: rows 1-76, set 2: rows 96-121, ...) from the above CSV, e.g. the rows from 1.77 to 9.34, across all columns. The aggregate of each set should become one row in my output file. I have tried various methods, but I could only do it for a single set in the CSV file.

The output would be a CSV file containing the aggregate values for the specified intervals, as follows.

col0  col1  col2  col3
3.25   8.2   4.4   3.3  //(aggregate of rows 1-3)
2.2    3.3   9.9   1.2  //(aggregate of rows 6-10) 
and so on..

2 Answers

Here is one possible approach using only base packages:

# Arguments:
# - a data.frame
# - a list of row ranges, passed as a list 
#   of vectors c(startRowIndex, endRowIndex),
#   used to split the data.frame into sub-data.frames
# - a function that takes a sub-data.frame and returns 
#   the aggregated result
aggregateRanges <- function(DF,ranges,FUN){
  l <- lapply(ranges,function(x){ 
    return(FUN(DF[x[1]:x[2],]))
  }
  )
  return(do.call(rbind.data.frame,l))
}

# example data
data <- read.table(
  header=TRUE,
  text=
    "col0   col1  col2 col3
  1.77   9.1   9.2  8.8
  2.34   6.3   0.9  0.44
  5.34   6.3   0.9  0.44
  9.34   6.3   0.9  0.44
  7.32   4.5   0.3  0.42
  3.77   2.3   0.8  0.13
  2.51   1.4   0.7  0.21
  5.44   5.7   0.7  0.18
  1.12   6.1   0.6  0.34")

# e.g. aggregate by summing sub-data.frames rows
result <- 
aggregateRanges(
  data,
  ranges=list(c(1,3),c(4,7),c(8,9)),
  FUN=function(dfSubset) { 
    rowsum.data.frame(dfSubset,group=rep.int(1,nrow(dfSubset)))
  }
)


> result
    col0 col1 col2 col3
1   9.45 21.7 11.0 9.68
11 22.94 14.5  2.7 1.20
12  6.56 11.8  1.3 0.52
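
Since the question asks for a CSV file as output, the same idea can be sketched with column means and write.csv. This is a minimal sketch, not the code from the answer above: the small data frame stands in for the real file (which would be loaded with read.csv), the ranges are illustrative, and the output path is a hypothetical temporary file.

```r
# Stand-in for the real CSV; in practice this would come from read.csv()
data <- data.frame(
  col0 = c(1.77, 2.34, 5.34, 9.34, 7.32, 3.77, 2.51, 5.44, 1.12),
  col1 = c(9.1, 6.3, 6.3, 6.3, 4.5, 2.3, 1.4, 5.7, 6.1)
)

# Illustrative row ranges: c(startRowIndex, endRowIndex)
ranges <- list(c(1, 3), c(4, 7), c(8, 9))

# Compute one vector of column means per range and stack them row-wise
means <- do.call(
  rbind,
  lapply(ranges, function(r) colMeans(data[r[1]:r[2], , drop = FALSE]))
)
result <- as.data.frame(means)

# Write one aggregated row per range (hypothetical output location)
out_file <- tempfile(fileext = ".csv")
write.csv(result, out_file, row.names = FALSE)
```

Swapping colMeans for colSums would give sums instead of means, one row per range.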
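Answered 2013-09-19T08:13:53.023

Taking into account what Manetheran pointed out, you should add a column indicating which set each row belongs to, if you have not done so already.

The data.table way: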

require(data.table)

set.seed(123)
dt <- data.table(col1=rnorm(100),col2=rnorm(100),new=rep(c(1,2),each=50))

dt[,lapply(.SD,mean),by="new"]

   new       col1        col2
1:   1 0.03440355 -0.25390043
2:   2 0.14640827  0.03880684

You can replace mean with any other "aggregation function".

Answered 2013-09-19T08:00:18.977
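
If the set column does not exist yet, it could be built from the row ranges first. The following is a rough sketch in base R under assumed inputs: the ranges, the toy data frame and the names set_id and agg are hypothetical, and the column is called new only to match the example above.

```r
# Hypothetical row ranges: c(startRowIndex, endRowIndex)
ranges <- list(c(1, 3), c(6, 10))

# Toy data standing in for the real CSV contents
n <- 12
df <- data.frame(col1 = 1:n, col2 = (1:n) * 10)

# Label each row with the index of the range it falls in;
# rows outside every range stay NA
set_id <- rep(NA_integer_, n)
for (i in seq_along(ranges)) {
  set_id[ranges[[i]][1]:ranges[[i]][2]] <- i
}
df$new <- set_id

# The formula interface of aggregate() omits rows with NA by default,
# so only rows that belong to a set are aggregated
agg <- aggregate(cbind(col1, col2) ~ new, data = df, FUN = mean)
```

With a data.table, the same labelled column can be passed to by as in the example above, filtering out unlabelled rows first, e.g. dt[!is.na(new), lapply(.SD, mean), by = "new"].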