2

I have a dataframe

df <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4), 
  date = c("1985-06-19", "1985-06-19", "1985-06-19", "1985-08-01", 
           "1985-08-01", "1990-06-19", "1990-06-19", "1990-06-19", 
           "1990-06-19", "2000-05-12"), 
  spp = c("a", "b", "c", "c", "d", "b", "c", "d", "a", "b"),
  y = rpois(10, 5))

   id       date spp y
1   1 1985-06-19   a 6
2   1 1985-06-19   b 3
3   1 1985-06-19   c 7
4   2 1985-08-01   c 7
5   2 1985-08-01   d 6
6   3 1990-06-19   b 5
7   3 1990-06-19   c 4
8   3 1990-06-19   d 4
9   3 1990-06-19   a 6
10  4 2000-05-12   b 6

I want to expand it so that there is every combination of id and spp and have y = 0 for every combination that is not currently in the dataframe. The dataframe is currently about 100,000 rows and 15 columns. When expanded it would be about 300,000 columns (there are 17 unique values of spp in my actual dataset).

For every value of id the date is the same (e.g. when id = 2, date always = 1985-08-01). In my real dataset all the columns except spp and y can be specified by the id.

I want to end up with something like:

   id       date spp y
   1 1985-06-19   a 6
   1 1985-06-19   b 3
   1 1985-06-19   c 7
   1 1985-06-19   d 0*
   2 1985-08-01   a 0*
   2 1985-08-01   b 0*
   2 1985-08-01   c 7
   2 1985-08-01   d 6
   3 1990-06-19   b 5
   3 1990-06-19   c 4
   3 1990-06-19   d 4
   3 1990-06-19   a 6
   4 2000-05-12   a 0*
   4 2000-05-12   b 6
   4 2000-05-12   c 0*
   4 2000-05-12   d 0*
  • Indicate added rows

I will likely have to do this in the future with potentially much larger data frames, so a quick, efficient (time and memory) way to do this would be appreciated but any solution would satisfy me. I figure there should be ways to use the dplyr, data.table, or reshape packages but I'm not very familiar with any of them. I'm not sure if it would be easiest to expand just rows id, spp, and y, then do a left_join() or merge() to recombine date (and all the other variables in my real dataframe) based on id?

4

3 回答 3

5

expand.grid在这里是一个有用的功能,

mergedData <- merge(
    expand.grid(id = unique(df$id), spp = unique(df$spp)),
    df, by = c("id", "spp"), all =T)

mergedData[is.na(mergedData$y), ]$y <- 0

mergedData$date <- rep(levels(df$date),
                       each = length(levels(df$spp)))

由于您实际上并没有对数据的子集做任何事情,我认为这plyr不会有帮助,也许使用data.table.

于 2014-02-27T05:21:09.193 回答
2

我会去第二种方式,希望这有帮助

x<-unique(df$id)
y<-unique(df$spp)
newdf<-data.frame(x=rep(x,each=length(y)),y=rep(y, length(x)))
merged<-merge(newdf, df, by.x=c(x,y), by.y=c("id","spp"), all=T)
于 2014-02-27T05:02:06.590 回答
2

complete开发版本中有一个新功能tidyr可以做到这一点。当然内部complete使用expand.grid

# get new version of tidyr
devtools::install_github("hadley/tidyr")
# load package
require(tidyr)
# calculations
complete(df, c(id, date), spp, fill = list(y = 0))
##    id       date spp y
## 1   1 1985-06-19   a 5
## 2   1 1985-06-19   b 3
## 3   1 1985-06-19   c 5
## 4   1 1985-06-19   d 0
## 5   2 1985-08-01   a 0
## 6   2 1985-08-01   b 0
## 7   2 1985-08-01   c 4
## 8   2 1985-08-01   d 9
## 9   3 1990-06-19   a 8
## 10  3 1990-06-19   b 3
## 11  3 1990-06-19   c 5
## 12  3 1990-06-19   d 6
## 13  4 2000-05-12   a 0
## 14  4 2000-05-12   b 3
## 15  4 2000-05-12   c 0
## 16  4 2000-05-12   d 0
于 2015-05-22T15:51:11.550 回答