1

I would like to know how to rearrange the source data (table) in order to output the desired table using R or SQL, which are displayed below.

Since looping is very slow in R, and my dataset is quite large... it's not preferred to have too much looping in the script. The efficiency is important.

Source data table:

Date    | Country | ID | Fruit  | Favorite | Money
20120101  US        1    Apple     Book      100
20120101  US        2    Orange    Knife     150
20120101  US        3    Banana    Watch     80
20120101  US        4    Melon     Water     90
20120102  US        1    Apple     Phone     120
20120102  US        2    Apple     Knife     130
20120102  US        3    Banana    Watch     100           
.....     ......    ..   .....     ......    ......

Output table:

Date    | Country | Field   | ID 1 | ID 2  | ID 3  | ID 4
20120101  US        Fruit     Apple  Orange  Banana  Melon
20120101  US        Favorite  Book   Knife   Watch   Water
20120101  US        Money     100    150     80      90
20120102  US        Fruit     Apple  Apple   Banana  N.A.
....      ....      ....      ....   ....    ....    ....
4

1 回答 1

0

这是 R 中的一种方法,使用您的示例数据:

x <- cbind(mydf[, c("Date", "Country", "ID")], 
           stack(mydf[, c("Fruit", "Favorite", "Money")]))
reshape(x, direction = "wide", idvar = c("Date", "Country", "ind"), timevar="ID")
#        Date Country      ind values.1 values.2 values.3 values.4
# 1  20120101      US    Fruit    Apple   Orange   Banana    Melon
# 5  20120102      US    Fruit    Apple    Apple   Banana     <NA>
# 8  20120101      US Favorite     Book    Knife    Watch    Water
# 12 20120102      US Favorite    Phone    Knife    Watch     <NA>
# 15 20120101      US    Money      100      150       80       90
# 19 20120102      US    Money      120      130      100     <NA>

为了与其他选项相结合,这里有一个melt+dcast方法(可以从“data.table”或“reshape2”中获取)和一个“dplyr”+“tidyr”方法。

library(data.table)
dcast(
  suppressWarnings(
    melt(as.data.table(mydf), c("Date", "Country", "ID"))),
    ... ~ ID, value.var = "value")
#        Date Country variable     1      2      3     4
# 1: 20120101      US    Fruit Apple Orange Banana Melon
# 2: 20120101      US Favorite  Book  Knife  Watch Water
# 3: 20120101      US    Money   100    150     80    90
# 4: 20120102      US    Fruit Apple  Apple Banana    NA
# 5: 20120102      US Favorite Phone  Knife  Watch    NA
# 6: 20120102      US    Money   120    130    100    NA  

library(dplyr)
library(tidyr)
mydf %>%
  gather(variable, value, Fruit:Money) %>%
  spread(ID, value)
#       Date Country variable     1      2      3     4
# 1 20120101      US    Fruit Apple Orange Banana Melon
# 2 20120101      US Favorite  Book  Knife  Watch Water
# 3 20120101      US    Money   100    150     80    90
# 4 20120102      US    Fruit Apple  Apple Banana  <NA>
# 5 20120102      US Favorite Phone  Knife  Watch  <NA>
# 6 20120102      US    Money   120    130    100  <NA>

在这个答案中,mydf定义为:

mydf <- structure(
  list(Date = c(20120101L, 20120101L, 20120101L, 
                20120101L, 20120102L, 20120102L, 20120102L), 
       Country = c("US", "US", "US", "US", "US", "US", "US"), 
       ID = c(1L, 2L, 3L, 4L, 1L, 2L, 3L),
       Fruit = c("Apple", "Orange", "Banana", "Melon", 
                 "Apple", "Apple", "Banana"), 
       Favorite = c("Book", "Knife", "Watch", "Water", 
                    "Phone", "Knife", "Watch"), 
       Money = c(100L, 150L, 80L, 90L, 120L, 130L, 100L)), 
  .Names = c("Date", "Country", "ID", 
             "Fruit", "Favorite", "Money"), 
  class = "data.frame", row.names = c(NA, -7L))
于 2013-07-18T09:19:35.397 回答