I checked similar questions on SO, but none of them answers my question exactly.

My problem is this: suppose User 1 has 6 purchases and User 2 has 2. The purchase data looks like this:

set.seed(1234)
purchase <- data.frame(id = c(rep("User1", 6), rep("User2", 2)),
                       purchaseid = sample(seq(1, 100, 1), 8),
                       purchaseDate = seq(Sys.Date(), Sys.Date() + 7, 1),
                       price = sample(seq(30, 200, 10), 8))
#
users <- data.frame(id = c("User1","User2"),
                    uname = c("name1", "name2"),
                    uaddress = c("add1", "add2"))
> purchase
     id purchaseid purchaseDate price
1 User1         12   2019-09-27   140
2 User1         62   2019-09-28   110
3 User1         60   2019-09-29   200
4 User1         61   2019-09-30   190
5 User1         83   2019-10-01    60
6 User1         97   2019-10-02   150
7 User2          1   2019-10-03   160
8 User2         22   2019-10-04   120

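The desired final data has one row per user, keeping the user name, address, and so on, followed by columns for up to 20 purchases. The purchase data must be laid out sequentially in that same row. The rule is: exactly one row per user, and if a user has fewer than 20 purchases, the remaining fields should be empty.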

So the final data should look like this (only the first four purchases are shown):

  id uname uaddr p1id     p1date p1price p2id     p2date p2price p3id     p3date p3price p4id
1 User1 name1  add1   12 2019-09-27     140   62 2019-09-28     110   60 2019-09-29     200   61
2 User2 name2  add2    1 2019-10-03     160   22 2019-10-04     120   NA       <NA>      NA   NA
      p4date p4price
1 2019-09-30     190
2       <NA>      NA
enddata <- data.frame(id = c("User1", "User2"),
                      uname = c("name1", "name2"),
                      uaddr = c("add1", "add2"),
                      p1id = c(12,1),
                      p1date = c("2019-09-27","2019-10-03"),
                      p1price = c(140, 160),
                      p2id = c(62, 22),
                      p2date = c("2019-09-28", "2019-10-04"),
                      p2price = c(110, 120),
                      p3id = c(60, NA),
                      p3date = c("2019-09-29", NA),
                      p3price = c(200, NA),
                      p4id = c(61, NA),
                      p4date = c("2019-09-30", NA),
                      p4price = c(190, NA))

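I used reshape to convert each user's data to wide format; the idea was to do this in a loop over the user IDs. I then combined the results using rbindlist with fill = TRUE, but this time I ran into a column name problem: after reshaping, each user's result has different column names, and without a fixed number of columns you cannot set the names either.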

Is there an elegant solution to this?

2 Answers

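There is no need to process each id separately. Instead, we can operate by id within a single data frame. Below is a tidyverse approach. You can stop the chain at any point to inspect the intermediate output. I've added comments to explain what the code is doing, but let me know if anything is unclear.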

library(tidyverse)

dat = users %>% 
  # Join purchase data to user data
  left_join(purchase, by = "id") %>% 
  arrange(purchaseDate) %>% 
  # Create a count column to assign a sequence number to each purchase within each id.
  # We'll use this later to create columns for each purchase event with a unique 
  # sequence number for each purchase.
  group_by(id) %>% 
  mutate(seq=1:n()) %>% 
  ungroup %>% 
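  # Reshape data frame from "wide" to "long" format. gather() stacks columns of
  # different types, so the dates become plain numbers until they are converted back.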
  gather(key, value, purchaseid:price) %>% 
  arrange(seq) %>% 
  # Paste together the "key" and "seq" columns (the resulting column will still be 
  # called "key"). This will allow us to spread the data frame to one row per id 
  # with each purchase event properly numbered.
  unite(key, key, seq, sep="_") %>% 
  mutate(key = factor(key, levels=unique(key))) %>% 
  spread(key, value) %>% 
  # Convert date columns back to Date class
  mutate_at(vars(matches("Date")), as.Date, origin="1970-01-01")

dat
     id uname uaddress purchaseid_1 purchaseDate_1 price_1 purchaseid_2 purchaseDate_2 price_2
1 User1 name1     add1           12     2019-09-27     140           62     2019-09-28     110
2 User2 name2     add2            1     2019-10-03     160           22     2019-10-04     120
  purchaseid_3 purchaseDate_3 price_3 purchaseid_4 purchaseDate_4 price_4 purchaseid_5 purchaseDate_5
1           60     2019-09-29     200           61     2019-09-30     190           83     2019-10-01
2           NA           <NA>      NA           NA           <NA>      NA           NA           <NA>
  price_5 purchaseid_6 purchaseDate_6 price_6
1      60           97     2019-10-02     150
2      NA           NA           <NA>      NA
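
gather() and spread() have since been superseded in tidyr by pivot_longer() and pivot_wider(). As a rough sketch of the same idea, pivot_wider() can spread several value columns at once, which keeps the Date class without converting back. This assumes a tidyr version that supports the names_vary argument (1.2.0 or later, as far as I know); the toy data uses fixed dates instead of Sys.Date(), and the column name pno is just an illustrative choice.

```r
library(dplyr)
library(tidyr)

# Toy data mirroring the question, with fixed dates instead of Sys.Date()
users <- data.frame(id = c("User1", "User2"),
                    uname = c("name1", "name2"),
                    uaddress = c("add1", "add2"))
purchase <- data.frame(
  id = c(rep("User1", 6), rep("User2", 2)),
  purchaseid = c(12, 62, 60, 61, 83, 97, 1, 22),
  purchaseDate = seq(as.Date("2019-09-27"), by = "day", length.out = 8),
  price = c(140, 110, 200, 190, 60, 150, 160, 120)
)

wide <- users %>%
  left_join(purchase, by = "id") %>%
  arrange(id, purchaseDate) %>%
  # Number each user's purchases 1, 2, ... in date order
  group_by(id) %>%
  mutate(pno = row_number()) %>%
  ungroup() %>%
  # Spread all three purchase fields at once; "slowest" groups the columns
  # by purchase number (purchaseid_1, purchaseDate_1, price_1, purchaseid_2, ...)
  pivot_wider(
    id_cols = c(id, uname, uaddress),
    names_from = pno,
    values_from = c(purchaseid, purchaseDate, price),
    names_vary = "slowest"
  )

wide
```

Because the values are never stacked into a single column, each field keeps its original type, so no as.Date() step is needed at the end.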
answered 2019-09-27T18:14:42.893

Another option, using data.table:

library(data.table)

# pivot to wide format
setDT(users)
setDT(purchase)[, pno := rowid(id)]
ans <- dcast(purchase[users, on=.(id)], id + uname + uaddress ~ pno, 
    value.var=c("purchaseid","purchaseDate", "price"))

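# reorder columns so that each purchase's fields sit together;
# sort on the numeric suffix so that e.g. _10 comes after _9
nm <- grep("_[0-9]+$", names(ans), value=TRUE)
setcolorder(ans, c(setdiff(names(ans), nm), nm[order(as.integer(sub(".*_", "", nm)))]))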
ans

Output:

      id uname uaddress purchaseid_1 purchaseDate_1 price_1 purchaseid_2 purchaseDate_2 price_2 purchaseid_3 purchaseDate_3 price_3 purchaseid_4 purchaseDate_4 price_4 purchaseid_5 purchaseDate_5 price_5 purchaseid_6 purchaseDate_6 price_6
1: User1 name1     add1           12     2019-09-30     140           62     2019-10-01     110           60     2019-10-02     200           61     2019-10-03     190           83     2019-10-04      60           97     2019-10-05     150
2: User2 name2     add2            1     2019-10-06     160           22     2019-10-07     120           NA           <NA>      NA           NA           <NA>      NA           NA           <NA>      NA           NA           <NA>      NA
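
The output above has only as many purchase columns as the busiest user has purchases, whereas the question asks for a fixed 20 slots per user. One possible way to pad it, sketched here under the assumption that the cap is 20, is to cross join users with the slot numbers before casting. The names max_purchases, grid and padded are made up for this illustration, and the toy data uses fixed dates instead of Sys.Date().

```r
library(data.table)

# Toy data mirroring the question, with fixed dates instead of Sys.Date()
users <- data.table(id = c("User1", "User2"),
                    uname = c("name1", "name2"),
                    uaddress = c("add1", "add2"))
purchase <- data.table(
  id = c(rep("User1", 6), rep("User2", 2)),
  purchaseid = c(12, 62, 60, 61, 83, 97, 1, 22),
  purchaseDate = seq(as.Date("2019-09-27"), by = "day", length.out = 8),
  price = c(140, 110, 200, 190, 60, 150, 160, 120)
)

max_purchases <- 20L  # hypothetical name; the cap of 20 comes from the question

# Number each user's purchases 1, 2, ... in row order
purchase[, pno := rowid(id)]

# One row per (user, slot); slots without a matching purchase become NA
grid <- CJ(id = users$id, pno = seq_len(max_purchases))
padded <- purchase[grid, on = .(id, pno)]

# Attach the user details and cast to one row per user
ans <- dcast(users[padded, on = .(id)], id + uname + uaddress ~ pno,
             value.var = c("purchaseid", "purchaseDate", "price"))

# Group the columns by purchase number, sorting on the numeric suffix
nm <- grep("_[0-9]+$", names(ans), value = TRUE)
setcolorder(ans, c(setdiff(names(ans), nm),
                   nm[order(as.integer(sub(".*_", "", nm)))]))

dim(ans)
```

Every user now gets the same 60 purchase columns, so the column names no longer depend on the data. Note that in this sketch any purchases beyond the 20th would be dropped silently by the join against grid.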
answered 2019-09-30T00:55:40.277