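The dataset describes repeated measurements across multiple clusters, with each measurement-cluster pair stored in its own column. I'd like to reshape the data into a long(er) format so that one column identifies the cluster, while each measurement keeps its own column.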

library(stringr)  # provides the `fruit` character vector sampled below

# Current format
df_wider <- data.frame(
  id = 1:5,
  fruit_1 = sample(fruit, size = 5),
  date_1 = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 5),
  number_1 = sample(1:100, 5),
  fruit_2 = sample(fruit, size = 5),
  date_2 = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 5),
  number_2 = sample(1:100, 5),
  fruit_3 = sample(fruit, size = 5),
  date_3 = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 5),
  number_3 = sample(1:100, 5)
)

# Desired format
df_longer <- data.frame(
  id = rep(1:5, each = 3),
  cluster = rep(1:3, 5),
  fruit = sample(fruit, size = 15),
  date = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 15),
  number = sample(1:100, 15)
)

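The real dataset contains up to 25 clusters with 100 measurements each. I tried using tidyr::gather() and tidyr::pivot_longer() to iterate over each measurement, but the intermediate data frames grew exponentially in size. Doing it in a single step with tidyr::pivot_longer() doesn't seem possible because the values are of different classes. I can't think of a way to vectorise this so that it scales.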

2 Answers

We can use melt from data.table:

library(data.table)
melt(setDT(df_wider), measure.vars = patterns('^fruit', '^date', '^number'),
     value.name = c('fruit', 'date', 'number'), variable.name = 'cluster')
#    id cluster        fruit       date number
# 1:  1       1         date 2020-04-16     17
# 2:  2       1       quince 2020-01-27      7
# 3:  3       1      coconut 2020-04-19     33
# 4:  4       1  pomegranate 2020-02-27     55
# 5:  5       1    persimmon 2020-02-20     62
# 6:  1       2   kiwi fruit 2020-01-14    100
# 7:  2       2    cranberry 2020-03-15     97
# 8:  3       2     cucumber 2020-03-16      5
# 9:  4       2    persimmon 2020-03-06     81
#10:  5       2         date 2020-04-17     30
#11:  1       3      apricot 2020-04-13     86
#12:  2       3       banana 2020-04-17     42
#13:  3       3     bilberry 2020-02-23     88
#14:  4       3 blackcurrant 2020-02-25     10
#15:  5       3       raisin 2020-02-09     87
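
One detail worth keeping in mind: when several patterns are passed, melt returns cluster as a factor that numbers the matched column groups in order (1, 2, 3, ...) rather than parsing the suffix from the column names. A minimal sketch, assuming the suffixes are consecutive integers starting at 1 and using a small made-up table in place of df_wider, that converts the column to integer:

```r
library(data.table)

# Small deterministic stand-in for df_wider (values are made up for illustration)
dt <- data.table(
  id       = 1:2,
  fruit_1  = c("apple", "pear"),
  date_1   = as.Date(c("2020-01-01", "2020-01-02")),
  number_1 = c(10L, 20L),
  fruit_2  = c("kiwi", "plum"),
  date_2   = as.Date(c("2020-02-01", "2020-02-02")),
  number_2 = c(30L, 40L)
)

long <- melt(dt,
             measure.vars  = patterns("^fruit", "^date", "^number"),
             value.name    = c("fruit", "date", "number"),
             variable.name = "cluster")

# cluster is a factor of group positions; go through character to get the integer
long[, cluster := as.integer(as.character(cluster))]

str(long)
```

Each value column keeps its own class (character, Date, integer), which is the property the question was asking for.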
Answered 2020-04-26T18:52:20.143

You can do this:

library(tidyr)
library(dplyr)

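# ".value" keeps each prefix (fruit, date, number) as its own column,
# so columns of different classes are never stacked together.
# \\d+ (rather than \\d) also captures multi-digit suffixes such as _12.
df_wider %>% pivot_longer(-id, 
                          names_pattern = "(.*)_(\\d+)", 
                          names_to = c(".value", "cluster"))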

# A tibble: 15 x 5
      id cluster fruit        date       number
   <int> <chr>   <fct>        <date>      <int>
 1     1 1       olive        2020-04-21     50
 2     1 2       elderberry   2020-02-23     59
 3     1 3       cherimoya    2020-03-07      9
 4     2 1       jujube       2020-03-22     88
 5     2 2       mandarine    2020-03-06     45
 6     2 3       grape        2020-04-23     78
 7     3 1       nut          2020-01-26     53
 8     3 2       cantaloupe   2020-01-27     70
 9     3 3       durian       2020-02-15     39
10     4 1       chili pepper 2020-03-17     60
11     4 2       raisin       2020-04-14     20
12     4 3       cloudberry   2020-03-11      4
13     5 1       honeydew     2020-01-04     81
14     5 2       lime         2020-03-23     53
15     5 3       ugli fruit   2020-01-13     26
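
If cluster is needed as an integer rather than a character column, the conversion can probably be done inside the same call. A minimal sketch, assuming tidyr 1.1.0 or later (which provides the names_transform argument) and a small made-up data frame whose second cluster is deliberately numbered 12 to show a multi-digit suffix:

```r
library(tidyr)

# Made-up stand-in for df_wider with a two-digit cluster suffix
wide <- data.frame(
  id        = 1:2,
  fruit_1   = c("apple", "pear"),
  date_1    = as.Date(c("2020-01-01", "2020-01-02")),
  number_1  = c(10L, 20L),
  fruit_12  = c("kiwi", "plum"),
  date_12   = as.Date(c("2020-02-01", "2020-02-02")),
  number_12 = c(30L, 40L)
)

long <- pivot_longer(
  wide,
  cols            = -id,
  names_pattern   = "(.*)_(\\d+)",                 # prefix, then numeric suffix
  names_to        = c(".value", "cluster"),
  names_transform = list(cluster = as.integer)     # parse the suffix as integer
)

print(long)
```

Unlike the position-based numbering some other reshaping tools produce, the cluster value here is parsed from the column name itself, so gaps in the numbering are preserved.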
Answered 2020-04-26T18:10:48.843