r - Using the reshape() function in R: from wide to long

Question

I am trying to rearrange data in R from the following:

Patient ID,Episode Number,Admission Date (A),Admission Date (H),Admission Time (A),Admission Time (H)
1,5,20/08/2011,21/08/2011,1200,1300
2,6,21/08/2011,22/08/2011,1300,1400
3,7,22/08/2011,23/08/2011,1400,1500
4,8,23/08/2011,24/08/2011,1500,1600

to something like:

Record Type,Patient ID,Episode Number,Admission Date,Admission Time
H,1,5,20/08/2011,1200
A,1,5,21/08/2011,1300
H,2,6,21/08/2011,1300
A,2,6,22/08/2011,1400
H,3,7,22/08/2011,1400
A,3,7,23/08/2011,1500
H,4,8,23/08/2011,1500
A,4,8,24/08/2011,1600

(I have used CSV format so that the samples are easier to use as test data.)

I tried the reshape() function, and it sort of works:

> reshape(foo, direction = "long", idvar = 1, varying = 3:dim(foo)[2], 
> sep = "..", timevar = "dataset")
     Patient.ID Episode.Number dataset Admission.Date Admission.Time
1.A.          1              5      A.     20/08/2011           1200
2.A.          2              6      A.     21/08/2011           1300
3.A.          3              7      A.     22/08/2011           1400
4.A.          4              8      A.     23/08/2011           1500
1.H.          1              5      H.     21/08/2011           1300
2.H.          2              6      H.     22/08/2011           1400
3.H.          3              7      H.     23/08/2011           1500
4.H.          4              8      H.     24/08/2011           1600

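But that is not the exact format I want (for each "Patient ID", I want the first row to be "H" and the second row to be "A").

Furthermore, when I extend it to the real data (which has 250+ columns), it fails: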

> reshape(realdata, direction = "long", idvar = 1, varying = 
> 6:dim(foo)[2], sep = "..", timevar = "dataset")
Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying,  : 
  'varying' arguments must be the same length

I think this is partly because the colnames look like this:

> colnames(foo)
  [1] "Unique.Key"                                    
  [2] "Campus.Code"                                   
  [3] "UR"                                            
  [4] "Terminal.digit"                                
  [5] "Admission.date..A."                      
  [6] "Admission.date..H."                     
  [7] "Admission.time..A."                      
  [8] "Admission.time..H."     
  .
  .
  .
 [31] "Medicare.Number"                               
 [32] "Payor"                                         
 [33] "Doctor.specialty"                              
 [34] "Clinic"     
  .
  .
  .
 [202] "Admission.Source..A."                    
 [203] "Admission.Source..H."

That is, there are "common columns" (without a suffix) in between the columns that have a suffix (I hope that makes sense).

score 2 · Accepted Answer

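The suggestions to use melt and cast (now dcast and family) from the "reshape" (now "reshape2") package won't get your data into the form you are looking for. In particular, if your end goal is the "semi-long" format you describe, you will have some additional processing to do.

There are two problems in your question:

The first is the ordering of the results. As @RichieCotton pointed out in his comment and @mac in his answer, a call to order() is enough to solve that.

The second is the error: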

Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying, :
  'varying' arguments must be the same length

This is because, as you guessed, your selection varying = 6:dim(foo)[2] includes columns that do not vary.
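A quick way to confirm this, as a minimal sketch: list the names in the catch-all range that lack the ".." separator. The toy data frame and the names candidates and nonVarying are made up for illustration and are not taken from the real data.

```r
# Toy data: two suffixed pairs with un-suffixed "common" columns between them
# (illustrative values only).
foo <- data.frame(Unique.Key = 1:4,
                  Admission.Date..A = 11:14, Admission.Date..H = 21:24,
                  Medicare.Number = letters[1:4], Payor = letters[1:4],
                  Admission.Source..A = 1:4, Admission.Source..H = 5:8)

# Names covered by a catch-all range such as 2:ncol(foo)
candidates <- names(foo)[2:ncol(foo)]

# Names without the ".." separator cannot be split into a variable name
# and a time, so they do not belong in 'varying'.
nonVarying <- candidates[!grepl("..", candidates, fixed = TRUE)]
nonVarying
# [1] "Medicare.Number" "Payor"
```

Any name listed here has to be kept out of varying.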

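An easy way to fix this is to use grep to identify which columns are varying, and to use that to specify your columns instead of the (incorrect) catch-all range you used. Here is a worked example: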

set.seed(1)
foo <- data.frame(Unique.Key = 1:4, Campus.Code = LETTERS[1:4], 
                  Admission.Date..A = 11:14, Admission.Date..H = 21:24,
                  Medicare.Number = letters[1:4], Payor = letters[1:4],
                  Admission.Source..A = rnorm(4), 
                  Admission.Source..H = rnorm(4))
foo
#   Unique.Key Campus.Code Admission.Date..A Admission.Date..H Medicare.Number
# 1          1           A                11                21               a
# 2          2           B                12                22               b
# 3          3           C                13                23               c
# 4          4           D                14                24               d
#   Payor Admission.Source..A Admission.Source..H
# 1     a          -0.6264538           0.3295078
# 2     b           0.1836433          -0.8204684
# 3     c          -0.8356286           0.4874291
# 4     d           1.5952808           0.7383247

Find out which columns are varying and use that as your varying argument:

varyingCols <- grep("\\.\\.A$|\\.\\.H$", names(foo))

out <- reshape(foo, direction = "long", idvar = "Unique.Key", 
               varying = varyingCols, sep = "..")
out[order(out$Unique.Key, rev(out$time)), ]
#     Unique.Key Campus.Code Medicare.Number Payor time Admission.Date Admission.Source
# 1.H          1           A               a     a    H             21        0.3295078
# 1.A          1           A               a     a    A             11       -0.6264538
# 2.H          2           B               b     b    H             22       -0.8204684
# 2.A          2           B               b     b    A             12        0.1836433
# 3.H          3           C               c     c    H             23        0.4874291
# 3.A          3           C               c     c    A             13       -0.8356286
# 4.H          4           D               d     d    H             24        0.7383247
# 4.A          4           D               d     d    A             14        1.5952808
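The rev(out$time) trick works here because reshape() stacks all the "A" rows before all the "H" rows. As a sketch of an alternative that does not rely on that layout, the wanted order can be stated explicitly through factor levels. The data values below are illustrative, and sorted is a name introduced only for this example.

```r
# Same shape as the example above, with fixed illustrative values
foo <- data.frame(Unique.Key = 1:4, Campus.Code = LETTERS[1:4],
                  Admission.Date..A = 11:14, Admission.Date..H = 21:24,
                  Medicare.Number = letters[1:4], Payor = letters[1:4],
                  Admission.Source..A = 1:4, Admission.Source..H = 5:8)

varyingCols <- grep("\\.\\.A$|\\.\\.H$", names(foo))
out <- reshape(foo, direction = "long", idvar = "Unique.Key",
               varying = varyingCols, sep = "..")

# Make the desired order explicit: "H" sorts before "A"
out$time <- factor(out$time, levels = c("H", "A"))

# Sort by key first, then by the factor's level order
sorted <- out[order(out$Unique.Key, out$time), ]
sorted
```

Because the order comes from the levels argument, it stays correct even if the rows of out are shuffled beforehand.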

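If your data is small (not many columns), you can count the positions of the varying columns by hand and specify the vector directly. As you have already noticed, any column not specified as idvar or varying is recycled appropriately.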

out <- reshape(foo, direction = "long", idvar = "Unique.Key", 
               varying = c(3, 4, 7, 8), sep = "..")
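One caveat: the column names in the question end in a trailing dot (for example Admission.date..A.), which the grep pattern used earlier would not match. The following is a sketch under that assumption, loading the question's sample with read.csv(), whose default check.names = TRUE produces those dotted names. The pattern, the csv variable and the choice of Record.Type for timevar are illustrative, not part of the original code.

```r
csv <- "Patient ID,Episode Number,Admission Date (A),Admission Date (H),Admission Time (A),Admission Time (H)
1,5,20/08/2011,21/08/2011,1200,1300
2,6,21/08/2011,22/08/2011,1300,1400
3,7,22/08/2011,23/08/2011,1400,1500
4,8,23/08/2011,24/08/2011,1500,1600"

# check.names = TRUE (the default) turns "Admission Date (A)" into
# "Admission.Date..A."
foo <- read.csv(text = csv)

# Match "..A." or "..H." at the end of the name, trailing dot included
varyingCols <- grep("\\.\\.[AH]\\.$", names(foo))

out <- reshape(foo, direction = "long", idvar = "Patient.ID",
               varying = varyingCols, sep = "..", timevar = "Record.Type")

# The split leaves "A." and "H." as time values; drop the trailing dot
out$Record.Type <- sub("\\.$", "", out$Record.Type)
out
```

The same pattern should scale to the wide real data, provided every suffixed column follows the same ..A. / ..H. convention.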

score 0 · Answer

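You could probably get what you want using melt and cast, or reshape, but you are after something quite specific, so it may be simpler to do the reshaping directly. You can subset the original data into two separate data frames (one for A, one for H) and then glue them back together.

The code below works on your sample data, but I have also tried to write it flexibly enough that it will hopefully work on your larger dataset, as long as the columns are consistently named with the ..A. and ..H. suffixes.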

# Grab the common columns and the "A" columns
# (use grepl to drop every column whose name ends in ".H.")
foo.a <- foo[, !grepl(x = colnames(foo), pattern = "\\.H\\.$")]

# Strip the "..A." suffix from the "A" column names
colnames(foo.a) <- sub(x = colnames(foo.a),
                       pattern = "(.*)\\.\\.A\\.$",
                       replacement = "\\1")
foo.a$Record.Type <- "A"

# Grab the common columns and the "H" columns
# (use grepl to drop every column whose name ends in ".A.")
foo.h <- foo[, !grepl(x = colnames(foo), pattern = "\\.A\\.$")]

# Strip the "..H." suffix from the "H" column names
colnames(foo.h) <- sub(x = colnames(foo.h),
                       pattern = "(.*)\\.\\.H\\.$",
                       replacement = "\\1")
foo.h$Record.Type <- "H"

# Stick them back together
new.foo <- rbind(foo.a, foo.h)

# Order by Patient.ID
new.foo <- new.foo[with(new.foo, order(Patient.ID)), ]

# Re-order the columns as you like (these positions fit the sample data)
new.foo <- new.foo[, c(1, 2, 5, 3, 4)]

This gives me:

> new.foo  
   Patient.ID Episode.Number Record.Type Admission.Date Admission.Time  
1          1              5           A     20/08/2011           1200  
5          1              5           H     21/08/2011           1300  
2          2              6           A     21/08/2011           1300  
6          2              6           H     22/08/2011           1400  
3          3              7           A     22/08/2011           1400  
7          3              7           H     23/08/2011           1500  
4          4              8           A     23/08/2011           1500  
8          4              8           H     24/08/2011           1600
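The output above lists "A" before "H" for each patient, whereas the question asks for "H" first. A minimal sketch of one way to get that, by adding a second sort key to order(); the small data frame here stands in for new.foo and its values are copied from the output above.

```r
# Stand-in for new.foo, as produced by the rbind() step
new.foo <- data.frame(
  Patient.ID     = rep(1:4, times = 2),
  Episode.Number = rep(5:8, times = 2),
  Record.Type    = rep(c("A", "H"), each = 4),
  Admission.Time = c(1200, 1300, 1400, 1500, 1300, 1400, 1500, 1600)
)

# FALSE sorts before TRUE, so rows where Record.Type is "H" come first
# within each Patient.ID
new.foo <- new.foo[order(new.foo$Patient.ID, new.foo$Record.Type != "H"), ]
new.foo
```

An explicit second key avoids depending on the order in which the two halves were passed to rbind().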
