r - Reshape long-format data to wide format using row names as column headers

Question

I have looked at the following question on Stack Overflow as well as links to other R help resources:

http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/

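I want to take my long-format data (two columns, 6,200 entries), which has non-unique names in the "id" column and different sequences in the "sequence" column, and reshape it so that the column headers are the "id" values, with all the sequences from "sequence" listed under each id.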

          id    sequence
1   CK1alpha TPSIAsDISLP
2   CK1alpha IASDIsLPIAT
3       CDK1 SVPSSsPGTSV
4       CDK1 EGCQGsPQRRG
5   CK1alpha DICEDsDIDGD
6 PKCepsilon IHGSDsVKSAE

What I would like to get:

id          CK1alpha    CDK1        PKCepsilon
sequence    TPSIAsDISLP SVPSSsPGTSV IHGSDsVKSAE
            IASDIsLPIAT EGCQGsPQRRG
            DICEDsDIDGD

I tried using reshape:

kinase_sub_wide <- reshape(kinase_substrate, idvar = "id", timevar = "sequence", direction = "wide")

But I get warning messages saying that multiple rows match:

Warning messages:
1: In reshapeWide(data, idvar = idvar, timevar = timevar,  ... :
  multiple rows match for sequence=TPSIAsDISLP: first taken
2: In reshapeWide(data, idvar = idvar, timevar = timevar,  ... :
  multiple rows match for sequence=IASDIsLPIAT: first taken
3: In reshapeWide(data, idvar = idvar, timevar = timevar,  ... :
  multiple rows match for sequence=RSQSRsNSPLP: first taken
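As a quick diagnostic, the sketch below counts how many rows each id has and how many id/sequence pairs are repeated. It uses only the six sample rows shown above, not the full table; the names `rows_per_id` and `n_dup_pairs` are illustrative. Reading the warning as "several rows share the same idvar and timevar values" is an inference from the message text, not something confirmed in this post.

```r
# Sample rows from the question (the full 6,200-row table is not reproduced here).
kinase_substrate <- data.frame(
  id = c("CK1alpha", "CK1alpha", "CDK1", "CDK1", "CK1alpha", "PKCepsilon"),
  sequence = c("TPSIAsDISLP", "IASDIsLPIAT", "SVPSSsPGTSV",
               "EGCQGsPQRRG", "DICEDsDIDGD", "IHGSDsVKSAE"),
  stringsAsFactors = FALSE
)

# Rows per id: any count above 1 means "id" alone does not identify a row.
rows_per_id <- table(kinase_substrate$id)
print(rows_per_id)

# Number of rows that repeat an earlier id/sequence pair.
n_dup_pairs <- sum(duplicated(kinase_substrate[, c("id", "sequence")]))
print(n_dup_pairs)
```

On the full data, a non-zero `n_dup_pairs` would be consistent with the "first taken" warnings.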

I also tried using spread:

kinase_substrate_wide <- spread(kinase_substrate, id, sequence)

But I get an error about duplicate identifiers:

> kinase_substrate_wide <- spread(kinase_substrate, id, sequence)
Error: Duplicate identifiers for rows (1812, 1813, 4469), (906, 3349), (253,     285, 2114, 2174, 3022, 4385, 4501), (155, 203, 218, 261, 316, 542, 682, 1021, 1123, 1238, 1492, 1919, 1938, 1997, 2064, 2139, 2323, 2387, 2597, 2826, 3058, 3377, 3899, 4024, 4135, 4241, 4314, 4617, 4733, 5055, 5289, 5467, 5726, 5952, 6165), (72, 272, 749, 1100, 2792, 3573, 3858, 4254, 4257), (209, 548, 637, 653, 1034, 1038, 1213, 1387, 1445, 1475, 1476, 1692, 1735, 2635, 3180, 4005, 4661, 4988, 5672, 5870, 6042), (21, 1802), (23, 24, 30, 49, 60, 86, 122, 127, 137, 177, 182, 227, 250, 260, 268, 270, 299, 347, 356, 361, 400, 424, 425, 448, 483, 488, 494, 509, 510, 512, 522, 523, 524, 540, 559, 572, 612, 614, 616, 622, 720, 750, 774, 794, 816, 820, 829, 866, 868, 912, 916, 918, 940, 946, 955, 962, 984, 992, 1004, 1013, 1054, 1055, 1070, 1073, 1083, 1086, 1105, 1140, 1154, 1164, 1179, 1222, 1228, 1230, 1284, 1295, 1316, 1318, 1333, 1334, 1348, 1356, 1375, 1383, 1389, 1390, 1406, 1421, 1444, 1458, 1473, 1474, 1490
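The error suggests that once `id` becomes the column key, nothing is left to tell apart the rows that share an id. One possible workaround, sketched here on the six sample rows, is to add a per-id row counter before widening. This assumes dplyr and tidyr (version 1.0.0 or later, for `pivot_wider()`) are installed; the helper column name `indx` is an arbitrary choice.

```r
library(dplyr)
library(tidyr)

# Sample rows from the question (not the full data set).
kinase_substrate <- data.frame(
  id = c("CK1alpha", "CK1alpha", "CDK1", "CDK1", "CK1alpha", "PKCepsilon"),
  sequence = c("TPSIAsDISLP", "IASDIsLPIAT", "SVPSSsPGTSV",
               "EGCQGsPQRRG", "DICEDsDIDGD", "IHGSDsVKSAE"),
  stringsAsFactors = FALSE
)

kinase_substrate_wide <- kinase_substrate %>%
  group_by(id) %>%
  mutate(indx = row_number()) %>%   # 1, 2, 3, ... within each id
  ungroup() %>%
  pivot_wider(names_from = id, values_from = sequence)

print(kinase_substrate_wide)
```

With the counter in place, each indx/id combination identifies a single row, and ids with fewer sequences are padded with `NA`. The same counter column should also let `spread(kinase_substrate, id, sequence)` run, since the remaining column then disambiguates the rows.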

How can I use either of these functions to convert the data to wide format, with each sequence placed under its corresponding id column?

Thanks in advance.

Edit #1

Following the suggestion in David's comment to include an index got me there:

reshape(transform(df, indx = ave(as.character(id), id, FUN = seq)), idvar = "indx", timevar = "id", direction = "wide")

The result is:

   indx sequence.CK1alpha sequence.CDK1 sequence.PKCepsilon sequence.GRK2 sequence.ICK sequence.CDK5 sequence.PKCbeta sequence.PAK1 sequence.GSK3beta
1     1       TPSIAsDISLP   SVPSSsPGTSV         IHGSDsVKSAE   DIDESsPGTEW  VDRLQsEPESI   AQAPSsPRVTE      GAQAPsSPRVT   AQERPsQAAPA       NIDNLsPKASH
2     2       IASDIsLPIAT   EGCQGsPQRRG         KLSGLsFKRNR   EKKEEsEESDD  DNRVPsPPPTG   PAEVKsPEKAK      DESTGsIAKRL   RSRTPsASNDD       FNYNPsPRKSS
5     3       DICEDsDIDGD   TLNSGsPEKTC         TALAPsTMKIK   EESEEsDDDMG  PDTKDsPVCPH   QKPAAsPRPRR      IVENLsSRCSW   KQKVDsLLENL       SSGAKsPSKSG
7     4       TFEDLsDVEGG   HVAVSsPTPET         VAKRLsLTMGG   MNSSIsSGSGS  LKVEGsPTEEA   DFTCGsPTAAG      YPVSPsDKVLI   RALRAsESGI_       FPDDLsLDHSD
16    5       PRSGRsPTGNT   TEVPRsPKHAH         EKLVLsKLYEE   RPTSIsWDGLD  ESERGsGSQSS   SDTVTsPQRAG      EKKVVsLNGEL   PGSPLsSQPVL       YSDSIsPFNKS
29    6       MSDTGsPGMQR   KYSPTsPTYSP         EILNRsPRNRK   KNRPTsISWDG         <NA>   GRGAEsPFEEK      LVNSAsAQKRS   SSKTAsLPGYG       PSRTAsFSESR
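For anyone who wants to reproduce this on a small scale, here is a minimal base R sketch using the six sample rows from the top of the question. It builds the index with `seq_along` over row positions, which is a variation on the call above rather than the exact code from the comment, and it stores the index as an integer column.

```r
# Sample rows from the question (not the full data set).
kinase_substrate <- data.frame(
  id = c("CK1alpha", "CK1alpha", "CDK1", "CDK1", "CK1alpha", "PKCepsilon"),
  sequence = c("TPSIAsDISLP", "IASDIsLPIAT", "SVPSSsPGTSV",
               "EGCQGsPQRRG", "DICEDsDIDGD", "IHGSDsVKSAE"),
  stringsAsFactors = FALSE
)

# Running counter within each id: 1, 2, 1, 2, 3, 1 for the rows above.
kinase_substrate$indx <- ave(seq_along(kinase_substrate$id),
                             kinase_substrate$id,
                             FUN = seq_along)

# The counter is the row identifier (idvar); id supplies the new columns (timevar).
kinase_sub_wide <- reshape(kinase_substrate,
                           idvar = "indx",
                           timevar = "id",
                           direction = "wide")

print(kinase_sub_wide)
```

Ids with fewer sequences than the longest group end up with `NA` in the remaining rows, as with `sequence.ICK` in the output above.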

Edit #2

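Is there a way to stop the reshape function from putting "sequence." in front of every name? Or do I have to resort to regular expressions to rename all the columns?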
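One way to sidestep the prefix altogether, sketched here as an alternative to `reshape` rather than an option of it, is to split the sequences by id, pad each group with `NA` to a common length, and bind the groups into a data frame. The column names are then the ids themselves. The names `by_id`, `padded` and `kinase_wide` are illustrative, and the data are the six sample rows only.

```r
# Sample rows from the question (not the full data set).
kinase_substrate <- data.frame(
  id = c("CK1alpha", "CK1alpha", "CDK1", "CDK1", "CK1alpha", "PKCepsilon"),
  sequence = c("TPSIAsDISLP", "IASDIsLPIAT", "SVPSSsPGTSV",
               "EGCQGsPQRRG", "DICEDsDIDGD", "IHGSDsVKSAE"),
  stringsAsFactors = FALSE
)

# Keep the ids in order of first appearance instead of alphabetical order.
ids <- factor(kinase_substrate$id, levels = unique(kinase_substrate$id))

# One character vector of sequences per id.
by_id <- split(kinase_substrate$sequence, ids)

# Pad every vector with NA up to the length of the longest group.
n_max <- max(lengths(by_id))
padded <- lapply(by_id, function(x) c(x, rep(NA_character_, n_max - length(x))))

# check.names = FALSE leaves the ids untouched as column names.
kinase_wide <- data.frame(padded, check.names = FALSE, stringsAsFactors = FALSE)

print(kinase_wide)
```

This drops the index column as well, which may or may not be wanted.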

Edit #3

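Use gsub to remove "sequence." from the column names and assign the result to a variable:

# "^" anchors the match at the start of the name; "\\." matches a literal dot
new_col_names <- gsub("^sequence\\.", "", names(DF))

Then apply new_col_names to the data frame: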

colnames(DF) <- new_col_names
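A small sketch of the renaming step on a hypothetical vector of column names (the names mirror the first few columns of the reshaped output above). In a regular expression an unescaped dot matches any character, so anchoring the pattern with `^` and escaping the dot keeps the replacement limited to the literal prefix.

```r
# Hypothetical column names, modelled on the reshaped output.
wide_names <- c("indx", "sequence.CK1alpha", "sequence.CDK1", "sequence.PKCepsilon")

# Strip only a literal "sequence." at the start of each name.
clean_names <- sub("^sequence\\.", "", wide_names)
print(clean_names)

# For comparison: the unescaped dot also matches other characters,
# so "sequences_CDK1" loses "sequences" and becomes "_CDK1".
loose <- sub("sequence.", "", "sequences_CDK1")
print(loose)
```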

Thanks, everyone, for your help!
