I have looked at the following question on Stack Overflow, as well as a link from another R help resource:
R: Reshape Data Long to Wide - understanding reshape parameters
http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/
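I want to take my long-format data (two columns, 6200 entries), which has non-unique names in the column "id" and different sequences in "sequence", and reshape it so that the values of "id" become the column headers, with all of the sequences from "sequence" listed under each id.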
  id         sequence
1 CK1alpha   TPSIAsDISLP
2 CK1alpha   IASDIsLPIAT
3 CDK1       SVPSSsPGTSV
4 CDK1       EGCQGsPQRRG
5 CK1alpha   DICEDsDIDGD
6 PKCepsilon IHGSDsVKSAE
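For anyone who wants to reproduce this, the six rows shown above can be rebuilt as a small data frame. This is only a sketch of the sample; the real data has 6200 rows, and I am assuming both columns are plain character vectors.

```r
# Recreate the six example rows (the full data set is much larger).
kinase_substrate <- data.frame(
  id = c("CK1alpha", "CK1alpha", "CDK1", "CDK1", "CK1alpha", "PKCepsilon"),
  sequence = c("TPSIAsDISLP", "IASDIsLPIAT", "SVPSSsPGTSV",
               "EGCQGsPQRRG", "DICEDsDIDGD", "IHGSDsVKSAE"),
  stringsAsFactors = FALSE
)

# Count how many sequences belong to each id; the ids are not unique.
id_counts <- table(kinase_substrate$id)
id_counts
```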
What I would like to get:
id       CK1alpha    CDK1        PKCepsilon
sequence TPSIAsDISLP SVPSSsPGTSV
         IASDIsLPIAT EGCQGsPQRRG
         DICEDsDIDGD
I tried using reshape:
kinase_sub_wide <- reshape(kinase_substrate, idvar = "id", timevar = "sequence", direction = "wide")
but I get warning messages saying that multiple rows match:
Warning messages:
1: In reshapeWide(data, idvar = idvar, timevar = timevar, ... :
multiple rows match for sequence=TPSIAsDISLP: first taken
2: In reshapeWide(data, idvar = idvar, timevar = timevar, ... :
multiple rows match for sequence=IASDIsLPIAT: first taken
3: In reshapeWide(data, idvar = idvar, timevar = timevar, ... :
multiple rows match for sequence=RSQSRsNSPLP: first taken
I also tried using spread:
kinase_substrate_wide <- spread(kinase_substrate, id, sequence)
but got a duplicate identifiers error:
> kinase_substrate_wide <- spread(kinase_substrate, id, sequence)
Error: Duplicate identifiers for rows (1812, 1813, 4469), (906, 3349), (253, 285, 2114, 2174, 3022, 4385, 4501), (155, 203, 218, 261, 316, 542, 682, 1021, 1123, 1238, 1492, 1919, 1938, 1997, 2064, 2139, 2323, 2387, 2597, 2826, 3058, 3377, 3899, 4024, 4135, 4241, 4314, 4617, 4733, 5055, 5289, 5467, 5726, 5952, 6165), (72, 272, 749, 1100, 2792, 3573, 3858, 4254, 4257), (209, 548, 637, 653, 1034, 1038, 1213, 1387, 1445, 1475, 1476, 1692, 1735, 2635, 3180, 4005, 4661, 4988, 5672, 5870, 6042), (21, 1802), (23, 24, 30, 49, 60, 86, 122, 127, 137, 177, 182, 227, 250, 260, 268, 270, 299, 347, 356, 361, 400, 424, 425, 448, 483, 488, 494, 509, 510, 512, 522, 523, 524, 540, 559, 572, 612, 614, 616, 622, 720, 750, 774, 794, 816, 820, 829, 866, 868, 912, 916, 918, 940, 946, 955, 962, 984, 992, 1004, 1013, 1054, 1055, 1070, 1073, 1083, 1086, 1105, 1140, 1154, 1164, 1179, 1222, 1228, 1230, 1284, 1295, 1316, 1318, 1333, 1334, 1348, 1356, 1375, 1383, 1389, 1390, 1406, 1421, 1444, 1458, 1473, 1474, 1490
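How can I use either of the above functions to convert the data to wide format, with each sequence listed under its corresponding id column?
Thanks in advance.
Edit #1
Using the suggestion from David's comment to include an index got me there: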
reshape(transform(df, indx = ave(as.character(id), id, FUN = seq)), idvar = "indx", timevar = "id", direction = "wide")
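To see what the extra index is doing, here is a minimal sketch on the six example rows only (df is rebuilt here so the snippet runs on its own). ave() numbers the rows within each id, and that running number becomes the row key that reshape needs:

```r
# Six example rows only; the real data has 6200.
df <- data.frame(
  id = c("CK1alpha", "CK1alpha", "CDK1", "CDK1", "CK1alpha", "PKCepsilon"),
  sequence = c("TPSIAsDISLP", "IASDIsLPIAT", "SVPSSsPGTSV",
               "EGCQGsPQRRG", "DICEDsDIDGD", "IHGSDsVKSAE"),
  stringsAsFactors = FALSE
)

# Number the rows within each id; the count restarts for every kinase.
# ave() keeps the type of its first argument, so indx is character.
df$indx <- ave(as.character(df$id), df$id, FUN = seq)
df$indx
# "1" "2" "1" "2" "3" "1"

# Every (indx, id) pair is now unique, so each cell gets exactly one row.
wide <- reshape(df, idvar = "indx", timevar = "id", direction = "wide")
names(wide)
# "indx" "sequence.CK1alpha" "sequence.CDK1" "sequence.PKCepsilon"
```

Ids with fewer sequences than the longest one are padded with NA, which matches the <NA> in the full output.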
The result is:
indx sequence.CK1alpha sequence.CDK1 sequence.PKCepsilon sequence.GRK2 sequence.ICK sequence.CDK5 sequence.PKCbeta sequence.PAK1 sequence.GSK3beta
1 1 TPSIAsDISLP SVPSSsPGTSV IHGSDsVKSAE DIDESsPGTEW VDRLQsEPESI AQAPSsPRVTE GAQAPsSPRVT AQERPsQAAPA NIDNLsPKASH
2 2 IASDIsLPIAT EGCQGsPQRRG KLSGLsFKRNR EKKEEsEESDD DNRVPsPPPTG PAEVKsPEKAK DESTGsIAKRL RSRTPsASNDD FNYNPsPRKSS
5 3 DICEDsDIDGD TLNSGsPEKTC TALAPsTMKIK EESEEsDDDMG PDTKDsPVCPH QKPAAsPRPRR IVENLsSRCSW KQKVDsLLENL SSGAKsPSKSG
7 4 TFEDLsDVEGG HVAVSsPTPET VAKRLsLTMGG MNSSIsSGSGS LKVEGsPTEEA DFTCGsPTAAG YPVSPsDKVLI RALRAsESGI_ FPDDLsLDHSD
16 5 PRSGRsPTGNT TEVPRsPKHAH EKLVLsKLYEE RPTSIsWDGLD ESERGsGSQSS SDTVTsPQRAG EKKVVsLNGEL PGSPLsSQPVL YSDSIsPFNKS
29 6 MSDTGsPGMQR KYSPTsPTYSP EILNRsPRNRK KNRPTsISWDG <NA> GRGAEsPFEEK LVNSAsAQKRS SSKTAsLPGYG PSRTAsFSESR
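Edit #2
Is there a way for the reshape function to avoid putting "sequence." in front of each name? Or do I have to resort to a regular expression to rename all of the columns?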
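One possible way to sidestep the prefix altogether, sketched here as an alternative to reshape rather than an option of it: split the sequences by id and pad each group with NA to the same length. The names by_id, n_max, and padded are just illustrative, and the example uses the six sample rows only.

```r
# Six example rows only; the real data has 6200.
df <- data.frame(
  id = c("CK1alpha", "CK1alpha", "CDK1", "CDK1", "CK1alpha", "PKCepsilon"),
  sequence = c("TPSIAsDISLP", "IASDIsLPIAT", "SVPSSsPGTSV",
               "EGCQGsPQRRG", "DICEDsDIDGD", "IHGSDsVKSAE"),
  stringsAsFactors = FALSE
)

# One character vector per id, kept in order of first appearance.
by_id <- split(df$sequence, factor(df$id, levels = unique(df$id)))

# Pad the shorter groups with NA so that all columns have equal length.
n_max <- max(lengths(by_id))
padded <- lapply(by_id, function(x) c(x, rep(NA_character_, n_max - length(x))))

# The list names become the column names, with no "sequence." prefix.
# Note: as.data.frame() may alter names that are not syntactically valid.
wide <- as.data.frame(padded, stringsAsFactors = FALSE)
names(wide)
# "CK1alpha" "CDK1" "PKCepsilon"
```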
Edit #3
I used gsub to remove "sequence." from the column names and assigned the result to a variable:
new_col_names <- gsub("^sequence\\.", "", names(DF))
Then I applied new_col_names to the data frame:
colnames(DF) <- new_col_names
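A small check of the renaming step, using a made-up vector of names (nms is only for illustration). In a regular expression an unescaped dot matches any character, so anchoring the pattern and escaping the dot is the stricter form:

```r
nms <- c("indx", "sequence.CK1alpha", "sequence.CDK1")

# Anchored pattern with an escaped dot: only a leading literal "sequence." is removed.
clean <- sub("^sequence\\.", "", nms)
clean
# "indx" "CK1alpha" "CDK1"

# The unescaped dot also matches other characters, such as the "X" here.
loose <- gsub("sequence.", "", "sequenceX.foo")
loose
# ".foo"
```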
Thank you all for helping me!