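I have a very large dataset that contains multiple groups. Every group holds the same information, but occasionally, and inconsistently, that information is recorded the wrong way round. In the example below, the GROUP1_A1 and GROUP2_A1 columns do not match in rows 3 and 4 (the alleles are flipped), so the remaining values in those rows are not comparable. To correct this, GROUP1_BETA should be multiplied by -1 in those rows (that is, only where the A1 columns disagree between groups; where they match, the beta should stay as it is).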
MARKER   GROUP1_A1  GROUP1_A2  GROUP1_BETA  GROUP1_SE  GROUP2_A1  GROUP2_A2  GROUP2_BETA  GROUP2_SE
rs10     A          C          -0.055       0.003      A          C          0.056        0.200
rs1000   A          G          0.208        0.100      A          G          0.208        0.001
rs10000  G          C          -0.134       0.009      C          G          -0.8624      0.010
rs10001  C          A          0.229        0.012      A          C          0.775        0.003
When dealing with frequencies, which lie between 0 and 1, I have been using:
# A mismatch is TRUE (1), giving |1 - f| = 1 - f; a match is FALSE (0), giving |0 - f| = f.
data$GROUP1_oppositeFrequency <- abs(
  (as.character(data$Group2_A1) != as.character(data$Group1_A1)) -
    as.numeric(data$Group1_Frequency)
)
However, because beta values can be negative, this approach does not work for them. Can anyone point me in the right direction?
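One possible direction, offered as a minimal sketch rather than a definitive answer: instead of subtracting from a logical and taking the absolute value, multiply the beta by a sign that is -1 where the A1 columns disagree and +1 where they agree. The data frame below re-types only the needed columns from the example table, and `GROUP1_BETA_aligned` is a hypothetical column name introduced here for illustration.

```r
# Toy data copied from the example table (only the columns needed here).
data <- data.frame(
  MARKER      = c("rs10", "rs1000", "rs10000", "rs10001"),
  GROUP1_A1   = c("A", "A", "G", "C"),
  GROUP2_A1   = c("A", "A", "C", "A"),
  GROUP1_BETA = c(-0.055, 0.208, -0.134, 0.229),
  stringsAsFactors = FALSE
)

# TRUE where the A1 allele differs between the two groups.
mismatch <- as.character(data$GROUP1_A1) != as.character(data$GROUP2_A1)

# Multiply by -1 on mismatched rows and by +1 on matching rows.
# GROUP1_BETA_aligned is a hypothetical name; overwrite GROUP1_BETA if preferred.
data$GROUP1_BETA_aligned <- data$GROUP1_BETA * ifelse(mismatch, -1, 1)

print(data)
```

Unlike the `abs()` trick, this keeps the sign of betas on matching rows and only negates the mismatched ones.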
Reproducible data:
data <- textConnection("SNP,GROUP1_A1,GROUP1_A2,GROUP1_Beta,GROUP1_SE,GROUP2_A1,GROUP2_A2,GROUP2_Beta,GROUP2_SE,GROUP3_A1,GROUP3_A2,GROUP3_Beta,GROUP3_SE
rs1050,C,T,0.0462,0.0035,T,C,0.007,0.0039,C,T,-0.007,0.009
rs1073,A,G,-0.0209,0.0035,A,G,0.0004,0.0031,A,G,-0.009,0.013
rs1075,C,T,-0.001,0.0039,T,C,-0.0013,0.0028,C,T,0.004,0.011
rs1085,C,G,-0.0001,0.0068,C,G,-0.0027,0.0032,C,G,-0.049,0.026
rs1127,C,T,0.0015,0.0044,T,C,0.0002,0.0029,C,T,-0.017,0.009
rs1312,A,G,-0.0014,0.0039,A,G,-0.0025,0.0029,A,G,0,0.01")
test_data <- read.csv(data, header = TRUE, sep = ",")
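With three groups, some reference has to be chosen. The sketch below assumes GROUP1 is the reference and aligns GROUP2 and GROUP3 to it; that choice is an assumption, not something stated in the question. It also assumes a row counts as flipped only when a group's A1/A2 pair is exactly the reference's A2/A1 pair, and that the allele columns should be swapped along with the sign change so the row stays self-consistent. The variables `ref` and `others` are hypothetical names, and the data is re-read inline only to keep the block self-contained.

```r
# Same reproducible data as above, read from an inline string.
test_data <- read.csv(text = "SNP,GROUP1_A1,GROUP1_A2,GROUP1_Beta,GROUP1_SE,GROUP2_A1,GROUP2_A2,GROUP2_Beta,GROUP2_SE,GROUP3_A1,GROUP3_A2,GROUP3_Beta,GROUP3_SE
rs1050,C,T,0.0462,0.0035,T,C,0.007,0.0039,C,T,-0.007,0.009
rs1073,A,G,-0.0209,0.0035,A,G,0.0004,0.0031,A,G,-0.009,0.013
rs1075,C,T,-0.001,0.0039,T,C,-0.0013,0.0028,C,T,0.004,0.011
rs1085,C,G,-0.0001,0.0068,C,G,-0.0027,0.0032,C,G,-0.049,0.026
rs1127,C,T,0.0015,0.0044,T,C,0.0002,0.0029,C,T,-0.017,0.009
rs1312,A,G,-0.0014,0.0039,A,G,-0.0025,0.0029,A,G,0,0.01",
  stringsAsFactors = FALSE)

ref    <- "GROUP1"               # assumption: GROUP1 is the reference group
others <- c("GROUP2", "GROUP3")  # groups to align to the reference

ref_a1 <- test_data[[paste0(ref, "_A1")]]
ref_a2 <- test_data[[paste0(ref, "_A2")]]

for (g in others) {
  a1   <- paste0(g, "_A1")
  a2   <- paste0(g, "_A2")
  beta <- paste0(g, "_Beta")

  # Flipped rows: this group's A1/A2 equal the reference's A2/A1.
  flipped <- test_data[[a1]] == ref_a2 & test_data[[a2]] == ref_a1

  # Negate the beta on flipped rows only.
  test_data[[beta]][flipped] <- -test_data[[beta]][flipped]

  # Swap the allele columns on the same rows so they match the reference.
  old_a1 <- test_data[[a1]][flipped]
  test_data[[a1]][flipped] <- test_data[[a2]][flipped]
  test_data[[a2]][flipped] <- old_a1
}

print(test_data[, c("SNP", "GROUP1_A1", "GROUP2_A1", "GROUP2_Beta", "GROUP3_Beta")])
```

On this sample only GROUP2 has flipped rows (rs1050, rs1075, rs1127), so GROUP3 is left unchanged. Rows that mismatch without being an exact A1/A2 swap are deliberately left alone here, since a sign flip would not obviously be the right correction for them.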