I have data of the following type (but with many more variables and individuals):
mydf <- data.frame (Inv = 1:6, varA = c(1,1,1, 0,1,1),
varB = c(1,0,1, 0, 1,1), varC = c(1,0,0, 0,1,1), varD = c(1,1,1, 0,1,1),
varE = c(1,0,1, 0, 1,1), varF = c(1,1,1, 0, 1,1))
mydf
  Inv varA varB varC varD varE varF
1   1    1    1    1    1    1    1
2   2    1    0    0    1    0    1
3   3    1    1    0    1    1    1
4   4    0    0    0    0    0    0
5   5    1    1    1    1    1    1
6   6    1    1    1    1    1    1
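I want to make all one-to-one comparisons (among variables and among individuals/subjects). Where duplicates exist, only one should be kept, and the names of the duplicated individuals/variables should be written as a log to a separate file.
For example, in the data above:
Among the variables:
varA is exactly the same as varD and varF, so I will keep only varA in the new data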
mydf$varA == mydf$varD
[1] TRUE TRUE TRUE TRUE TRUE TRUE
varB and varE have exactly the same data, so I will keep only varB
varC is unique
Among Inv (i.e. the subjects):
1, 5 and 6 are the same, so keep only 1
The resulting output would therefore be:
mydf <- data.frame (Inv = 1:4, varA = c(1,1,1, 0),
varB = c(1,0,1, 0), varC = c(1,0,0, 0))
  Inv varA varB varC
1   1    1    1    1
2   2    1    0    0
3   3    1    1    0
4   4    0    0    0
I can find the duplicated variables from the correlation matrix, where a correlation of 1 marks a duplicate:
cor(mydf[,-1])
          varA      varB      varC      varD      varE      varF
varA 1.0000000 0.6324555 0.4472136 1.0000000 0.6324555 1.0000000
varB 0.6324555 1.0000000 0.7071068 0.6324555 1.0000000 0.6324555
varC 0.4472136 0.7071068 1.0000000 0.4472136 0.7071068 0.4472136
varD 1.0000000 0.6324555 0.4472136 1.0000000 0.6324555 1.0000000
varE 0.6324555 1.0000000 0.7071068 0.6324555 1.0000000 0.6324555
varF 1.0000000 0.6324555 0.4472136 1.0000000 0.6324555 1.0000000
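As a rough sketch of reading the matrix programmatically, the pairs with a correlation of 1 could be pulled out of the upper triangle as below. The name `dup_pairs` and the tolerance `1e-8` are illustrative assumptions. Note also that a correlation of 1 only implies identical columns for 0/1 data like this; in general it only means a perfect linear relation, and a constant column gives `NA`.

```r
# Example data from the question
mydf <- data.frame(Inv = 1:6,
                   varA = c(1, 1, 1, 0, 1, 1), varB = c(1, 0, 1, 0, 1, 1),
                   varC = c(1, 0, 0, 0, 1, 1), varD = c(1, 1, 1, 0, 1, 1),
                   varE = c(1, 0, 1, 0, 1, 1), varF = c(1, 1, 1, 0, 1, 1))

cm <- cor(mydf[, -1])

# Look only above the diagonal so each pair is reported once,
# and compare with a small tolerance (assumed value) rather than == 1
hits <- which(upper.tri(cm) & abs(cm - 1) < 1e-8, arr.ind = TRUE)

# Translate matrix positions back into variable names
dup_pairs <- data.frame(var1 = rownames(cm)[hits[, "row"]],
                        var2 = colnames(cm)[hits[, "col"]])
dup_pairs
```

This lists the pairs but still leaves the grouping (varA with varD and varF, varB with varE) and the row comparison to be done separately.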
Can this process be automated?
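One possible direction, offered as a minimal sketch rather than a definitive solution: build a text key for every column and every row, use `duplicated()` to flag repeats, and `match()` to find the first item each repeat copies. The log file names and the object names (`col_key`, `row_key`, `col_log`, `row_log`, `result`) are hypothetical, and the key approach assumes the values are 0/1 codes that survive conversion to text exactly.

```r
# Example data from the question
mydf <- data.frame(Inv = 1:6,
                   varA = c(1, 1, 1, 0, 1, 1), varB = c(1, 0, 1, 0, 1, 1),
                   varC = c(1, 0, 0, 0, 1, 1), varD = c(1, 1, 1, 0, 1, 1),
                   varE = c(1, 0, 1, 0, 1, 1), varF = c(1, 1, 1, 0, 1, 1))

vars <- mydf[-1]   # everything except the Inv id column

# One string key per column and per row; identical keys mean identical data
col_key <- vapply(vars, paste, character(1), collapse = "_")
row_key <- do.call(paste, c(vars, sep = "_"))

dup_col <- duplicated(col_key)   # TRUE for every repeat after the first copy
dup_row <- duplicated(row_key)

# Log tables: each dropped item next to the first item it duplicates
col_log <- data.frame(dropped = names(vars)[dup_col],
                      kept    = names(vars)[match(col_key, col_key)[dup_col]])
row_log <- data.frame(dropped = mydf$Inv[dup_row],
                      kept    = mydf$Inv[match(row_key, row_key)[dup_row]])

# Hypothetical log file names, written to a temporary directory here
write.csv(col_log, file.path(tempdir(), "duplicate_variables.csv"), row.names = FALSE)
write.csv(row_log, file.path(tempdir(), "duplicate_subjects.csv"), row.names = FALSE)

# Keep Inv plus the first copy of each variable, and the first copy of each subject
result <- mydf[!dup_row, c(TRUE, !dup_col)]
result
```

On the example data this keeps Inv 1 to 4 with varA, varB and varC, and the logs record varD and varF as copies of varA, varE as a copy of varB, and subjects 5 and 6 as copies of subject 1. For non-integer values, exact comparison via text keys may not be appropriate and a tolerance-based comparison would be needed instead.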