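The way I would approach this is (as you mentioned above) to focus on the distance between variable 1 and variable 2. So I would create a new field for that distance (named diff below), computed as variable1 - variable2. Then I would sort the data frame by that column and split it row by row, sending alternating rows to two pots (in the code below, even rows end up in sample1 and odd rows in sample2). The following code demonstrates this: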
id<-1:2000
a<-runif(2000,-100,100)
b<-runif(2000,-200,200)
mydf <- data.frame(id,a,b)
mydf['diff'] <- mydf[['a']] - mydf[['b']]
mydf<-mydf[with(mydf, order(diff)), ]
head(mydf,20)
Output:
> head(mydf,20) #as you can see the dataframe is ordered by diff (ascending)
id a b diff
1732 1732 -95.96522 198.1666 -294.1318
187 187 -94.24905 196.9341 -291.1831
338 338 -95.31069 194.9997 -290.3104
231 231 -91.98249 194.0672 -286.0497
1513 1513 -97.01006 183.5874 -280.5974
715 715 -94.53303 185.1026 -279.6356
145 145 -99.73511 178.2460 -277.9811
979 979 -87.73586 190.0489 -277.7848
1165 1165 -85.53447 187.6254 -273.1598
1243 1243 -94.75502 176.8572 -271.6122
1208 1208 -77.32021 189.1589 -266.4791
1826 1826 -92.23949 171.6341 -263.8736
167 167 -98.84123 163.6960 -262.5372
1283 1283 -76.54766 185.8721 -262.4197
1391 1391 -72.04732 189.9422 -261.9896
322 322 -77.53867 183.4744 -261.0131
75 75 -88.04799 171.9066 -259.9546
882 882 -65.11661 193.8533 -258.9699
1119 1119 -77.59978 181.2392 -258.8390
1624 1624 -81.81879 175.9795 -257.7983
Now split the data frame:
samplea_1<-NULL
samplea_2<-NULL
sampleb_1<-NULL
sampleb_2<-NULL
id_1<-NULL
id_2<-NULL
diff_1<-NULL
diff_2<-NULL
# walk the sorted rows and alternate between the two samples
for ( i in 1:nrow(mydf) ) {
  if(i%%2==0) {
    # even-numbered rows of the sorted frame go to sample 1
    samplea_1 <- append(samplea_1,mydf$a[i])
    sampleb_1 <- append(sampleb_1,mydf$b[i])
    id_1 <- append(id_1,mydf$id[i])
    diff_1 <- append(diff_1,mydf$diff[i])
  } else {
    # odd-numbered rows go to sample 2
    samplea_2 <- append(samplea_2,mydf$a[i])
    sampleb_2 <- append(sampleb_2,mydf$b[i])
    id_2 <- append(id_2,mydf$id[i])
    diff_2 <- append(diff_2,mydf$diff[i])
  }
}
sample1<-data.frame(samplea_1,sampleb_1,id_1,diff_1)
sample2<-data.frame(samplea_2,sampleb_2,id_2,diff_2)
summary(sample1)
summary(sample2)
Output:
> summary(sample1)
samplea_1 sampleb_1 id_1 diff_1
Min. :-99.2058 Min. :-199.519 Min. : 1.0 Min. :-291.183
1st Qu.:-47.5615 1st Qu.:-100.917 1st Qu.: 495.8 1st Qu.:-105.851
Median : 1.3997 Median : 7.004 Median : 980.5 Median : -1.333
Mean : 0.7047 Mean : 2.044 Mean : 991.0 Mean : -1.340
3rd Qu.: 50.4087 3rd Qu.: 101.678 3rd Qu.:1482.8 3rd Qu.: 99.381
Max. : 99.8470 Max. : 199.833 Max. :2000.0 Max. : 291.797
> summary(sample2)
samplea_2 sampleb_2 id_2 diff_2
Min. :-99.7351 Min. :-199.9494 Min. : 2.0 Min. :-294.132
1st Qu.:-48.4339 1st Qu.: -99.7880 1st Qu.: 509.8 1st Qu.:-106.338
Median : -1.4627 Median : 6.8745 Median :1024.0 Median : -1.425
Mean : -0.7104 Mean : 0.9099 Mean :1010.0 Mean : -1.620
3rd Qu.: 48.1663 3rd Qu.: 94.7360 3rd Qu.:1513.2 3rd Qu.: 99.334
Max. : 99.9496 Max. : 199.8544 Max. :1996.0 Max. : 288.840
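As you can see, the diff column has almost the same mean in both samples, which is fairly intuitive because we sorted the data frame by that column. But the same also roughly holds for the samplea and sampleb columns! This happens because diff is derived from a and b. However, the higher the variance of the individual columns a and b, the less accurate the result will be.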
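For what it's worth, the same sort-and-alternate split can also be written without the loop. This is only a rough sketch: `split_by_diff` is a helper name I made up for illustration, the seed is arbitrary, and the data is the same kind of random data as above. Note that the `odd` pot here corresponds to sample2 in the loop version and `even` to sample1.

```r
# Hypothetical helper (not part of the code above): sort by a - b,
# then send alternating rows of the sorted frame to two pots.
split_by_diff <- function(df) {
  df$diff <- df$a - df$b
  df <- df[order(df$diff), ]
  odd <- seq_len(nrow(df)) %% 2 == 1   # TRUE for rows 1, 3, 5, ...
  list(odd = df[odd, ], even = df[!odd, ])
}

set.seed(42)  # arbitrary seed, only so the run is reproducible
mydf <- data.frame(id = 1:2000,
                   a  = runif(2000, -100, 100),
                   b  = runif(2000, -200, 200))
pots <- split_by_diff(mydf)

# Put the column means of both pots side by side to judge the balance
sapply(pots, function(p) colMeans(p[, c("a", "b", "diff")]))
```

The means of diff should come out very close by construction. How close the means of a and b are will depend on your data, so it is worth checking them like this before relying on the split.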
Hope it helps!