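OK, this has me quite confused and worried. As part of a routine procedure, I have been classifying the individual observations of a variable as TRUE/FALSE based on whether their value is above, or below/equal to, the median. However, I am getting behaviour in R that is largely unexpected when performing this simple test.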

So take this set of observations:

data=c(0.6666667, 0.8333, 0.6666667, 0.8333, 0.8333, 0.75, 0.9999, 0.7499667, 0.25, 0.6666667, 0.1667, 0.7499667, 0.5, 0.2500333, 0.3333667, 0.0834, 0.0001, 0.2500333, 0.8333, 0.9999, 0.9999, 0.2500333, 0.2500333, 0.3333667, 0.9166, 0.5, 0.2500333, 0.4166667, 0.0001, 0.1667333, 0.6666333, 0.0834, 0.1667, 0.6666333, 0.9166, 0.1667, 0.7499333, 0.9166, 0.9166, 0.9166, 0.7499667, 0.7499667, 0.4166667, 0.5, 0.2500333, 0.9166, 0.6666667, 0.1667333, 0.25, 0.0001, 0.3333667, 0.0001, 0.25, 0.0834, 0.9999, 0.0834, 0.1667, 0.5, 0.2500333, 0.3333667, 0.9166, 0.9166, 0.8333, 0.9166, 0.75, 0.0834, 0.4166667, 0.5, 0.0001, 0.9999, 0.8333, 0.6666667, 0.9166)

To classify these values, I do:

data_med=median(data)
quant_data=data
quant_data[quant_data>data_med]="High"
quant_data[quant_data<=data_med]="Low"

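I know there are 100 million ways to do this more efficiently, but what worries me is that the output of doing it this way makes no sense. Since there are no NaNs in the set and the two tests together cover every case (> and <=), I should end up with a vector containing only the two labels ("High"/"Low"), but instead I get: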

[1] "High"  "High"  "High"  "High"  "High"  "High"  "High"  "High"  "Low"   "High"  "Low"   "High"  "Low"   "Low"   "Low"   "Low"   "1e-04"
[18] "Low"   "High"  "High"  "High"  "Low"   "Low"   "Low"   "High"  "Low"   "Low"   "Low"   "1e-04" "Low"   "High"  "Low"   "Low"   "High" 
[35] "High"  "Low"   "High"  "High"  "High"  "High"  "High"  "High"  "Low"   "Low"   "Low"   "High"  "High"  "Low"   "Low"   "1e-04" "Low"  
[52] "1e-04" "Low"   "Low"   "High"  "Low"   "Low"   "Low"   "Low"   "Low"   "High"  "High"  "High"  "High"  "High"  "Low"   "Low"   "Low"  
[69] "1e-04" "High"  "High"  "High"  "High"  

See the "1e-04"? Even stranger, let's pick element 69, which is one of the values that comes back odd:

data[69]
>1e-04

If I test this value on its own, I get the result I expect:

data[69]<=data_med
TRUE

Can someone explain this behaviour? It just looks dangerous...

1 Answer

Let's go through what you did here.

data=c(0.6666667, 0.8333, 0.6666667, 0.8333, 0.8333, 0.75, 0.9999, 0.7499667, 0.25, 0.6666667, 0.1667, 0.7499667, 0.5, 0.2500333, 0.3333667, 0.0834, 0.0001, 0.2500333, 0.8333, 0.9999, 0.9999, 0.2500333, 0.2500333, 0.3333667, 0.9166, 0.5, 0.2500333, 0.4166667, 0.0001, 0.1667333, 0.6666333, 0.0834, 0.1667, 0.6666333, 0.9166, 0.1667, 0.7499333, 0.9166, 0.9166, 0.9166, 0.7499667, 0.7499667, 0.4166667, 0.5, 0.2500333, 0.9166, 0.6666667, 0.1667333, 0.25, 0.0001, 0.3333667, 0.0001, 0.25, 0.0834, 0.9999, 0.0834, 0.1667, 0.5, 0.2500333, 0.3333667, 0.9166, 0.9166, 0.8333, 0.9166, 0.75, 0.0834, 0.4166667, 0.5, 0.0001, 0.9999, 0.8333, 0.6666667, 0.9166)



data_med=median(data)  ## 0.5
quant_data=data        ## irrelevant
quant_data[quant_data>data_med]="High"

But by doing this, you have converted quant_data to a character vector:

str(quant_data)
##  chr [1:73] "High" "High" "High" "High" "High" "High" "High" ...
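
A smaller illustration of the same coercion, as a sketch: the vector x and its three values are made up for this example and are not part of your data. Assigning a single string into a numeric vector converts every element to character, and 0.0001 becomes "1e-04" along the way.

```r
# Hypothetical three-element vector, used only for illustration
x <- c(0.0001, 0.25, 0.75)
class(x)              # "numeric"

# Assigning one string forces the whole vector to character
x[x > 0.5] <- "High"
class(x)              # "character"
x                     # "1e-04" "0.25"  "High"
```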

Now the comparison between the character values and data_med makes little sense, because data_med is also coerced to a character value:

"High" <= "0.5"   ## FALSE
"1e-04" <= "0.5"  ## FALSE -- this is your problem.
quant_data[quant_data<=data_med]="Low"
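
To see why only the 0.0001 entries slip through, it may help to compare the strings directly. In this sketch the vector s is a hypothetical stand-in for a few elements of quant_data after the first assignment. Strings are compared character by character, so "1e-04" sorts after "0.5" because "1" comes after "0", whereas values printed in fixed notation such as "0.25" happen to compare as expected.

```r
# Hypothetical stand-in for a few numeric values after coercion to character
s <- as.character(c(0.0001, 0.25, 0.5))
s                    # "1e-04" "0.25"  "0.5"

# The numeric 0.5 is coerced to "0.5", so this is a string comparison
s <= 0.5             # FALSE  TRUE  TRUE
s <= "0.5"           # same result as the line above
```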

What you presumably intended to do (and the reason for the assignment quant_data=data) was:

quant_data[data>data_med]="High"
quant_data[data<=data_med]="Low"
table(quant_data)
## High  Low 
##   35   38 

As @Arun points out in the comments above, quant_data <- ifelse(data>data_med,"High","Low") would also work, as would appropriate use of cut().
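
Both alternatives can be sketched on a small made-up vector (the names x, x_med, by_ifelse and by_cut, and the values, are illustrative only). The cut() call here assumes breaks at -Inf, the median and Inf; with the default right = TRUE the intervals are closed on the right, so values equal to the median fall into "Low", matching the <= test.

```r
# Hypothetical data for illustration; its median is 0.5
x <- c(0.0001, 0.25, 0.5, 0.75, 0.9999)
x_med <- median(x)

# Vectorised test on the untouched numeric vector
by_ifelse <- ifelse(x > x_med, "High", "Low")

# Same classification as a factor: intervals (-Inf, median] and (median, Inf]
by_cut <- cut(x, breaks = c(-Inf, x_med, Inf), labels = c("Low", "High"))

by_ifelse              # "Low"  "Low"  "Low"  "High" "High"
as.character(by_cut)   # same labels as by_ifelse
```

Note that cut() returns a factor rather than a character vector, which is often convenient for table() and modelling functions.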

Answered 2013-04-30T17:56:14.977