statistics - Why can variables with low variance be removed from a dataset?

Asked 2021-02-03T18:36:12.127

Viewed 48 times

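A common practice in data analysis is to remove features (independent variables) with low variance in order to reduce dimensionality. The rationale is that a feature with low variance cannot explain much of the variance in the response (dependent variable).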
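To make the practice concrete, here is a minimal sketch of such a variance filter. The data frame `features`, its columns, and the cutoff `threshold` are hypothetical and chosen only for illustration; they are not taken from any particular library or dataset.

```r
# Hypothetical feature table; names and values are illustrative only.
features <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(0.50, 0.51, 0.49, 0.50, 0.51, 0.49),
  x3 = c(5, 5, 5, 5, 5, 5)
)

# Arbitrary cutoff, picked only for this sketch.
threshold <- 0.01

# Sample variance of every column.
variances <- sapply(features, var)

# Keep only the columns whose variance exceeds the cutoff.
kept <- features[, variances > threshold, drop = FALSE]
names(kept)
```

With this cutoff only `x1` survives the filter. The cutoff is applied to raw variances, so the result depends on the units in which each column happens to be measured.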

However, I do not fully understand this reasoning. Here is a counterexample (in R syntax):

 > independent_variable <- c(100000, 100000.01, 100000.02, 100000.03, 100000.04, 100000.05 )
 > dependent_variable  <- c(1,2,3,4,5,6)
 > cor(independent_variable , dependent_variable)
 [1] 1          # Pearson's correlation = 1
 > var(independent_variable )
 [1] 0.00035     
 > var(dependent_variable)
 [1] 3.5        # low variance of independent variable compared to dependent variable
 > var(independent_variable/mean(independent_variable))
 [1] 3.499998e-14   # very low variance
 > var(dependent_variable/mean(dependent_variable))
 [1] 0.2857143  # variance of scaled variables with mean=1

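What I am trying to show with this example is a case where the dependent and independent variables have a correlation of 1, i.e. the independent variable explains 100% of the variance of the dependent variable. Yet the variance of the independent variable is far lower than that of the other variable (here, the dependent variable), both on the original scale and after dividing each variable by its mean. By the reasoning above, it would therefore be removed.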
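A short sketch that pushes the example one step further, assuming the same two vectors as above. The shift by 100000 and the factor of 100 are arbitrary choices made only for illustration. It uses base R's `var()`, `cor()`, and `scale()`, and relies on the fact that in a simple linear regression the squared Pearson correlation equals R-squared.

```r
# Same vectors as in the counterexample above.
independent_variable <- c(100000, 100000.01, 100000.02,
                          100000.03, 100000.04, 100000.05)
dependent_variable   <- c(1, 2, 3, 4, 5, 6)

# The ratio of the raw variances depends entirely on the units used.
var(independent_variable) / var(dependent_variable)

# Shifting and rescaling the predictor changes its variance by orders
# of magnitude but leaves the correlation with the response unchanged.
rescaled <- (independent_variable - 100000) * 100
var(rescaled)
cor(rescaled, dependent_variable)

# After z-scoring with scale(), both variables have variance 1,
# so a variance threshold can no longer tell them apart.
z_independent <- as.vector(scale(independent_variable))
z_dependent   <- as.vector(scale(dependent_variable))
var(z_independent)
var(z_dependent)

# Squared correlation, i.e. the share of variance explained
# in a simple linear regression of one variable on the other.
r_squared <- cor(independent_variable, dependent_variable)^2
r_squared
```

So the variance of the predictor can be made as large or as small as I like by changing its units, while the correlation, and with it the share of explained variance, stays the same.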

What am I missing here?
