r - Identifying and removing outliers from PCA and QQ plots

Question

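I have a 132 x 107 dataset consisting of 2 patient types: 33 of patient type 1 and 99 of patient type 2.

I am looking for outliers, so I ran a PCA on the dataset and made QQ plots of the first 4 components using the following commands: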

pca = prcomp(data, scale. = TRUE)
plot(pca$x, pch = 20, col = c(rep("red", 33), rep("blue", 99)))

When I make a QQ plot of the second component using:

qqPlot(pca$x[,2],pch = 20, col = c(rep("red", 33), rep("blue", 99)))

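the plot below shows 2 clear outliers: the red points in the bottom-left corner, which belong to patient type 1.

[Figure: QQ plot]

Is there a straightforward way to work out the indices of these points in the data so that they can be removed?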

score 8 · Accepted Answer

As far as I can tell, identify() is not supported by qqPlot() from the car package.

Let's take a look at a PCA of the USArrests data...

pca <- prcomp(USArrests)

The plot of this using qqPlot is easy enough.

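library(car)
# Color vector copied from the question; USArrests has only 50 rows, so only the first 50 colors are used
qqPlot(pca$x[,2],pch = 20, col = c(rep("red", 33), rep("blue", 99)))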

However, qqPlot() does not allow for point selection via identify().

identify(qqPlot(pca$x[,2],pch = 20, col = c(rep("red", 33), rep("blue", 99))))
# numeric(0)

You can, however, make use of qqnorm() in the stats package.

identify(qqnorm(pca$x[,2],pch = 20, col = c(rep("red", 33), rep("blue", 99))))

This will produce a less sophisticated graph, but you should be able to add a line and confidence intervals manually via qqline() (also in stats) and a little more math.
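
A non-interactive variant of the same idea, as a minimal sketch: qqnorm() returns the theoretical and sample quantiles in the original observation order, so the points farthest from the qqline() reference line can be picked out by index. Flagging the 2 largest deviations is an assumption made for illustration, as are the names `qq`, `dev`, and `outlier_idx`.

```r
pca <- prcomp(USArrests)
scores <- pca$x[, 2]

# qqnorm() returns list(x = theoretical, y = sample) in the original order
qq <- qqnorm(scores, pch = 20)
qqline(scores)

# Rebuild the line that qqline() draws by default (through the 1st and 3rd quartiles)
probs <- c(0.25, 0.75)
q_sample <- quantile(scores, probs, names = FALSE)
q_theory <- qnorm(probs)
slope <- diff(q_sample) / diff(q_theory)
intercept <- q_sample[1] - slope * q_theory[1]

# Vertical distance of each point from the reference line
dev <- abs(qq$y - (intercept + slope * qq$x))

# Assumption: treat the 2 points farthest from the line as candidate outliers
outlier_idx <- order(dev, decreasing = TRUE)[1:2]
names(scores)[outlier_idx]
```

Because the indices refer to the original row order, they can be used directly to subset the data that went into prcomp().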

score 4 · Accepted Answer

You can try the identify method in R. Typically, run

identify(qqPlot(pca$x[,2],pch = 20, col = c(rep("red", 33), rep("blue", 99))))

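and left-click the points you want to identify. The index of a point in the score vector should be the same as in the original data.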
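
Once the indices are known, removing the points is plain row subsetting, because the rows of pca$x are in the same order as the rows of the input data. The following is a sketch using USArrests as a stand-in for the real data; the values in `outlier_idx` are placeholders, not actual outliers, and `dat`, `dat_clean`, and `pca_clean` are names chosen for illustration.

```r
dat <- USArrests
pca <- prcomp(dat, scale. = TRUE)

# Rows of the score matrix line up with rows of the input data
stopifnot(identical(rownames(pca$x), rownames(dat)))

# Placeholder indices; in practice use the values returned by identify()
outlier_idx <- c(2, 33)

# Drop those rows, guarding against an empty selection, then refit the PCA
dat_clean <- if (length(outlier_idx) > 0) dat[-outlier_idx, ] else dat
pca_clean <- prcomp(dat_clean, scale. = TRUE)
dim(dat_clean)
```

The length check matters because `dat[-integer(0), ]` selects zero rows rather than all of them, so an empty identify() result would otherwise discard the whole dataset.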

score 2 · Accepted Answer

You can also visualize influence using the fviz_pca_ind() function from the factoextra library, as follows:

require(factoextra)
pca = prcomp(mydata)
fviz_pca_ind(pca,
         col.ind = "contrib", # Color by contribution
         gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07") #assign gradient
         )

This automatically labels the individuals and colors them according to their influence.
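
To get at the numbers behind such a plot without extra packages, the contributions can be approximated in base R. This sketch assumes a common definition of contribution, namely each individual's squared score as a percentage of the component's total squared scores; the values factoextra reports may be scaled slightly differently. USArrests stands in for the real data, and `sq` and `contrib` are illustrative names.

```r
pca <- prcomp(USArrests, scale. = TRUE)

# Squared scores on the first two components
sq <- pca$x[, 1:2]^2

# Percent contribution of each individual to each component (columns sum to 100)
contrib <- sweep(sq, 2, colSums(sq), "/") * 100

# Individuals with the largest combined contribution to PC1 and PC2
head(sort(rowSums(contrib), decreasing = TRUE), 5)
```

Individuals at the top of this list are the ones pulling hardest on the first two components and are reasonable candidates to inspect as outliers.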
