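I have a data.frame with 10 different columns (each column has the same length). I want to eliminate any column in which the number of NA values is greater than 15% of the column length.

Do I need to first write a function that calculates the percentage of NA values in each column, and then create another data.frame to which I apply that function? What is the best way to do this?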

2 Answers

First, it's always good to share some sample data. It doesn't need to be your actual data--something made up is fine.

set.seed(1)
x <- rnorm(1000)
x[sample(1000, 150)] <- NA
mydf <- data.frame(matrix(x, ncol = 10))

Second, you can use built-in functions to get what you need. Here, is.na(mydf) performs a logical check and returns a logical matrix of TRUE and FALSE values. Since TRUE and FALSE equate to 1 and 0, colMeans gives the proportion of TRUE (that is, NA) values in each column. That, in turn, can be checked against your condition: which columns have more than 15% NA values?

colMeans(is.na(mydf)) > .15
#    X1    X2    X3    X4    X5    X6    X7    X8    X9   X10 
#  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
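
To make the mechanics easier to follow, here is a minimal sketch on a small made-up data frame. The names `toy`, `na_flags`, `na_share`, and `na_share_manual` are illustrative only and do not come from the question. It compares the colMeans shortcut with an explicit per-column calculation.

```r
# Hypothetical toy data, not taken from the question
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c(NA, NA, 3, 4),
                  c = c(1, 2, 3, 4))

# is.na() on a data.frame returns a logical matrix of the same shape
na_flags <- is.na(toy)

# TRUE counts as 1 and FALSE as 0, so each column mean is the share of NA values
na_share <- colMeans(na_flags)

# The same quantity computed column by column, for comparison
na_share_manual <- sapply(toy, function(col) sum(is.na(col)) / length(col))

print(na_share)
print(all.equal(na_share, na_share_manual))
```

Both approaches should agree; colMeans is simply the shorter way to write it.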

As we can see, we should drop X1, X2, X6, X8, and X9. Again, taking advantage of logical vectors, here's how:

# drop = FALSE keeps a data.frame even if only one column survives
final <- mydf[, colMeans(is.na(mydf)) <= .15, drop = FALSE]
dim(final)
# [1] 100   5
names(final)
# [1] "X3"  "X4"  "X5"  "X7"  "X10"
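
As for the question of whether a separate function is needed: it is not required, but if the filter will be reused it could be wrapped up along these lines. This is only a sketch; the function name `drop_sparse_columns`, its `max_na` argument, and the `toy` data are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical helper: keep only columns whose NA share is at most max_na
drop_sparse_columns <- function(df, max_na = 0.15) {
  keep <- colMeans(is.na(df)) <= max_na
  # drop = FALSE guarantees a data.frame even when a single column remains
  df[, keep, drop = FALSE]
}

# Made-up example data with 20 rows
toy <- data.frame(
  mostly_full = c(NA, NA, 1:18),      # 10% NA, should be kept
  too_sparse  = c(rep(NA, 4), 1:16),  # 20% NA, should be dropped
  complete    = 1:20                  # 0% NA, should be kept
)

cleaned <- drop_sparse_columns(toy)
print(names(cleaned))
```

The threshold is an argument so the same helper can be reused with a cutoff other than 15%.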
answered 2013-03-21T17:25:45.193

You can do this with data.table as follows.

Load the data into a data.table and call it DT. Assume columns 2 through 4 are numeric.

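# Maximum tolerated share of NA values per column
Theta <- 0.15
# One-row data.table: TRUE where a column's NA share exceeds Theta
Drop <- DT[, lapply(.SD, function(x) mean(is.na(x)) > Theta), .SDcols = 2:4]
# Names of the flagged columns
Cols.2.Drop <- names(Drop)[unlist(Drop)]
# Remove the flagged columns from DT by reference
DT[, (Cols.2.Drop) := NULL]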

Tested with this data:

Obs    Var1  Var2  Var3
A0001    21    21    21
A0002    21    78   321
A0003    32    98    87
A0004    21    12    54
A0005    21    13    45
A0006    21    87    45
B0007    84    NA    45
B0008    21    NA    98
B0009     2    NA    45
B0010    12    NA    45
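
For a self-contained run, the test data above can be typed in directly. The sketch below assumes the data.table package is installed; the intermediate name `na_share` is introduced here for illustration and does not appear in the answer. Var2 has 4 NA values out of 10 (40%), so it is the only column expected to exceed the 15% threshold.

```r
library(data.table)  # assumes the data.table package is installed

# The test data from the answer, entered by hand
DT <- data.table(
  Obs  = c("A0001", "A0002", "A0003", "A0004", "A0005",
           "A0006", "B0007", "B0008", "B0009", "B0010"),
  Var1 = c(21, 21, 32, 21, 21, 21, 84, 21, 2, 12),
  Var2 = c(21, 78, 98, 12, 13, 87, NA, NA, NA, NA),
  Var3 = c(21, 321, 87, 54, 45, 45, 45, 98, 45, 45)
)

Theta <- 0.15

# Named vector with the NA share of each numeric column (columns 2 to 4)
na_share <- unlist(DT[, lapply(.SD, function(x) mean(is.na(x))), .SDcols = 2:4])

# Columns whose NA share exceeds the threshold
Cols.2.Drop <- names(na_share)[na_share > Theta]

# Remove them by reference; skipped when nothing needs dropping
if (length(Cols.2.Drop) > 0) DT[, (Cols.2.Drop) := NULL]

print(names(DT))
```

Because `:=` modifies DT in place, take a copy with `copy(DT)` first if the original table is still needed.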
answered 2016-09-15T10:39:47.853