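I have a data.frame with 10 different columns (each column has the same length). I want to eliminate any column in which NA values make up more than 15% of the column length.
Do I first need to create a function that calculates the percentage of NA values in each column, and then create another data.frame where I apply that function? What is the best way to do this?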
First, it's always good to share some sample data. It doesn't need to be your actual data--something made up is fine.
set.seed(1)
x <- rnorm(1000)
x[sample(1000, 150)] <- NA
mydf <- data.frame(matrix(x, ncol = 10))
Second, you can just use built-in functions to get what you need. Here, is.na(mydf) does a logical check and returns a logical matrix of TRUE and FALSE values. Since TRUE and FALSE equate to 1 and 0, we can use colMeans to get the proportion of TRUE (that is, NA) values in each column. That, in turn, can be checked against your stipulation; in this case, which columns have more than 15% NA values?
colMeans(is.na(mydf)) > .15
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
As we can see, we should drop X1, X2, X6, X8, and X9. Again, taking advantage of logical vectors, here's how:
> final <- mydf[, colMeans(is.na(mydf)) <= .15]
> dim(final)
[1] 100 5
> names(final)
[1] "X3" "X4" "X5" "X7" "X10"
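Since the question asked about writing a function, the same idea can be wrapped up for reuse. This is only a minimal sketch: the name drop_na_cols, its threshold argument, and the toy data are made up for illustration and are not part of the answer above.

```r
# Hypothetical helper: keep only the columns whose fraction of NA values
# is at most `threshold`.
drop_na_cols <- function(df, threshold = 0.15) {
  keep <- colMeans(is.na(df)) <= threshold
  # drop = FALSE keeps the result a data.frame even if one column survives
  df[, keep, drop = FALSE]
}

# Made-up example: column b is 50% NA, column c is 25% NA.
toy <- data.frame(a = c(1, 2, 3, 4),
                  b = c(NA, NA, 1, 2),
                  c = c(1, NA, 3, 4))
result <- drop_na_cols(toy, threshold = 0.25)
names(result)  # "a" "c"
```

The drop = FALSE argument matters when only one column passes the check; without it, the single-column subset would be simplified to a plain vector.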
You can do it with data.table like this.
Load the data into a data.table and call it DT. Assume columns 2 to 4 are numeric.
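library(data.table)
Theta <- 0.15  # maximum allowed fraction of NA values per column
# One-row data.table: TRUE for each column whose NA fraction exceeds Theta
Drop <- DT[, lapply(.SD, function(x) sum(is.na(x)) / length(x) > Theta), .SDcols = 2:4]
Cols.2.Drop <- names(Drop)[unlist(Drop)]
# Remove the flagged columns by reference
DT[, (Cols.2.Drop) := NULL]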
Test it with this data:
Obs Var1 Var2 Var3
A0001 21 21 21
A0002 21 78 321
A0003 32 98 87
A0004 21 12 54
A0005 21 13 45
A0006 21 87 45
B0007 84 NA 45
B0008 21 NA 98
B0009 2 NA 45
B0010 12 NA 45
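To try the data.table approach on the test data above, something like the following should work. It is a sketch that assumes the data.table package is installed and that the table is typed in by hand rather than read from a file; the column types are my assumption.

```r
library(data.table)

# Test data from above, entered by hand (Obs is A0001..A0006, B0007..B0010).
DT <- data.table(
  Obs  = c(sprintf("A%04d", 1:6), sprintf("B%04d", 7:10)),
  Var1 = c(21, 21, 32, 21, 21, 21, 84, 21, 2, 12),
  Var2 = c(21, 78, 98, 12, 13, 87, NA, NA, NA, NA),
  Var3 = c(21, 321, 87, 54, 45, 45, 45, 98, 45, 45)
)

Theta <- 0.15
# Flag columns 2 to 4 whose NA fraction exceeds Theta
Drop <- DT[, lapply(.SD, function(x) sum(is.na(x)) / length(x) > Theta), .SDcols = 2:4]
Cols.2.Drop <- names(Drop)[unlist(Drop)]
# Delete the flagged columns by reference
DT[, (Cols.2.Drop) := NULL]

names(DT)  # "Obs" "Var1" "Var3"
```

Var2 has 4 NA values out of 10 (40%), which is above the 15% threshold, so it is the only column removed.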