
When subsetting data.frames by some conditions, a condition can evaluate to NA if the data frame contains NAs. Those NAs then cause problems in the subsetted data.frame:

# data generation
set.seed(123)
df <- data.frame(a = 1:100, b = sample(c("moon", "venus"), 100, replace = TRUE), c = sample(c('a', 'b', NA), 100, replace = TRUE))

# indexing
with(df, df[a < 30 & b == "moon" & c == "a",])

You get:

      a    b    c
NA   NA <NA> <NA>
10   10 moon    a
12   12 moon    a
NA.1 NA <NA> <NA>
NA.2 NA <NA> <NA>
29   29 moon    a

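This happens because the condition evaluates to a logical vector that contains NAs, and each NA in a row index produces a row filled with NA, as in the result above.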
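
As a minimal sketch of the mechanism (using a made-up three-row data frame d, not the df above): R's logical operators are three-valued, so NA & TRUE stays NA while NA & FALSE collapses to FALSE, and an NA in a logical row index yields an all-NA row.

```r
# Three-valued logic: NA only survives when the other operand is TRUE
NA & TRUE    # NA
NA & FALSE   # FALSE

# Hypothetical toy data, for illustration only
d <- data.frame(x = 1:3, y = c("p", "q", "r"))
idx <- c(TRUE, NA, FALSE)

# The NA in idx does not drop the row; it produces a row of NAs
res <- d[idx, ]
print(res)
```

This is also why only some rows with a missing c show up as NA rows above: wherever a < 30 & b == "moon" is already FALSE, the whole condition is FALSE regardless of c.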

One solution is either of the following fixes:

with(df, df[a < 30 & b == "moon" & (c == "a" & !is.na(c)),])  # exclude NAs
with(df, df[a < 30 & b == "moon" & (c == "a" | is.na(c)),])  # include NAs

But these are very clumsy: imagine you have a long condition such as df[A == x1 & B == x2 & C == x3 & D == x4,] and you have to wrap every element like this: df[(A == x1 | is.na(A)) & (B == x2 | is.na(B)) ...,]

Is there an elegant solution that doesn't require typing this much code at the console when you just want to inspect a data frame?


3 Answers


Well, if you want to omit those NA rows, a quick and dirty solution is to wrap the condition in which:

> with(df, df[a < 30 & b == "moon" & c == "a",])
      a    b    c
NA   NA <NA> <NA>
10   10 moon    a
12   12 moon    a
NA.1 NA <NA> <NA>
NA.2 NA <NA> <NA>
29   29 moon    a
> with(df, df[which(a < 30 & b == "moon" & c == "a"),])
    a    b c
10 10 moon a
12 12 moon a
29 29 moon a
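
To see why this works, here is a minimal sketch on a made-up logical vector (cond is illustrative, not taken from the question): which returns only the positions that are TRUE, so the NA entries never reach the row index.

```r
# Made-up condition result containing an NA
cond <- c(TRUE, NA, FALSE, TRUE)

# which() keeps only the positions of TRUE; NA and FALSE are both dropped
pos <- which(cond)
print(pos)  # 1 4
```

One caveat if you adapt this: negating the result, as in df[-which(cond), ], selects no rows at all when nothing matches, because -integer(0) is still an empty index.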

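Edit: another option, which some may frown upon but which I personally find very useful in cases like this, is to define a local variable inside the brackets: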

> with(df, df[{i<-a < 30 & b == "moon" & c == "a"; i | is.na(i)},])
    a    b    c
6   6 moon <NA>
10 10 moon    a
12 12 moon    a
15 15 moon <NA>
18 18 moon <NA>
29 29 moon    a
> with(df, df[{i<-a < 30 & b == "moon" & c == "a"; i & !is.na(i)},])
    a    b c
10 10 moon a
12 12 moon a
29 29 moon a

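This is more concise than writing a special function or defining the index on a separate line, and it works in many situations where no R function does exactly what you want.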
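
A small sketch of why the trick stays tidy, assuming a made-up data frame d and a hypothetical variable name keep_tmp: the braces evaluate to their last expression, and because the assignment is evaluated inside with(), the temporary variable should not be left behind in your workspace.

```r
# Hypothetical toy data, for illustration only
d <- data.frame(x = c(1, 2, NA, 4))

# { ... } returns the value of its last expression, which becomes the row index
kept <- with(d, d[{keep_tmp <- x > 1; keep_tmp & !is.na(keep_tmp)}, , drop = FALSE])
print(kept)

# keep_tmp was assigned in the temporary environment that with() built from d
leaked <- exists("keep_tmp")
print(leaked)
```

Without with(), the same assignment inside df[...] would create the variable in the calling environment, so the wrapper is doing some of the cleanup here.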

answered 2013-10-17T15:40:18.677

You could use the data.table package. It simplifies the code because you don't have to wrap everything in with(df, ...), and it treats NA in the row condition as FALSE.

require(data.table)
dt <- data.table(df)
dt[a < 30 & b == "moon" & c == "a",] # exclude NAs
dt[a < 30 & b == "moon" & (c == "a"|is.na(c)),] # include NAs
answered 2013-10-17T15:40:29.197

clean <- function(x, include = FALSE){
    x[is.na(x)] <- include
    x
}

# Original output
with(df, df[a < 30 & b == "moon" & c == "a",])
# Clean it up and remove NAs
with(df, df[clean(a < 30 & b == "moon" & c == "a"),])
# Clean it up but include NAs
with(df, df[clean(a < 30 & b == "moon" & c == "a", include = TRUE),])

This gives:

> with(df, df[a < 30 & b == "moon" & c == "a",])
      a    b    c
NA   NA <NA> <NA>
10   10 moon    a
12   12 moon    a
NA.1 NA <NA> <NA>
NA.2 NA <NA> <NA>
29   29 moon    a
> 
> with(df, df[clean(a < 30 & b == "moon" & c == "a"),])
    a    b c
10 10 moon a
12 12 moon a
29 29 moon a
> with(df, df[clean(a < 30 & b == "moon" & c == "a", include = TRUE),])
    a    b    c
6   6 moon <NA>
10 10 moon    a
12 12 moon    a
15 15 moon <NA>
18 18 moon <NA>
29 29 moon    a

Using which also works, but by default it only lets you exclude the NA values.
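
As a quick side-by-side on a made-up condition vector (cond is illustrative only), this sketch shows the difference: which can only drop the NA entries, while the clean helper from this answer lets you choose how they are treated.

```r
# Same helper as above: replace NA in a logical vector with a chosen value
clean <- function(x, include = FALSE) {
    x[is.na(x)] <- include
    x
}

cond <- c(TRUE, NA, FALSE)   # made-up condition result

which(cond)                  # 1: position of the only TRUE, NA dropped
clean(cond)                  # TRUE FALSE FALSE: NA treated as no match
clean(cond, include = TRUE)  # TRUE TRUE FALSE: NA treated as a match
```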

answered 2013-10-17T15:41:12.683