r - Elegant way to report missing values in a data.frame

Question

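Here is a small piece of code I wrote to report the variables in a data frame that have missing values. I am trying to think of a more elegant way to do this, perhaps one that returns a data.frame, but I am stuck: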

for (Var in names(airquality)) {
    missing <- sum(is.na(airquality[,Var]))
    if (missing > 0) {
        print(c(Var,missing))
    }
}

Edit: I am dealing with data.frames that have dozens to hundreds of variables, so it is key that we report only the variables with missing values.

score 164 · Accepted Answer

Just use sapply

> sapply(airquality, function(x) sum(is.na(x)))
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0

You could also use apply or colSums on the matrix created by is.na()

> apply(is.na(airquality),2,sum)
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0
> colSums(is.na(airquality))
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0
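
The question asks for a data.frame that lists only the variables with missing values. A minimal sketch of one way to get there from colSums follows; the names na_counts and na_report are illustrative and not part of the original answer.

```r
# Count NAs per column (same call as above).
na_counts <- colSums(is.na(airquality))

# Turn the named vector into a two-column data frame.
na_report <- data.frame(
  variable  = names(na_counts),
  n_missing = unname(na_counts)
)

# Keep only the variables that actually have missing values.
na_report <- na_report[na_report$n_missing > 0, ]

# Most-affected variables first, with clean row names.
na_report <- na_report[order(-na_report$n_missing), ]
rownames(na_report) <- NULL
na_report
```

For airquality this leaves two rows, Ozone (37) and Solar.R (7), matching the counts shown above.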

score 9 · Accepted Answer

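My new favourite for (not too wide) data is the set of methods from the excellent naniar package. You get not only the frequencies but also the patterns of missingness: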

library(naniar)
library(UpSetR)

riskfactors %>%
  as_shadow_upset() %>%
  upset()

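It is often useful to see where the missing values sit in relation to the non-missing ones, which can be done by plotting a scatter plot that includes the missing values: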

library(ggplot2)  # ggplot() and aes() come from ggplot2

ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_miss_point()

Or for categorical variables:

gg_miss_fct(x = riskfactors, fct = marital)

These examples come from the package vignette, which lists other interesting visualizations.

score 8 · Accepted Answer

We can use map_df from purrr.

library(mice)
library(purrr)

# map_df with purrr
map_df(airquality, function(x) sum(is.na(x)))
# A tibble: 1 × 6
# Ozone Solar.R  Wind  Temp Month   Day
# <int>   <int> <int> <int> <int> <int>
# 1    37       7     0     0     0     0

score 6 · Accepted Answer

summary(airquality)

already gives you this information.

The VIM package also offers some nice missing-data plots for data.frames:

library("VIM")
aggr(airquality)

score 4 · Accepted Answer

More succinct: sum(is.na(x[1]))

That is:

x[1] looks at the first column
is.na() is TRUE where the value is NA
sum() counts TRUE as 1 and FALSE as 0
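
The three steps above can be traced on a concrete data frame. This is a small sketch that assumes x is airquality (the answer leaves x unspecified); na_flags is an illustrative name.

```r
x <- airquality            # assumption: any data frame works; airquality ships with R

na_flags <- is.na(x[1])    # logical matrix: TRUE where column 1 (Ozone) is NA
sum(na_flags)              # TRUE counts as 1, FALSE as 0

# Selecting the column by name instead of by position gives the same count.
sum(is.na(x[["Ozone"]]))
```

Both calls should agree with the Ozone count reported by the sapply answer.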

score 4 · Accepted Answer

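Another graphical alternative is the plot_missing function from the excellent DataExplorer package.

The docs also note that you can save this result for later use with missing_data <- plot_missing(data).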

score 2 · Accepted Answer

Another function that can help you look at missing data is df_status from the funModeling library.

library(funModeling)

iris.2 is the iris dataset with some NAs added. You can replace it with your own dataset.

df_status(iris.2)

This gives you the number and percentage of NAs in each column.

score 2 · Accepted Answer

For another graphical solution, the visdat package offers vis_miss.

library(visdat)
vis_miss(airquality)

The output is very similar to Amelia's, with the small difference that it shows the percentage of missing values out of the box.

score 1 · Accepted Answer

Another graphical and interactive way is to use the is.na10 function from the heatmaply library:

library(heatmaply)

heatmaply(is.na10(airquality), grid_gap = 1,
          showticklabels = c(TRUE, FALSE),
          k_col = 3, k_row = 3,
          margins = c(55, 30),
          colors = c("grey80", "grey20"))

It probably won't work well with large datasets.

score 1 · Accepted Answer

I think the Amelia library does a nice job of handling missing data, and it also includes a map for visualizing the missing rows.

install.packages("Amelia")
library(Amelia)
missmap(airquality)

You can also run the code below, which returns a logical vector indicating which rows contain an NA:

row.has.na <- apply(training, 1, function(x){any(is.na(x))})
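
The training data frame in that line is not defined in the answer. A sketch of the same idea, assuming airquality as a stand-in, shows how the resulting logical vector can be used; row_has_na is an illustrative name.

```r
# Assumption: airquality stands in for the `training` data frame above.
row_has_na <- apply(airquality, 1, function(x) any(is.na(x)))

sum(row_has_na)                  # number of rows with at least one NA
head(airquality[row_has_na, ])   # inspect some of the affected rows

# complete.cases() flags the opposite condition without looping over rows.
identical(unname(row_has_na), !complete.cases(airquality))
```

On wide or mixed-type data frames, complete.cases() is usually the safer choice, because apply() first converts the data frame to a matrix.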

score 1 · Accepted Answer

A dplyr solution to get the count could be:

summarise_all(df, ~sum(is.na(.)))

Or to get a percentage:

summarise_all(df, ~(sum(is_missing(.) / nrow(df))))

Maybe also worth noting that missing data can be ugly, inconsistent, and not always coded as NA depending on the source or how it's handled when imported. The following function could be tweaked depending on your data and what you want to consider missing:

is_missing <- function(x){
  missing_strs <- c('', 'null', 'na', 'nan', 'inf', '-inf', '-9', 'unknown', 'missing')
  ifelse((is.na(x) | is.nan(x) | is.infinite(x)), TRUE,
         ifelse(trimws(tolower(x)) %in% missing_strs, TRUE, FALSE))
}

# sample ugly data
df <- data.frame(a = c(NA, '1', '  ', 'missing'),
                 b = c(0, 2, NaN, 4),
                 c = c('NA', 'b', '-9', 'null'),
                 d = 1:4,
                 e = c(1, Inf, -Inf, 0))

# counts:
> summarise_all(df, ~sum(is_missing(.)))
  a b c d e
1 3 1 3 0 2

# percentage:
> summarise_all(df, ~(sum(is_missing(.) / nrow(df))))
     a    b    c d   e
1 0.75 0.25 0.75 0 0.5

score 0 · Accepted Answer

If you want to do this for one particular column, you can also use this:

length(which(is.na(airquality[1])==T))
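
The same count can be written more directly, and it extends to a chosen set of columns. This is a sketch only; the cols vector is a hypothetical selection, not something from the answer.

```r
# Hypothetical selection; replace with the columns you care about.
cols <- c("Ozone", "Solar.R")

# One count per selected column, returned as a named vector.
selected_counts <- colSums(is.na(airquality[cols]))
selected_counts

# A single column, selected by name rather than by position.
sum(is.na(airquality$Ozone))
```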

score 0 · Accepted Answer

The prepare_missing_values_graph function from the ExPanDaR package can be used to explore panel data.

Answered 2019-04-05T04:11:09.633

score 0 · Accepted Answer

With pipes, you can write:

# Counts 
df %>% is.na() %>% colSums()

# % of missing rounded to 2 decimals 
df %>% summarise_all(.funs = ~round(100*sum(is.na(.))/length(.),2))
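
If dplyr is not available, the percentage can also be computed in base R, because the mean of a logical vector is the share of TRUE values. This sketch uses airquality in place of df; pct_missing is an illustrative name.

```r
# Base R: share of missing values per column, in percent, rounded to 2 decimals.
pct_missing <- round(100 * colMeans(is.na(airquality)), 2)
pct_missing

# Keep only the columns that actually have missing values.
pct_missing[pct_missing > 0]
```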
