r - Of two rows, eliminate the one with more NAs

Question

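I am looking for a way to check whether two columns of a data frame hold the same values in more than one row, and then eliminate the row that contains more NAs.

Suppose we have a data frame like this: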

x <- data.frame("Year" = c(2017,2017,2017,2018,2018),
            "Country" = c("Sweden", "Sweden", "Norway", "Denmark", "Finland"),
            "Sales" = c(15, 15, 18, 13, 12),
            "Campaigns" = c(3, NA, 4, 1, 1),
            "Employees" = c(15, 15, 12, 8, 9),
            "Satisfaction" = c(0.8, NA, 0.9, 0.95, 0.87),
            "Expenses" = c(NA, NA, 9000, 7500, 4300))

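Note that Sweden appears twice for 2017: the first row has one NA, while the second row has NA in three positions. I want to check whether two rows share the same "Year" and "Country" and, if so, eliminate the row with the higher number of NAs, in this case the second row. I did some research but could not find a solution for this particular case.

Thanks in advance.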

score 3 · Accepted Answer

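We can use a data.table approach:

library(data.table)
# Within each Year/Country group, count the NAs per row across the remaining
# columns (.SD), then collect the row indices (.I) of rows that contain NAs
# and hold the group's maximum NA count.
ind <- setDT(x)[, {
    i1 <- Reduce(`+`, lapply(.SD, is.na))
    .I[i1 > 0 & (i1 == max(i1))]
  }, .(Year, Country)]$V1
x[-ind]  # drop those rows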
#    Year Country Sales Campaigns Employees Satisfaction Expenses
#1: 2017  Sweden    15         3        15         0.80       NA
#2: 2017  Norway    18         4        12         0.90     9000
#3: 2018 Denmark    13         1         8         0.95     7500
#4: 2018 Finland    12         1         9         0.87     4300
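One caveat with the snippet above: because it selects rows to drop, a Year/Country pair that occurs only once but contains an NA is also removed, and rows tied for the maximum are all removed. A minimal sketch of a variant that instead selects the row to keep, assuming the first row with the fewest NAs should win ties (the name `keep` is only illustrative):

```r
library(data.table)

x <- data.table(Year = c(2017, 2017, 2017, 2018, 2018),
                Country = c("Sweden", "Sweden", "Norway", "Denmark", "Finland"),
                Sales = c(15, 15, 18, 13, 12),
                Campaigns = c(3, NA, 4, 1, 1),
                Employees = c(15, 15, 12, 8, 9),
                Satisfaction = c(0.8, NA, 0.9, 0.95, 0.87),
                Expenses = c(NA, NA, 9000, 7500, 4300))

# For each Year/Country group, count NAs per row over the non-key columns (.SD)
# and return the global row index (.I) of the first row with the fewest NAs.
keep <- x[, .I[which.min(Reduce(`+`, lapply(.SD, is.na)))],
          by = .(Year, Country)]$V1

# Subset to the selected rows; every group keeps exactly one row.
x[keep]
```

With the sample data this keeps rows 1, 3, 4 and 5, matching the output shown above.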

score 3 · Accepted Answer

Using dplyr:

library(dplyr)
x %>%
  mutate(n_na = rowSums(is.na(.))) %>%  ## calculate NAs for each row      
  group_by(Year, Country) %>%           ## for each year/country
  arrange(n_na) %>%                       ## sort by number of NAs
  slice(1) %>%                            ## take the first row
  select(-n_na)                           ## remove the NA counter column
# A tibble: 4 x 7
# Groups:   Year, Country [4]
   Year Country Sales Campaigns Employees Satisfaction Expenses
  <dbl>  <fctr> <dbl>     <dbl>     <dbl>        <dbl>    <dbl>
1  2017  Norway    18         4        12         0.90     9000
2  2017  Sweden    15         3        15         0.80       NA
3  2018 Denmark    13         1         8         0.95     7500
4  2018 Finland    12         1         9         0.87     4300
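The sort-then-take-first step can also be written as a single verb. The following is a sketch, assuming dplyr 1.0.0 or later (where `slice_min()` is available); `with_ties = FALSE` keeps exactly one row per group even when several rows have the same NA count. The name `res` is only illustrative.

```r
library(dplyr)

x <- data.frame(Year = c(2017, 2017, 2017, 2018, 2018),
                Country = c("Sweden", "Sweden", "Norway", "Denmark", "Finland"),
                Sales = c(15, 15, 18, 13, 12),
                Campaigns = c(3, NA, 4, 1, 1),
                Employees = c(15, 15, 12, 8, 9),
                Satisfaction = c(0.8, NA, 0.9, 0.95, 0.87),
                Expenses = c(NA, NA, 9000, 7500, 4300))

res <- x %>%
  mutate(n_na = rowSums(is.na(.))) %>%           # NA count per row
  group_by(Year, Country) %>%                    # one group per year/country
  slice_min(n_na, n = 1, with_ties = FALSE) %>%  # keep the most complete row
  ungroup() %>%                                  # drop the grouping
  select(-n_na)                                  # remove the helper column

res
```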

score 1 · Accepted Answer

A base R solution:

x$nas <- rowSums(sapply(x, is.na))
do.call(rbind,
        by(x, x[c("Year","Country")],
           function(df) head(df[order(df$nas),,drop=FALSE], n=1)))
#   Year Country Sales Campaigns Employees Satisfaction Expenses nas
# 4 2018 Denmark    13         1         8         0.95     7500   0
# 5 2018 Finland    12         1         9         0.87     4300   0
# 3 2017  Norway    18         4        12         0.90     9000   0
# 1 2017  Sweden    15         3        15         0.80       NA   1
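The result above keeps the helper column `nas`. As an alternative sketch (not the approach used in this answer), the same selection can be done in base R with `order()` and `duplicated()`, without adding a column to `x`; the names `n_na`, `sorted` and `result` are only illustrative:

```r
x <- data.frame(Year = c(2017, 2017, 2017, 2018, 2018),
                Country = c("Sweden", "Sweden", "Norway", "Denmark", "Finland"),
                Sales = c(15, 15, 18, 13, 12),
                Campaigns = c(3, NA, 4, 1, 1),
                Employees = c(15, 15, 12, 8, 9),
                Satisfaction = c(0.8, NA, 0.9, 0.95, 0.87),
                Expenses = c(NA, NA, 9000, 7500, 4300))

# Count NAs per row without modifying x.
n_na <- rowSums(is.na(x))

# Sort so that, within each Year/Country key, the row with the fewest NAs comes first.
sorted <- x[order(x$Year, x$Country, n_na), ]

# duplicated() flags every repeat of a key after its first occurrence;
# dropping the flagged rows keeps the most complete row per key.
result <- sorted[!duplicated(sorted[c("Year", "Country")]), ]
result
```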

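Unsurprisingly, the data.table implementation is the fastest, though I am a little surprised by how much faster it is than base R. The small size of the dataset may influence this. (In the benchmark I had to create a copy of the original, because data.table modifies the data in place, so x would no longer be a data.frame.)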

library(microbenchmark)
microbenchmark(
  data.table = {
    x0 <- copy(x)
    ind <-  setDT(x0)[,  {
      i1 <- Reduce(`+`, lapply(.SD, is.na))
      .I[i1 > 0 & (i1 == max(i1))]
    }, .(Year, Country)]$V1
    x0[-ind]
  },
  dplyr = {
    x %>%
      mutate(n_na = rowSums(is.na(.))) %>%  ## calculate NAs for each row      
      group_by(Year, Country) %>%           ## for each year/country
      arrange(n_na) %>%                       ## sort by number of NAs
      slice(1) %>%                            ## take the first row
      select(-n_na)                           ## remove the NA counter column
  },
  base = {
    x0 <- x
    x0$nas <- rowSums(sapply(x0, is.na))
    do.call(rbind,
            by(x0, x0[c("Year","Country")],
               function(df) head(df[order(df$nas),,drop=FALSE], n=1)))
  }
)
# Unit: milliseconds
#        expr      min       lq     mean   median       uq       max neval
#  data.table 1.223477 1.441005 1.973714 1.582861 1.919090 12.837569   100
#       dplyr 2.675239 2.901882 4.465172 3.079295 3.806453 42.261540   100
#        base 2.039615 2.209187 2.737758 2.298714 2.570760  8.586946   100
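The small-dataset caveat can be probed with a larger input. The sketch below builds a purely synthetic data frame (the sizes, column values and the names `big` and `dedupe_by` are assumptions made for illustration) and times a function modeled on the base R approach using `system.time()`; no timings are claimed here, since they depend on the machine.

```r
# Hypothetical larger input: 2,000 rows spread over 10 x 20 Year/Country keys.
set.seed(42)
n <- 2000
big <- data.frame(
  Year      = sample(2001:2010, n, replace = TRUE),
  Country   = sample(sprintf("C%02d", 1:20), n, replace = TRUE),
  Sales     = sample(c(NA, 1:20), n, replace = TRUE),
  Campaigns = sample(c(NA, 1:5), n, replace = TRUE),
  Expenses  = sample(c(NA, 1000, 5000, 9000), n, replace = TRUE)
)

# Modeled on the base R approach: count NAs per row, then keep the row
# with the fewest NAs within each Year/Country group.
dedupe_by <- function(df) {
  df$nas <- rowSums(is.na(df))
  do.call(rbind,
          by(df, df[c("Year", "Country")],
             function(d) head(d[order(d$nas), , drop = FALSE], n = 1)))
}

timing <- system.time(res <- dedupe_by(big))
print(timing)
```

The same `big` frame can be passed to the data.table and dplyr versions to see whether the ranking holds at this scale.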
