
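I have a large database made up of mixed data types (numeric, character, factor, ordinal factor) with missing values, and I am trying to write a for loop that replaces each missing value with the mean of its column if the column is numeric, or with the mode if it is character/factor.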

Here is what I have so far:

#fake array:
age<- c(5,8,10,12,NA)
a <- factor(c("aa", "bb", NA, "cc", "cc"))
b <- c("banana", "apple", "pear", "grape", NA)
df_test <- data.frame(age=age, a=a, b=b)
df_test$b <- as.character(df_test$b)

for (var in 1:ncol(df_test)) {
    if (class(df_test[,var])=="numeric") {
        df_test[is.na(df_test[,var]) <- mean(df_test[,var], na.rm = TRUE)
} else if (class(df_test[,var]=="character") {
        Mode(df_test$var[is.na(df_test$var)], na.rm = TRUE)
} 
}

where "Mode" is the function:

Mode <- function (x, na.rm) {
    xtab <- table(x)
    xmode <- names(which(xtab == max(xtab)))
    if (length(xmode) > 1)
        xmode <- ">1 mode"
    return(xmode)
}

It seems to just ignore the statements, without giving any error... I have also tried to handle the first part with indexing:

## create an index of missing values
index <- which(is.na(df_test)[,1], arr.ind = TRUE)
## calculate the row means and "duplicate" them to assign to appropriate cells
df_test[index] <- colMeans(df_test, na.rm = TRUE) [index["column",]]

But I get this error: "Error in colMeans(df_test, na.rm = TRUE) : 'x' must be numeric"

Does anyone know how to solve this?

Thanks a lot for all the help! -F

2 Answers


If you simply remove the obvious errors, it works as intended:

Mode <- function (x, na.rm) {
    xtab <- table(x)
    xmode <- names(which(xtab == max(xtab)))
    if (length(xmode) > 1) xmode <- ">1 mode"
    return(xmode)
}

# fake array:
age <- c(5, 8, 10, 12, NA)
a <- factor(c("aa", "bb", NA, "cc", "cc"))
b <- c("banana", "apple", "pear", "grape", NA)
df_test <- data.frame(age=age, a=a, b=b)
df_test$b <- as.character(df_test$b)

print(df_test)

#   age    a      b
# 1   5   aa banana
# 2   8   bb  apple
# 3  10 <NA>   pear
# 4  12   cc  grape
# 5  NA   cc   <NA>

for (var in 1:ncol(df_test)) {
    if (class(df_test[,var])=="numeric") {
        df_test[is.na(df_test[,var]),var] <- mean(df_test[,var], na.rm = TRUE)
    } else if (class(df_test[,var]) %in% c("character", "factor")) {
        df_test[is.na(df_test[,var]),var] <- Mode(df_test[,var], na.rm = TRUE)
    }
}

print(df_test)

#     age  a       b
# 1  5.00 aa  banana
# 2  8.00 bb   apple
# 3 10.00 cc    pear
# 4 12.00 cc   grape
# 5  8.75 cc >1 mode

I recommend using an editor with syntax highlighting and bracket matching, which makes these kinds of syntax errors much easier to find.
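One caveat with the `class()` comparisons in the loop above: the question mentions ordinal factors, and an ordered factor has two class strings (`"ordered"` and `"factor"`), so the `if` condition has length two, which is an error from R 4.2.0 onward. A minimal sketch that tests types with `is.numeric()`, `is.factor()` and `is.character()` instead. The helpers `mode_value` and `impute_column` and the extra `ord` column are illustrative assumptions, not part of the original code, and ties are resolved by taking the first entry in `table()` order rather than returning ">1 mode".

```r
# Hypothetical helper: most frequent non-missing value.
# table() drops NA by default; on ties the first entry in table order wins.
mode_value <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Hypothetical helper: impute one column according to its type.
impute_column <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else if (is.factor(x) || is.character(x)) {
    # is.factor() is TRUE for ordered factors as well
    x[is.na(x)] <- mode_value(x)
  }
  x
}

# Example data from the question, plus an assumed ordered factor column
df_test <- data.frame(
  age = c(5, 8, 10, 12, NA),
  a   = factor(c("aa", "bb", NA, "cc", "cc")),
  b   = c("banana", "apple", "pear", "grape", NA),
  ord = factor(c("low", "high", NA, "low", "low"),
               levels = c("low", "high"), ordered = TRUE),
  stringsAsFactors = FALSE
)

class(df_test$ord)  # "ordered" "factor"

# Assigning into df_test[] keeps the data.frame structure
df_test[] <- lapply(df_test, impute_column)
print(df_test)
```

Because every value in `b` occurs once, the tie is broken alphabetically here; swap in the `Mode` function above if an explicit ">1 mode" marker is preferred for character columns.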

answered 2011-10-11T23:25:10.527

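First, you need to write a mode function that accounts for missing values in categorical data, which here are strings of length < 1 (empty strings).
Mode function: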

getmode <- function(v){
  v=v[nchar(as.character(v))>0]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
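To make the behaviour of `getmode` concrete, here is a small sketch on made-up vectors (`fruit` and `grade` are illustrative names, not from the original data). It shows that empty strings are dropped before counting, so they can never be returned as the mode even when they are the most frequent entry.

```r
# Same function as above, repeated so the sketch runs on its own
getmode <- function(v) {
  v <- v[nchar(as.character(v)) > 0]           # drop empty strings
  uniqv <- unique(v)                           # distinct remaining values
  uniqv[which.max(tabulate(match(v, uniqv)))]  # most frequent one
}

# "" occurs three times but is filtered out, so "pear" (twice) wins
fruit <- c("pear", "", "apple", "pear", "", "")
fruit_mode <- getmode(fruit)
print(fruit_mode)

# Works on factors too; the result is a length-one factor
grade <- factor(c("A", "B", "A", "", "B", "A"))
grade_mode <- getmode(grade)
print(as.character(grade_mode))
```

Note that the filter targets empty strings only; genuine `NA` entries are a separate case and would need explicit `is.na()` handling.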

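Then you can iterate over the columns: if a column is numeric, fill its missing values with the mean; otherwise, fill them with the mode.

The loop is shown below (it relies on dplyr and rlang, which are loaded by library(tidyverse) in the example further down):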

for (cols in colnames(df)) {
  if (cols %in% names(df[,sapply(df, is.numeric)])) {
    df<-df%>%mutate(!!cols := replace(!!rlang::sym(cols), is.na(!!rlang::sym(cols)), mean(!!rlang::sym(cols), na.rm=TRUE)))

  }
  else {

    df<-df%>%mutate(!!cols := replace(!!rlang::sym(cols), !!rlang::sym(cols)=="", getmode(!!rlang::sym(cols))))

  }
}

Let's take an example:

library(tidyverse)

df<-tibble(id=seq(1,10), ColumnA=c(10,9,8,7,NA,NA,20,15,12,NA), 
           ColumnB=factor(c("A","B","A","A","","B","A","B","","A")),
           ColumnC=factor(c("","BB","CC","BB","BB","CC","AA","BB","","AA")),
           ColumnD=c(NA,20,18,22,18,17,19,NA,17,23)
           )

df

The initial df with missing values:

# A tibble: 10 x 5
      id ColumnA ColumnB ColumnC ColumnD
   <int>   <dbl> <fct>   <fct>     <dbl>
 1     1      10 "A"     ""           NA
 2     2       9 "B"     "BB"         20
 3     3       8 "A"     "CC"         18
 4     4       7 "A"     "BB"         22
 5     5      NA ""      "BB"         18
 6     6      NA "B"     "CC"         17
 7     7      20 "A"     "AA"         19
 8     8      15 "B"     "BB"         NA
 9     9      12 ""      ""           17
10    10      NA "A"     "AA"         23

By running the for loop above, we get:

# A tibble: 10 x 5
      id ColumnA ColumnB ColumnC ColumnD
   <dbl>   <dbl> <fct>   <fct>     <dbl>
 1     1    10   A       BB         19.2
 2     2     9   B       BB         20  
 3     3     8   A       CC         18  
 4     4     7   A       BB         22  
 5     5    11.6 A       BB         18  
 6     6    11.6 B       CC         17  
 7     7    20   A       AA         19  
 8     8    15   B       BB         19.2
 9     9    12   A       BB         17  
10    10    11.6 A       AA         23 

As we can see, the missing values have been imputed. You can see an example here
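For comparison, the same imputation can be sketched in base R without the tidy evaluation machinery. This is a rough equivalent under the same assumptions as this answer: numeric gaps are `NA`, categorical gaps are empty strings, and `getmode` is defined as above. It uses a plain `data.frame` rather than a tibble.

```r
# getmode as defined earlier in this answer
getmode <- function(v) {
  v <- v[nchar(as.character(v)) > 0]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Same example data, built as a base data.frame
df <- data.frame(
  id      = 1:10,
  ColumnA = c(10, 9, 8, 7, NA, NA, 20, 15, 12, NA),
  ColumnB = factor(c("A", "B", "A", "A", "", "B", "A", "B", "", "A")),
  ColumnC = factor(c("", "BB", "CC", "BB", "BB", "CC", "AA", "BB", "", "AA")),
  ColumnD = c(NA, 20, 18, 22, 18, 17, 19, NA, 17, 23)
)

for (col in names(df)) {
  x <- df[[col]]
  if (is.numeric(x)) {
    # numeric column: replace NA with the column mean
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else {
    # categorical column: replace "" with the most frequent non-empty value
    x[x == ""] <- getmode(x)
  }
  df[[col]] <- x
}

print(df)
```

The empty string remains as an unused factor level after the replacement; `droplevels(df)` would remove it if that matters downstream.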

answered 2020-04-18T15:00:46.070