r - Finding typos in a so-called flat data frame in R (rows that differ within a factor level)

Question

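I have a so-called flat data frame with about 40 columns of various data types. One variable acts as a unique index for roughly the first 15 columns. Because the data frame is a flattened relational database, all rows that share a value of that index variable should be identical in those columns. But they are not, and I want to find out where the typos are.

I made this very simplified example: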

    structure(list(f = structure(c(1L, 2L, 3L, 3L, 4L, 5L, 6L, 6L, 
7L, 7L), .Label = c("a", "b", "c", "d", "e", "f", "g"), class = "factor"), 
    number = c(1, 2, 3, 3, 4, 5, 6, 7, 21, 21), name = structure(c(1L, 
    2L, 4L, 3L, 5L, 6L, 7L, 7L, 8L, 8L), .Label = c("alfa", "beta", 
    "calostrE", "calostrO", "dedo", "elefante", "fiasco", "general"
    ), class = "factor")), .Names = c("f", "number", "name"), row.names = c(NA, 
-10L), class = "data.frame")

It looks like this:

   f number     name
1  a      1     alfa
2  b      2     beta
3  c      3 calostrO
4  c      3 calostrE
5  d      4     dedo
6  e      5 elefante
7  f      6   fiasco
8  f      7   fiasco
9  g     21  general
10 g     21  general

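f is the unique index. In my original data frame it is a date that has been converted to a factor, but that does not matter. As you can see, rows 9 and 10 are correct, because all the other variables have identical values. Rows 1, 2, 5 and 6 are also correct, because there is only one row for each of those factor values. But rows 3-4 and rows 7-8 are incorrect: they contain typos, so the values of the variables are not identical.

The result I want looks like this: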

Rows.with.typos..........Column.names  
.....3......................."name"  
.....7......................."number"

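As you can see, I am also having trouble with Markdown.

This example is simple, but if there were inequalities (typos) in more than one column, there should be several elements under "Column.names" in the final result. Note also that my original data frame is wide and has many columns, and only some of them should be identical for a given value of f.

Clarification added later: the row reported is always the first one of its group (see my reply to the comment below).

I have only managed to get the rows that contain typos, and in such a convoluted way that I do not think posting it would be useful.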

score 1 · Accepted Answer

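I wrote a function myself that solves the problem. Even better, it writes a new Excel file in which the typos are easy to spot, because every other cell is filled with dashes. I think it could be useful to many beginners and data cleaners, although the code can surely be improved. The variable and function names are in Spanish.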

detectar_errores<-function(x,variables,index){

#The first argument is the data frame and the third is the index. The second is a vector, either numeric (positional) or of variable names, that specifies which variables should have identical values whenever the variable "index" has the same value.


#Checks packages


if(require("xlsx")){
    print("xlsx está correctamente cargado")
} else {
    print("tratando de instalar xlsx")
    install.packages("xlsx")
    if(require(xlsx)){
        print("xlsx instalado y cargado")
    } else {
        stop("no pude instalar y cargar xlsx")
    }
}

if(require("dplyr")){
    print("dplyr está correctamente cargado")
} else {
    print("tratando de instalar dplyr")
    install.packages("dplyr")
    if(require(dplyr)){
        print("dplyr instalado y cargado")
    } else {
        stop("no pude instalar y cargar dplyr")
    }
}


#Selects the variables and groups by index
#Then creates a new variable that is TRUE if the group has more than one row and all of its rows are distinct
#The result is stored in a new dataframe called "primera"

primera<-x %>% select(variables) %>% group_by(index) %>% do({
  clasificador<-nrow(.)==nrow(unique(.)) & nrow(.)>1
  data.frame(.,clasificador) #The dot stands for the current group
})

#Selects the rows that interest us and stores them in another dataframe

segunda<-primera[primera$clasificador==T,]

#Creates a function that takes a vector and checks whether all its elements are identical (e.g. 3, 3, 3)
#If they are, returns as many NAs as the vector is long (that variable has no typos)
#If they aren't, returns the vector unchanged, so that the discrepancies can be seen

todosiguales<-function(x){
  clase<-class(x)
  if(identical(x,rep(x[1],length(x)))){
  solucion<-rep(NA,length(x))
  class(solucion)<-clase
  return(solucion)
  }else{
return(x)}
}

#Creates a function that substitutes the NAs for lines in a character vector

rayas<-function (y){
  y[is.na(y)]<-"--"
  return(y)
}

#Creates another dataframe by manipulating the previous one
#It groups by the index and then transforms the variables
#It coerces them to character, then applies the function todosiguales and then the function rayas

tercera<-segunda %>%
      group_by(index) %>%
      mutate_each(funs(as.character)) %>%
      mutate_each(funs(todosiguales)) %>%
      mutate_each(funs(rayas))

#That returns the last dataframe. Now it's written as a new Excel file

write.xlsx(tercera,"Errores_detectados.xls")
}
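For readers without the xlsx package, the masking step of the function above can be approximated in base R. This is only a minimal sketch, not the author's code: mask_constant is a hypothetical helper, df rebuilds the question's example by hand, and the result is returned as a data frame instead of being written to Excel.

```r
# Example data from the question, rebuilt by hand
df <- data.frame(
  f      = c("a", "b", "c", "c", "d", "e", "f", "f", "g", "g"),
  number = c(1, 2, 3, 3, 4, 5, 6, 7, 21, 21),
  name   = c("alfa", "beta", "calostrO", "calostrE", "dedo",
             "elefante", "fiasco", "fiasco", "general", "general"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: within each group of `index`, replace values that are
# constant across the group with "--" and keep the values that disagree
mask_constant <- function(data, variables, index) {
  groups <- split(data, data[[index]], drop = TRUE)
  pieces <- lapply(groups, function(g) {
    # Groups whose rows all agree on `variables` contain no discrepancy
    if (nrow(unique(g[variables])) < 2) return(NULL)
    for (v in variables) {
      values <- as.character(g[[v]])
      g[[v]] <- if (length(unique(values)) == 1) rep("--", nrow(g)) else values
    }
    g[unique(c(index, variables))]
  })
  # Drop the clean groups and stack what is left
  do.call(rbind, unname(Filter(Negate(is.null), pieces)))
}

masked <- mask_constant(df, c("number", "name"), "f")
print(masked)
```

On the example data only the rows of groups c and f remain, with the agreeing column of each group masked, which mirrors the layout the Excel file is meant to show.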

score 0 · Accepted Answer

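Maybe you could try the following (df1 is the dataset). It is not clear how the comparison should be made, especially when there are more than two entries per group of f.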

df1$name <- as.character(df1$name)
res <- do.call(rbind, lapply(split(df1[, -1], df1$f), function(x) {
    indx <- !(duplicated(x) | duplicated(x, fromLast = TRUE) | nrow(x) == 1)
    x1 <- x[indx]
    x2 <- x1[1, !apply(x1, 2, anyDuplicated) > 0]
    if (length(x1) > 0) {
        data.frame(Rows.with.typos = rownames(x1)[1],
                   Column.names = x2, stringsAsFactors = FALSE)
    }
}))

res
#  Rows.with.typos Column.names
#c               3     calostrO
#f               7            6
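The output above reports the differing values. If the column names are wanted instead, as in the result the question asks for, a base R sketch along the following lines may work for groups of any size. find_typos is a hypothetical helper, df rebuilds the question's example by hand, and treating "more than one distinct value within a group" as a typo is an assumption about how the comparison should be made.

```r
# Example data from the question, rebuilt by hand
df <- data.frame(
  f      = c("a", "b", "c", "c", "d", "e", "f", "f", "g", "g"),
  number = c(1, 2, 3, 3, 4, 5, 6, 7, 21, 21),
  name   = c("alfa", "beta", "calostrO", "calostrE", "dedo",
             "elefante", "fiasco", "fiasco", "general", "general"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each group of `index`, report the position of the
# group's first row and the names of the columns that are not constant
find_typos <- function(data, variables, index) {
  groups <- split(seq_len(nrow(data)), data[[index]], drop = TRUE)
  hits <- lapply(groups, function(rows) {
    # A column is suspect when it holds more than one distinct value
    differs <- vapply(variables,
                      function(v) length(unique(data[[v]][rows])) > 1,
                      logical(1))
    if (!any(differs)) return(NULL)
    data.frame(Rows.with.typos = rows[1],
               Column.names = paste(variables[differs], collapse = ", "),
               stringsAsFactors = FALSE)
  })
  # Groups without discrepancies returned NULL and are dropped here
  do.call(rbind, unname(Filter(Negate(is.null), hits)))
}

res <- find_typos(df, c("number", "name"), "f")
print(res)
```

When several columns disagree within one group, their names are joined into a single comma-separated string; a list column would be an alternative if the names need further processing.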

score 0 · Accepted Answer

The following shows only the unique entries when the number and name columns are combined. The typos can be seen clearly here:

> ddf[!duplicated(paste(ddf$number,ddf$name)),]
  f number     name
1 a      1     alfa
2 b      2     beta
3 c      3 calostrO
4 c      3 calostrE
5 d      4     dedo
6 e      5 elefante
7 f      6   fiasco
8 f      7   fiasco
9 g     21  general

Row 10 does not appear because it is a duplicate.

The following shows only the duplicates among the rows above:

> ddf2 = ddf[!duplicated(paste(ddf$number,ddf$name)),]
> ddf2[duplicated(ddf2$number) | duplicated(ddf2$name),]
  f number     name
4 c      3 calostrE
8 f      7   fiasco
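The same idea can be written without paste(), which can in principle merge distinct value pairs into the same string, and with the comparison restricted to rows that share a value of f. The sketch below assumes ddf holds the question's example (rebuilt by hand here); cols, distinct_rows, conflicting and suspects are illustrative names only.

```r
# Example data from the question, rebuilt by hand
ddf <- data.frame(
  f      = c("a", "b", "c", "c", "d", "e", "f", "f", "g", "g"),
  number = c(1, 2, 3, 3, 4, 5, 6, 7, 21, 21),
  name   = c("alfa", "beta", "calostrO", "calostrE", "dedo",
             "elefante", "fiasco", "fiasco", "general", "general"),
  stringsAsFactors = FALSE
)

# Columns that should agree within each value of f
cols <- c("number", "name")

# Drop exact repeats of (f + checked columns), such as row 10
distinct_rows <- ddf[!duplicated(ddf[c("f", cols)]), ]

# An f value that still occurs more than once has conflicting rows
conflicting <- distinct_rows$f[duplicated(distinct_rows$f)]

# Keep every row of the conflicting groups so both variants are visible
suspects <- distinct_rows[distinct_rows$f %in% conflicting, ]
print(suspects)
```

Because duplicated() is applied to a data frame, whole rows are compared column by column, so the list in cols can be extended to as many columns as need checking.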
