r - R: Error in names(x) <- value depending on the boxplot range in a loop

Question

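I have a large dataset with 270 columns and 17392 rows. Of these 270 columns, I need to select 235. The rows can be grouped by 'Site', which is a unique numeric value (e.g. 1, 2, etc.; 111 different sites in total). Each column constitutes a "region". Here is a small example (note that there are many more columns and subjects):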

SubjID LLatVent RLatVent FullSurfArea Site
Subj1  1580.6    2345      180980      1
Subj2  4803.8    2232      210003      1
Subj3  14936     1456      198045      2
Subj4  14556     1200      176079      2

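My goal is to count the number of outliers in each region, grouped by site, and write a csv file with the results. My code works if I use 1.5*IQR, but if I use 2.5*IQR I get an error and I don't understand why. The error is: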

Error in names(x) <- value : 'names' attribute [235] must be the same length as the vector [1]

My code attempt (failing):

#start
ALL <- read.csv("ALL.csv")

#get columns of interest (235)

start <- which(colnames(ALL)=="LLatVent")
end <- which(colnames(ALL)=="FullSurfArea")

#create vector with these column numbers

regions <- start:end

#divide by site (111 sites in total)

df_list <- split(ALL, as.factor(ALL$Site))

#loop through sites and regions and save the outlier SubjIDs in data frame outliers_subjID

for (j in df_list){
  outliers_subjID_list <- list()
  count <- 0
  for (i in regions){
    count <- count + 1
    OutVals <- boxplot(j[,i], plot=FALSE, range=2.5)$out
    outliers_subjID_list[[count]] <- j$SubjID[which(j[,i] %in% OutVals)]
  }
  n.obs <- sapply(outliers_subjID_list, length)
  seq.max <- seq_len(max(n.obs))
  outliers_subjID <- as.data.frame(sapply(outliers_subjID_list, "[", i = seq.max))
  colnames(outliers_subjID) <- colnames(j)[regions]

  #write csv files

  write.csv(outliers_subjID, paste0(unique(j$Site), ".csv"))
}

Why do I get an error when I use range=2.5? The same thing happens if I use boxplot.stats(as.matrix(j[,i]), coef=2.5)$out.

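Also, I would like to compute the total number of outliers for each region once they have been computed per site. At the moment I am binding all the csv files together and then using summarise_all to count the observations for each region, but I feel there is a smarter way.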

Thank you very much, and please let me know if I can provide more information.

score 0 · Accepted Answer

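This is probably because some regions do not have any outliers as large as those defined by 2.5 times the IQR. You can prevent the error by using an if statement to bypass the line that causes it.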

for (i in regions){
  count <- count + 1 # maybe move this
  OutVals <- boxplot(j[,i], plot=FALSE, range=2.5)$out

  if(length(OutVals)>0)  # <-- add this line
    outliers_subjID_list[[count]] <- j$SubjID[which(j[,i] %in% OutVals)]
}

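Since I don't have your data, I could not test this. You may need to modify the code slightly. For example, count may need to be moved inside the if statement.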
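
One further possibility, which cannot be confirmed without the data: the message says that the object receiving 235 names has length 1. sapply() simplifies its result to a plain vector when every element it returns has length 1, so if the largest number of outliers in any region of a site is exactly 1, as.data.frame() produces a single column instead of 235. With range=1.5 more values are flagged, so sapply() is more likely to return a matrix and the problem stays hidden. Below is a minimal sketch of one way around this, which builds the data frame from a padded list instead of relying on sapply() simplification, and counts outliers per region and site in memory rather than from the csv files. The toy data and the helper names find_outlier_ids and pad_to_data_frame are made up for illustration.

```r
# Toy data for two sites (hypothetical values, not the asker's data).
ALL <- data.frame(
  SubjID   = paste0("Subj", 1:20),
  Site     = rep(c(1, 2), each = 10),
  LLatVent = c(10:18, 500, 30:39),  # site 1 contains one extreme value
  RLatVent = c(20:29, 40:49),       # no extreme values in either site
  stringsAsFactors = FALSE
)
regions <- c("LLatVent", "RLatVent")

# Return the SubjIDs flagged as outliers in each region of one site.
find_outlier_ids <- function(site_df, regions, coef = 2.5) {
  ids <- lapply(regions, function(r) {
    out <- boxplot.stats(site_df[[r]], coef = coef)$out
    site_df$SubjID[site_df[[r]] %in% out]
  })
  names(ids) <- regions
  ids
}

# Pad every region to the same length with NA, then build the data frame
# from the list, so there is one column per region even with 0 or 1 rows.
pad_to_data_frame <- function(ids) {
  n_rows <- max(lengths(ids))
  padded <- lapply(ids, function(x) x[seq_len(n_rows)])
  as.data.frame(padded, stringsAsFactors = FALSE)
}

ids_by_site    <- lapply(split(ALL, ALL$Site), find_outlier_ids, regions = regions)
tables_by_site <- lapply(ids_by_site, pad_to_data_frame)

# Matrix of counts: one row per site, one column per region.
counts_by_site   <- do.call(rbind, lapply(ids_by_site, lengths))
# Total number of outliers per region across all sites.
total_per_region <- colSums(counts_by_site)
```

Each element of tables_by_site could then be passed to write.csv() as in the original loop, and counts_by_site or total_per_region could be written out directly, which would avoid binding the csv files back together.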
