r - Filtering multiple csv files while importing into a data frame

Question

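I have a large number of csv files that I want to read into R. All the column headings in the csv files are the same. However, I want to import only those rows from each file for which a variable lies within a given range (above a minimum threshold and below a maximum threshold), e.g.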

   v1   v2   v3
1  x    q    2
2  c    w    4
3  v    e    5
4  b    r    7

Filtering on v3 (v3>2 & v3<7) should result in:

   v1   v2   v3
1  c    w    4
2  v    e    5

So far I import all the data from all the csv files into one data frame and then do the filtering:

#Read the data files
fileNames <- list.files(path = workDir)
mergedFiles <- do.call("rbind", sapply(fileNames, read.csv, simplify = FALSE))
fileID <- row.names(mergedFiles)
fileID <- gsub(".csv.*", "", fileID)
#Combining data with file IDs
combFiles=cbind(fileID, mergedFiles)
#Filtering the data according to criteria
resultFile <- combFiles[combFiles$v3 > min & combFiles$v3 < max, ]

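I would rather apply the filter while importing each individual csv file into the data frame. I assume a for loop would be the best way to do it, but I am not sure how. I would appreciate any suggestions.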

Edit

After testing mnel's suggestion, I ended up with a different solution:

fileNames = list.files(path = workDir)
mzList = list()
# Read each file and keep only rows whose first column lies in (minMZ, maxMZ)
for (i in seq_along(fileNames)) {
  tempData = read.csv(fileNames[i])
  mz.idx = which(tempData[, 1] > minMZ & tempData[, 1] < maxMZ)
  mz1 = tempData[mz.idx, ]
  # Tag the kept rows with the name of the file they came from
  mzList[[i]] = data.frame(mz1, filename = rep(fileNames[i], length(mz.idx)))
}
resultFile = do.call("rbind", mzList)
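One thing to watch with the loop above: list.files(path = ...) returns bare file names unless full.names = TRUE is set, so read.csv only finds the files when the working directory is that folder. Below is a minimal base R sketch of the same filter-on-read idea that avoids this. The helper read_filtered, the temporary directory, and the demo data are made up for illustration and are not part of the original post.

```r
# Hypothetical helper: read one csv and keep only the rows whose `column`
# lies strictly between `lower` and `upper`.
read_filtered <- function(path, column, lower, upper) {
  dat  <- read.csv(path, stringsAsFactors = FALSE)
  keep <- which(dat[[column]] > lower & dat[[column]] < upper)
  out  <- dat[keep, , drop = FALSE]
  out$filename <- rep(basename(path), nrow(out))  # remember the source file
  out
}

# Demo data (invented): two small files in a temporary directory.
demoDir <- file.path(tempdir(), "csv_filter_demo")
dir.create(demoDir, showWarnings = FALSE)
write.csv(data.frame(v1 = c("x", "c", "v", "b"),
                     v2 = c("q", "w", "e", "r"),
                     v3 = c(2, 4, 5, 7)),
          file.path(demoDir, "a.csv"), row.names = FALSE)
write.csv(data.frame(v1 = c("n", "m"), v2 = c("t", "y"), v3 = c(6, 9)),
          file.path(demoDir, "b.csv"), row.names = FALSE)

# full.names = TRUE gives paths that read.csv can open from any working directory.
paths <- list.files(demoDir, pattern = "\\.csv$", full.names = TRUE)

# Filter each file as it is read, then stack the surviving rows.
resultFile <- do.call(rbind, lapply(paths, read_filtered,
                                    column = "v3", lower = 2, upper = 7))
rownames(resultFile) <- NULL
```

Each file is filtered before it is appended, so the unfiltered rows of earlier files are never held in memory together.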

Thanks for all the suggestions!

score 3 · Accepted Answer

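Here is an approach using data.table, which lets you use fread (faster than read.csv) and rbindlist, a very fast implementation of do.call(rbind, list(..)) that is well suited to this situation. It also has a function between.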

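library(data.table)
fileNames <- list.files(path = workDir)
alldata <- rbindlist(lapply(fileNames, function(x, min, max) {
  xx <- fread(x, sep = ',')
  # Record which file each row came from
  xx[, fileID := gsub(".csv.*", "", x)]
  # Keep rows with min < v3 < max (bounds excluded)
  xx[between(v3, lower = min, upper = max, incbounds = FALSE)]
  }, min = 2, max = 7))  # thresholds from the question's example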

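If the individual files are large and v3 always holds integer values, it might be worth setting v3 as a key and then using a binary search. It may also be quicker to import everything and then run the filtering.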
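The keyed lookup mentioned above could look roughly like this. It is a minimal sketch that assumes v3 is an integer column; the demo table and the variable names lower, upper, and wanted are invented for illustration. With a key set, joining on the explicit list of wanted integer values uses data.table's binary search instead of scanning the whole column.

```r
library(data.table)

# Demo table standing in for the imported data (values from the question's example).
dt <- data.table(v1 = c("x", "c", "v", "b"),
                 v2 = c("q", "w", "e", "r"),
                 v3 = c(2L, 4L, 5L, 7L))

# Sort by v3 and mark it as the key so lookups on v3 use binary search.
setkey(dt, v3)

# Open interval (lower, upper) expressed as the integer values it contains.
lower  <- 2L
upper  <- 7L
wanted <- seq.int(lower + 1L, upper - 1L)   # 3, 4, 5, 6

# Keyed join on v3; nomatch = 0L drops wanted values that do not occur.
res <- dt[.(wanted), nomatch = 0L]
```

This only works when the wanted values can be enumerated, which is why the integer assumption matters. For non-integer data the between() filter is the simpler choice.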

score 2 · Accepted Answer

If you want to do the "filtering" before the data is imported, try read.csv.sql from the sqldf package.
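A minimal sketch of how that could look, assuming the sqldf package is installed. The temporary directory and demo files are invented for illustration. In the SQL statement, read.csv.sql expects the table to be referred to as file, whatever the actual file name is. The files are written without quotes because read.csv.sql does not strip quote characters by default.

```r
library(sqldf)

# Demo data (invented): two small unquoted csv files in a temporary directory.
demoDir <- file.path(tempdir(), "sqldf_filter_demo")
dir.create(demoDir, showWarnings = FALSE)
write.csv(data.frame(v1 = c("x", "c", "v", "b"),
                     v2 = c("q", "w", "e", "r"),
                     v3 = c(2, 4, 5, 7)),
          file.path(demoDir, "a.csv"), row.names = FALSE, quote = FALSE)
write.csv(data.frame(v1 = c("n", "m"), v2 = c("t", "y"), v3 = c(1, 9)),
          file.path(demoDir, "b.csv"), row.names = FALSE, quote = FALSE)

paths <- list.files(demoDir, pattern = "\\.csv$", full.names = TRUE)

# The WHERE clause runs inside SQLite, so only matching rows reach R.
filtered <- lapply(paths, function(p) {
  read.csv.sql(p, sql = "select * from file where v3 > 2 and v3 < 7")
})
resultFile <- do.call(rbind, filtered)
```

The rows are filtered in a temporary SQLite database, which can help when a single file is too large to load comfortably into R in full.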

answered 2013-04-09T04:43:17.067

score 0 · Accepted Answer

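If you are really short of memory, the following solution might work. It uses LaF to read only the column needed for filtering, then calculates the total number of lines that will be read, initializes the complete data.frame, and finally reads the required lines from the files. (It is probably not faster than the other solutions.)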

library("LaF")

colnames <- c("v1","v2","v3")
colclasses <- c("character", "character", "numeric")

# pattern is a regular expression, not a glob
fileNames <- list.files(pattern = "\\.csv$")

# First determine which lines to read from each file and the total number of lines
# to be read
lines <- list()
for (fn in fileNames) {
  laf <- laf_open_csv(fn, column_types=colclasses, column_names=colnames, skip=1)
  d   <- laf$v3[] 
  lines[[fn]] <- which(d > 2 & d < 7)
}
nlines <- sum(sapply(lines, length))

# Initialize data.frame
df <- as.data.frame(lapply(colclasses, do.call, list(nlines)), 
        stringsAsFactors=FALSE)
names(df) <- colnames

# Read the lines from the files
i <- 0
for (fn in names(lines)) {
  laf <- laf_open_csv(fn, column_types=colclasses, column_names=colnames, skip=1)
  n   <- length(lines[[fn]])
  df[seq_len(n) + i, ] <- laf[lines[[fn]], ]
  i   <- i + n
}
