r - Subsetting a file-backed data frame (ffdf) runs out of memory

Question

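I am working on a 32-bit Windows machine with a dataset of about 17M x 4 values. In GNU R this takes ~700 MB, so as soon as I try anything more demanding I easily hit the 2 GB limit and get an out-of-memory error (cannot allocate vector ...).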

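No problem: there is the package "ff" for storing such data on disk. But my very first subsetting operation ran into the same error. According to the ff documentation, I expected "[" to subset directly into another ffdf without loading two copies of the data into memory. Where exactly am I going wrong?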

ffshares = read.table.ffdf(
  file=tmpFilename, header = FALSE, sep = ",", quote = "\"",
  dec = ".",
  col.names = c("articleID", "measure", "time", "value"),
  na.strings = c("","-1","\\N"),
  colClasses = c("integer","factor","POSIXct","integer"),
  check.names = TRUE, fill = TRUE,
  strip.white = FALSE, blank.lines.skip = TRUE,
  comment.char = "",
  allowEscapes = FALSE, flush = FALSE #, nrow=1000
)
# Up to this point, the R process uses about 200 MB

ffshares = ffshares[ffshares[,"articleID"] %in% articles[,"articleID"],]
# When I run this, memory consumption exceeds 1.7 GB and hits the available limit

Note: articles is a data frame with about 30K rows. articleID is a plain integer.

Bonus question: ffshares[,"articleID"] works, but ffshares$articleID does not. According to the documentation, the dollar sign ($) should work just as it does on a data frame?!

Thanks for any suggestions :)

score -1 · Accepted Answer

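I apologize that this is not a direct answer to your question, but I have used the bigmemory package for objects of similar size. Have a look at the filebacked.big.matrix function and the operators optimized for big matrices. Even though it does not answer your question directly, you may find it useful.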
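
As a rough illustration of that suggestion, here is a minimal sketch with made-up toy data and file names. It assumes the table can be reduced to integer columns: a big.matrix holds a single atomic type, so the factor "measure" and POSIXct "time" columns from the question would have to be encoded as numbers first. The names shares, wanted_ids and subset_shares are hypothetical.

```r
library(bigmemory)

# Toy data standing in for the 17M-row table (size and file names are made up).
# Only integer columns are stored, because a big.matrix has one type.
n <- 1000L
set.seed(1)
backing_dir <- tempfile("bm_")
dir.create(backing_dir)

shares <- filebacked.big.matrix(
  nrow = n, ncol = 2, type = "integer",
  backingfile = "shares.bin",
  descriptorfile = "shares.desc",
  backingpath = backing_dir
)
shares[, 1] <- sample.int(50L, n, replace = TRUE)    # column 1: articleID
shares[, 2] <- sample.int(1000L, n, replace = TRUE)  # column 2: value

wanted_ids <- c(3L, 7L, 42L)  # stands in for articles[, "articleID"]

# Only the articleID column is pulled into RAM for the membership test.
keep <- which(shares[, 1] %in% wanted_ids)

# Copy the matching rows into a second file-backed matrix.
subset_shares <- deepcopy(
  shares, rows = keep,
  backingfile = "subset.bin",
  descriptorfile = "subset.desc",
  backingpath = backing_dir
)

dim(subset_shares)
```

The membership test still loads one full column into memory, which for 17M integers is on the order of tens of megabytes rather than the whole table; the subset itself lives in its own backing file.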
