r - R memory management with mclapply and data.table

Question

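I have a very large data object X (say 10+ GB). I want to perform some operations in parallel across categories within the object so that they run quickly (for example, many fits of a predictive model). In terms of RAM usage, which is more efficient: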

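1) Pass the entire data.table object to the child processes, operate on the whole object there (for example subset it, though it could be something else), and then run the intensive operation on the smaller dataset. Since we are forking, does this mean the children do not copy X as long as we do not modify it? For example, if we subset X and assign the result to v (as in the example below), does the child process use extra RAM only for v, or does it copy X inside the child, so that every spawned child effectively consumes RAM of the size of X in addition to the parent?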

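2) Split the data in the parent into a list in which each element contains the data needed for the intensive operation, and then pass only what is needed to each forked child process? This approach means we effectively double the memory used in the parent before running mclapply, since the list holds all of the same data as the original large data.table, just split up appropriately.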

Here is a toy example, which would normally be trivial in data.table without mclapply, but it gets the point across:

library(data.table)

X <- data.table(x1 = rnorm(1000), category = rep(c("a", "b", "c", "d"), 250))

# First approach: pass in the big object.
doIntensiveStuff <- function(y, cat) {
    # y is big and holds all categories; note that `cat` shadows base::cat() inside this function.
    # Operate on all of y to subset the rows we need, then do the calculation we need.
    v <- y[category == cat, ]
    mean(v[, x1])
}

# Run serially here; for the forked version, swap in parallel::mclapply and add
# mc.cores = length(X[, unique(category)]).
z <- lapply(X = X[, unique(category)], FUN = doIntensiveStuff, y = X)





# Second approach: double the data in the parent, but each forked child gets only what it needs.

doIntensiveStuff2 <- function(yJustOneCategory, cat) {
    # yJustOneCategory is a smaller object inside the child forked process, while the large X object remains in the parent.
    # `cat` is never supplied or used here; it is kept only to mirror doIntensiveStuff.
    yJustOneCategory[, mean(x1)]
}

z2 <- parallel::mclapply(X = split(X, by = "category"), FUN = doIntensiveStuff2, mc.cores = length(X[, unique(category)]))

all.equal(as.numeric(unlist(z)), as.numeric(unlist(z2)))
# TRUE
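
To put a rough number on the "doubling" concern in the second approach, something like the following sketch could be used. It only compares `object.size()` estimates of the whole table and of the split list in the parent; it says nothing about copy-on-write behaviour in the forked children, which is the part I am unsure about. `X_demo`, `parts` and `size_ratio` are hypothetical names for this illustration, and the table is far smaller than my real data.

```r
library(data.table)

# Hypothetical stand-in for the large table; the size is illustrative only.
set.seed(1)
X_demo <- data.table(
    x1 = rnorm(1e5),
    category = rep(c("a", "b", "c", "d"), 25000)
)

# Split in the parent, as in the second approach.
parts <- split(X_demo, by = "category")

# object.size() gives an estimate of the memory attributed to each object.
size_whole <- as.numeric(object.size(X_demo))
size_parts <- sum(vapply(parts, function(p) as.numeric(object.size(p)), numeric(1)))

# A ratio near 1 suggests the list holds roughly a second copy of the data.
size_ratio <- size_parts / size_whole
print(size_ratio)
```

If the ratio comes out close to 1, the parent is holding roughly two copies of the data while both `X_demo` and `parts` are alive, which is the cost I am trying to weigh against whatever the children do with the full object in the first approach.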

Any help clarifying my understanding of how R uses memory in forked processes with large data would be appreciated.
