
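I have a very large data object X (say 10+ GB). I want to run some operations in parallel over the categories inside the object so that they finish quickly (for example, fitting many predictive models). In terms of RAM usage, which is more efficient: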

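1) Pass the whole data.table object to the child processes, operate on the whole object there (subsetting, for example, though it could be something else), and then run the intensive operation on the smaller dataset. Since we are forking, does that mean the children will not copy X as long as we do not modify it? For example, if we subset X and assign the result to v (as in the example below), does the child process use additional RAM only for v, or does it copy X inside the child, so that every spawned child effectively consumes RAM the size of X on top of what the parent uses?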

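2) Split the data in the parent into a list in which each element contains only the data needed for the intensive operation, and then pass just that element to each forked child? This approach means we effectively double the memory used in the parent before running mclapply, because the list contains all the same data as the original large data.table, just split up by category.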

Here is a toy example. It would be trivial to do in data.table without mclapply, but it gets the point across:

library(data.table)
library(parallel)

X <- data.table(x1 = rnorm(1000), category = rep(c("a", "b", "c", "d"), 250))

# First approach. Pass in the big object:
doIntensiveStuff <- function(y, cat) {
    # y is big and holds all categories
    # subset the rows we need from all of y, then do the calculations we need
    v <- y[category == cat, ]
    mean(v[, x1])
}

z <- mclapply(X = X[, unique(category)], FUN = doIntensiveStuff, y = X, mc.cores = length(X[, unique(category)]))






# Second approach. Double the data in the parent, but the forked children get only what they need:

doIntensiveStuff2 <- function(yJustOneCategory) {
    # yJustOneCategory is a smaller object inside the child forked process, while the large X object remains in the parent.
    yJustOneCategory[, mean(x1)]
}

z2 <- mclapply(X = split(X, by = "category"), FUN = doIntensiveStuff2, mc.cores = length(X[, unique(category)]))

all.equal(as.numeric(unlist(z)), as.numeric(unlist(z2)))
# TRUE
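
The "doubling" concern in the second approach can be checked roughly on the toy data. The following is a minimal sketch, not a measurement of real process memory: it assumes that object.size() is an acceptable estimate, and the names parts, size_X and size_parts are hypothetical helpers introduced only for this illustration.

```r
library(data.table)

# Toy data with the same shape as in the question.
X <- data.table(x1 = rnorm(1000), category = rep(c("a", "b", "c", "d"), 250))

# Approach 2 materialises one data.table per category in the parent.
parts <- split(X, by = "category")

# Estimated size of the original object and of all split pieces together.
size_X     <- as.numeric(object.size(X))
size_parts <- sum(vapply(parts, function(p) as.numeric(object.size(p)), numeric(1)))

# Ratio of the split list to the original. While both X and parts are alive,
# the parent holds roughly size_X + size_parts bytes for this data.
size_parts / size_X
```

Note that object.size() only reports the estimated size of R objects in the parent. It says nothing about how many pages a forked child actually ends up copying.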


Any help clarifying my understanding of how R uses memory for forked processes with large data would be appreciated.
