This question is similar to other questions about very large data in R, but I can't find an example of how to merge/join two dfs and then perform calculations on them (as opposed to reading in many data frames and using mclapply to do the calculations). The problem here is not loading the data (that takes about 20 minutes, but they do load) — it's the merging and summarising.
I've tried every data.table method I could find, different types of joins, and ff, but I still hit the vecseq limit of 2^31 rows. Now I'm trying to use multidplyr to do it in parallel, but I can't figure out where the errors are coming from.
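For context, the vecseq error appears when a single join would materialize more than 2^31 rows. One workaround (a sketch only, not tested at the 65M-row scale — the chunk count and function name are my own) is to do the join and count in data.table in chunks of species_id, so no single join result ever approaches that limit:

```r
library(data.table)

## Sketch: join + count in data.table, chunked by species_id so no single
## join materializes anywhere near 2^31 rows. Column names follow the
## question; n_chunks is an assumption to tune for your data.
count_by_chunk <- function(species_dt, lookup_dt, n_chunks = 20L) {
  setkey(lookup_dt, id)
  ids <- unique(species_dt$species_id)
  chunks <- split(ids, cut(seq_along(ids), n_chunks, labels = FALSE))
  rbindlist(lapply(chunks, function(sp) {
    ## inner join of one chunk of species rows onto the lookup table
    m <- lookup_dt[species_dt[species_id %in% sp], on = "id", nomatch = 0L]
    ## number of distinct cells each species occurs in, per region
    m[, .(count_cells_eez = uniqueN(cell_id)), by = .(rgn_id, species_id)]
  }))
}
```

The per-chunk results are small aggregates, so rbindlist() at the end is cheap even when the intermediate joins are large.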
Data frames:
species_data # df of ~65 million rows, cols <- c("id", "species_id")
lookup # df of ~17 million rows, cols <- c("id", "cell_id", "rgn_id")
Not all ids in lookup appear in species_data.
## make sample dataframes:
lookup <- data.frame(id = seq(2001,2500, by = 1),
cell_id = seq(1,500, by = 1),
rgn_id = seq(801,1300, by = 1))
library(stringi)
species_id <- sprintf("%s-%s", stri_rand_strings(n = 1000, length = 3, pattern = "[A-Z]"),
                      stri_rand_strings(n = 1000, length = 5, pattern = "[1-9]"))
## make ids integers in 2000-2499 so some of them actually match lookup$id (2001-2500)
id <- as.integer(sprintf("%s%s%s", stri_rand_strings(n = 1000, length = 1, pattern = "[2]"),
                         stri_rand_strings(n = 1000, length = 1, pattern = "[0-4]"),
                         stri_rand_strings(n = 1000, length = 2, pattern = "[0-9]")))
species_data <- data.frame(species_id, id)
Merge and join the dfs using multidplyr
library(tidyverse)
install.packages("devtools")
library(devtools)
devtools::install_github("hadley/multidplyr")
library(multidplyr)
library(parallel)
species_summary <- species_data %>%
# partition the species data by species id
partition(species_id, cluster = cluster) %>%
left_join(lookup, by = "id") %>%
dplyr::select(-id) %>%
group_by(species_id) %>%
## total number of cells each species occurs in
mutate(tot_count_cells = n_distinct(cell_id)) %>%
ungroup() %>%
dplyr::select(c(cell_id, species_id, rgn_id, tot_count_cells)) %>%
group_by(rgn_id, species_id) %>%
## number of cells each species occurs in each region
summarise(count_cells_eez = n_distinct(cell_id)) %>%
collect() %>%
as_tibble()
## Error in partition(., species_id, cluster = cluster) : unused argument (species_id)
## If I change to:
species_summary <- species_data %>%
group_by(species_id) %>%
partition(cluster = cluster) %>% ...
## I get "Error in worker_id(data, cluster) : object 'cluster' not found"
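For reference, here is the shape a working call takes under the current GitHub multidplyr API (an assumption about which version you installed — worth checking against your copy): the cluster is created explicitly with new_cluster(), you group_by() first, and partition() takes only the cluster. A minimal sketch, simplified to the per-region count:

```r
library(dplyr)
library(multidplyr)

## create the cluster explicitly; the second error above suggests no
## object named `cluster` existed in the session
cluster <- new_cluster(parallel::detectCores() - 1)
cluster_library(cluster, "dplyr")

species_summary <- species_data %>%
  left_join(lookup, by = "id") %>%
  group_by(species_id) %>%
  partition(cluster) %>%   # current API: partition(cluster), groups kept together
  summarise(count_cells = n_distinct(cell_id)) %>%
  collect()
```

Note the left_join still runs locally before partitioning; multidplyr parallelises the grouped summarise, not the join itself.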
This is my first attempt at parallelism and big data, and I'm struggling to diagnose the errors.
Thanks!