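I am trying to perform some matching based on a one-to-many left_join. The problem is that, even when running the whole thing on a computing cluster, the basic match produces a dataset that is too large to handle.
I get this error: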
#Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
#Join results in more than 2^31 rows (internal vecseq reached physical limit).
#Very likely misspecified join.
#Check for duplicate key values in i each of which join to the same group in x over and over again.
#If that's ok, try by=.EACHI to run j for each group to avoid the large allocation.
#Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
#Calls: polbur_match_nb ... dt_eval -> eval_tidy -> [ -> [.data.table -> vecseq
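According to the message, the error is raised because the joined result would exceed 2^31 rows. As a rough sketch of how the size of a one-to-many join could be checked before running it, the snippet below multiplies the per-city row counts of both tables. The toy tables `p` and `b` and the names `n_p`, `n_b`, `sizes` and `expected_rows` are made up for illustration; only the `city` join key comes from the code in this post.

```r
library(data.table)

# Toy stand-ins for the real tables (assumption: both share a `city` key)
p <- data.table(id = 1:5, city = c("A", "A", "A", "B", "C"))
b <- data.table(city = c("A", "A", "B", "B", "B"))

# Rows per city on each side of the join
n_p <- p[, .(n_p = .N), by = city]
n_b <- b[, .(n_b = .N), by = city]

# A left join keeps unmatched left rows once, so treat a missing count as 1
sizes <- merge(n_p, n_b, by = "city", all.x = TRUE)
sizes[is.na(n_b), n_b := 1L]

# Use doubles so the product cannot overflow R's 32-bit integers
sizes[, n_rows := as.numeric(n_p) * n_b]
expected_rows <- sum(sizes$n_rows)

# TRUE would mean the join cannot fit in a single data.table
too_big <- expected_rows > 2^31
```

Sorting `sizes` by `n_rows` would also show which cities contribute most of the blow-up.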
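The join itself is not misspecified, but I need to find a way to do everything at once. I tried splitting the left-hand database into more manageable chunks (by state) and running my_matching_function on each chunk.
However, even with a micro-sample database for both p and b, this is very slow. What should I do to speed up the whole process? Is there anything in the code that could be improved?
Apologies for the lack of a reproducible example.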
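library(dplyr)    # %>%, filter
library(stringr)  # str_sub

match_bystate <- function(r, p, b) {
  gc()
  states <- sort(unique(p$state))
  matches_list <- vector("list", length(states))
  for (i in seq_along(states)) {
    # Keep only the rows of p that belong to the current state
    p_state <- p %>%
      filter(state == states[i])
    # Pass the chunk (p_state), not the full p, so each iteration matches one state only
    matches_list[[i]] <- my_matching_function(r, p_state, b)
  }
  final <- do.call(rbind, matches_list)
  # File name pattern: ../<YYYYMMDD>_match_<r>.RDS
  saveRDS(final, file = paste0("../", gsub("-", "", str_sub(Sys.time(), 1, 10)), "_match_", r, ".RDS"))
}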
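For comparison, a minimal sketch of the same chunk-by-state idea written with `split()` and `data.table::rbindlist()` instead of a `for` loop and `do.call(rbind, ...)`. The function `match_chunk` is a hypothetical stand-in for my_matching_function that only counts candidate rows per city, and the toy tables are invented for illustration.

```r
library(data.table)

# Hypothetical stand-in for my_matching_function: for each row of the chunk,
# count how many rows of b share its city
match_chunk <- function(p_chunk, b) {
  n <- vapply(p_chunk$city, function(ct) sum(b$city == ct),
              integer(1), USE.NAMES = FALSE)
  data.table(id = p_chunk$id, state = p_chunk$state, n_candidates = n)
}

# Toy data (assumption: p has `state` and `city`, b has `city`)
p <- data.table(id = 1:4,
                state = c("NY", "NY", "CA", "TX"),
                city  = c("A", "B", "C", "A"))
b <- data.table(city = c("A", "A", "C"))

# One data.table per state, in order of first appearance
chunks <- split(p, by = "state")

# Each chunk is processed on its own, then the pieces are stacked
results <- lapply(chunks, match_chunk, b = b)
final <- rbindlist(results)
```

Because each chunk is independent, the `lapply()` call is also the natural place to swap in a parallel apply if the cluster allows it.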
Here is a simplified version of my_matching_function:
library(dplyr)
library(dtplyr)  # lazy_dt

my_matching_function <- function(r, p, b) {
  p_original <- p
  p <- p %>%
    dplyr::select(id, city, lastname1, lastname2)
  b <- b %>%
    dplyr::select(city, lastname1, lastname2) %>%
    dplyr::rename("lastname1_match" = "lastname1",
                  "lastname2_match" = "lastname2")
  matches <- p %>%
    data.table::data.table() %>%
    lazy_dt(immutable = FALSE) %>%
    # One-to-many join: every row of p is paired with every row of b in the same city
    dplyr::left_join(b, by = "city") %>%
    # match_1 / match_2: does lastname1 / lastname2 equal either surname of the candidate?
    dplyr::mutate(match_1 = tidyr::replace_na(ifelse(lastname1 == lastname1_match | lastname1 == lastname2_match, 1, 0), 0)) %>%
    dplyr::mutate(match_2 = tidyr::replace_na(ifelse(lastname2 == lastname1_match | lastname2 == lastname2_match, 1, 0), 0)) %>%
    as.data.frame() %>%
    dplyr::mutate(sum = rowSums(across(match_1:match_2))) %>%
    data.table::data.table() %>%
    dplyr::mutate(final_1 = ifelse(sum >= 1, 1, 0)) %>%  # at least one surname matches
    dplyr::mutate(final_2 = ifelse(sum >= 2, 1, 0)) %>%  # both surnames match
    # Collapse back to one row per row of p, counting matching candidates
    group_by_at(c(names(p))) %>%
    dplyr::summarise(final_1 = sum(final_1),
                     final_2 = sum(final_2)) %>%
    as.data.frame()
  # NOTE: this pauses for 60 seconds on every call, i.e. once per state chunk
  Sys.sleep(60)
  matches_final <- p_original %>%
    dplyr::left_join(matches, by = names(p)) %>%  # join on id, city, lastname1, lastname2
    dplyr::mutate(raisyear = r)
  return(matches_final)
}
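The error message itself suggests `by=.EACHI`, which runs `j` once per row of the left table and so avoids allocating the full one-to-many result. The following is only a sketch of that idea, not a drop-in replacement: it assumes final_1 is meant to count the candidates in b where at least one surname matches and final_2 those where both match, and it uses invented toy data. The names `counts`, `m1` and `m2` are illustrative.

```r
library(data.table)

# Toy data; b's surname columns are already renamed as in my_matching_function
p <- data.table(id = 1:3,
                city = c("A", "A", "C"),
                lastname1 = c("smith", "lee", "diaz"),
                lastname2 = c("jones", "kim", "cruz"))
b <- data.table(city = c("A", "A", "A", "B"),
                lastname1_match = c("smith", "jones", "kim", "smith"),
                lastname2_match = c("jones", "park", "lee", "jones"))

# For each row of p (the `i` table), j only sees the rows of b from the same
# city, so the full cartesian-style result is never materialised
counts <- b[p, on = "city", {
  m1 <- i.lastname1 == lastname1_match | i.lastname1 == lastname2_match
  m2 <- i.lastname2 == lastname1_match | i.lastname2 == lastname2_match
  # Treat NA comparisons (e.g. a city with no candidates) as non-matches
  m1 <- !is.na(m1) & m1
  m2 <- !is.na(m2) & m2
  .(id = i.id,
    final_1 = sum(m1 | m2),   # candidates with at least one surname match
    final_2 = sum(m1 & m2))   # candidates with both surnames matching
}, by = .EACHI]
```

`counts` has one row per row of p and could then be joined back onto the original p by `id`.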