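I am trying to do some matching based on a one-to-many left_join. The problem is that, even when running the whole thing on a computing cluster, the basic match produces a dataset that is too large to handle.

I get this error: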

#Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
#Join results in more than 2^31 rows (internal vecseq reached physical limit). 
#Very likely misspecified join. 
#Check for duplicate key values in i each of which join to the same group in x over and over again. 
#If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. 
#Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

#Calls: polbur_match_nb ... dt_eval -> eval_tidy -> [ -> [.data.table -> vecseq

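The join itself is not misspecified, but I need to find a way to do everything in one go. I tried splitting the left-hand database into more reasonable chunks (by state) and running my_matching_function on each chunk.

However, even with a micro-sample database for both p and b, this is very slow. What can I do to speed up the whole process? Is there anything in the code that can be improved?

Apologies for the lack of a reproducible example.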
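Before choosing chunks, it may help to estimate how many rows the left join would produce per city, since a few very common cities can account for most of the blow-up. The following is only a sketch on made-up toy tables (the data, and the names p_n, b_n, sizes and total_rows, are hypothetical and not from my real code); it counts rows per key on each side and multiplies them, without ever running the join itself.

```r
library(data.table)

# Toy stand-ins for p and b (hypothetical data, for illustration only)
p <- data.table(id = 1:5, city = c("A", "A", "B", "B", "C"))
b <- data.table(city = c("A", "A", "A", "B", "D"))

# Number of rows per city on each side
p_n <- p[, .(n_p = .N), by = city]
b_n <- b[, .(n_b = .N), by = city]

# Keep every city present in p, as a left join would
sizes <- merge(p_n, b_n, by = "city", all.x = TRUE)
sizes[is.na(n_b), n_b := 0L]

# Each p row yields one output row per matching b row,
# or a single NA-filled row when the city has no match in b.
# as.numeric() avoids integer overflow on large tables.
sizes[, n_out := as.numeric(n_p) * pmax(n_b, 1L)]

total_rows <- sum(sizes$n_out)
print(sizes[order(-n_out)])
print(total_rows)
```

If total_rows comes out above 2^31, the vecseq error is expected, and the cities at the top of sizes are the ones worth chunking or aggregating separately.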

match_bystate <- function(r,p,b){
  
  gc()
  states <- sort(unique(p$state))
  
  matches_list = list()
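  for(i in seq_along(states)){
    # Subset p to the current state
    p_state <- p %>% 
      filter(state==states[i])
    # Pass the state subset (p_state), not the full p, to the matching function
    matches_final <- my_matching_function(r,p_state,b)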
    matches_list[[i]] <- matches_final
  }
  
  final = do.call(rbind, matches_list)
  saveRDS(final,file=paste0("../",gsub("-","",str_sub(Sys.time(),1,10)),"_match_",r,".RDS"))
  
}

Here is a simplified version of my_matching_function:

my_matching_function <- function(r,p,b){

  p_original <- p
  
  p <- p %>% 
    dplyr::select(id,city,lastname1,lastname2) 

  b <- b %>% 
    dplyr::select(city,lastname1,lastname2) %>% 
    dplyr::rename("lastname1_match"="lastname1",
                  "lastname2_match"="lastname2") 

  matches <- p %>% 
    data.table::data.table() %>%  
    lazy_dt(immutable = FALSE) %>% 
    dplyr::left_join(b, by = "city") %>% 
    dplyr::mutate(match_1=tidyr::replace_na(ifelse(lastname1==lastname1_match|lastname1==lastname2_match,1,0),0)) %>% 
    dplyr::mutate(match_2=tidyr::replace_na(ifelse(lastname2==lastname1_match|lastname2==lastname2_match,1,0),0)) %>%
    as.data.frame() %>% 
    dplyr::mutate(sum = rowSums(across(match_1:match_2))) %>%
    data.table::data.table() %>%  
    dplyr::mutate(final_1 = ifelse(sum>=1,1,0)) %>%
    dplyr::mutate(final_2 = ifelse(sum>=2,1,0)) %>% 
    group_by_at(c(names(p))) %>%
    dplyr::summarise(final_1 = sum(final_1),
                     final_2 = sum(final_2)) %>% 
    as.data.frame()
  Sys.sleep(60)
  
  matches_final <- p_original %>% 
    dplyr::left_join(matches) %>% 
    dplyr::mutate(raisyear=r)
  
  return(matches_final) 
}
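The error message suggests by=.EACHI as a way to avoid the large allocation, so one direction I am considering is to compute final_1 and final_2 directly inside a data.table join, one row of p at a time, instead of materialising the full join first. The following is only a sketch on made-up toy data (the table contents and the name res are hypothetical); it is meant to mirror the logic of my_matching_function above, not to replace it as written.

```r
library(data.table)

# Toy stand-ins for p and b (hypothetical data, for illustration only)
p <- data.table(
  id        = 1:4,
  city      = c("A", "A", "B", "C"),
  lastname1 = c("Silva", "Costa", "Lima", "Reis"),
  lastname2 = c("Souza", "Silva", "Melo", "Dias")
)
b <- data.table(
  city      = c("A", "A", "B"),
  lastname1 = c("Silva", "Souza", "Rocha"),
  lastname2 = c("Souza", "Pinto", "Lima")
)
setnames(b, c("lastname1", "lastname2"),
            c("lastname1_match", "lastname2_match"))

# For each row of p (by = .EACHI), j sees only the b rows of that city,
# so only one group is held in memory at a time.
res <- b[p, on = "city", {
  m1 <- as.integer(i.lastname1 == lastname1_match | i.lastname1 == lastname2_match)
  m2 <- as.integer(i.lastname2 == lastname1_match | i.lastname2 == lastname2_match)
  m1[is.na(m1)] <- 0L   # same role as replace_na(..., 0)
  m2[is.na(m2)] <- 0L
  s <- m1 + m2
  list(id = i.id,
       final_1 = sum(s >= 1L),   # b rows matching at least one last name
       final_2 = sum(s >= 2L))   # b rows matching both last names
}, by = .EACHI]

print(res)
```

The result has one row per row of p (with columns city, id, final_1 and final_2), so it could be joined back to the original p by id. Rows of p whose city has no match in b get zero counts, as in the left_join version.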


