I am trying to find occurrences of about 10,000 different locations in a list of emails. What I need is one vector with the most frequently mentioned location per email, a second vector with the second most frequent, and a third vector with the third most frequent!
Since my dataset is large, I have performance problems. I tried it with stringi and the parallel package, but it still runs very slowly (about 15 minutes for 20,000 emails and 10,000 locations). The input data (emails and cities) looks like this:
SearchVector = c('Berlin', 'Amsterdam', 'San Francisco', 'Los Angeles') ...
g$Message = c('This is the first mail from paris. Berlin is a nice place', 'This is the 2nd mail from San francisco. Beirut is a nice place to stay', 'This is the 3rd mail. Los Angeles is a great place') ...
Here is my code using stringi:
# libraries
library(doParallel)
library(stringi)
detectCores()
registerDoParallel(cores=7)
getDoParWorkers()
# function
# count occurrences of `keyword` at the start or end of `data`,
# or surrounded by spaces
getCount <- function(data, keyword)
{
  keyword2 <- paste0("^(", keyword, ")|(", keyword, ")$|[ ](", keyword, ")[ ]")
  wcount <- stri_count(data, regex = keyword2)
  return(data.frame(wcount))
}
SearchVector = as.vector(countryList2)
Text = g$Message
result = foreach(i = Text, .combine = rbind, .inorder = FALSE,
                 .packages = c('stringi'), .errorhandling = c('remove')) %dopar%
{
  cities = as.data.frame(t(getCount(i, SearchVector)))
  colnames(cities) = SearchVector
  sorted = names(sort(cities, decreasing = TRUE))
  nHits = length(cities[which(cities > 0)])
  if (nHits == 1) {
    cityName1 = sorted[1]
    cityName2 = NA
  } else if (nHits > 1) {
    cityName1 = sorted[1]
    cityName2 = sorted[2]
  } else {
    cityName1 = NA
    cityName2 = NA
  }
  return(data.frame(cityName1, cityName2))
}
g$cityName1 = result[, 1]
g$cityName2 = result[, 2]
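For comparison, here is a vectorized sketch of what I think the hot path could look like (assuming the same `SearchVector` and `Text` as above; the `\b` word-boundary pattern is my assumption and matches slightly differently than my regex above). Instead of building a data frame and sorting inside the loop for every email, it counts each city across all messages at once, producing a messages-by-cities count matrix:

```r
library(stringi)

SearchVector <- c('Berlin', 'Amsterdam', 'San Francisco', 'Los Angeles')
Text <- c('This is the first mail from paris. Berlin is a nice place',
          'This is the 2nd mail from San francisco. Beirut is a nice place to stay',
          'This is the 3rd mail. Los Angeles is a great place')

# one pattern per city, matched on word boundaries (assumption)
patterns <- paste0("\\b", SearchVector, "\\b")

# counts[msg, city]: one stri_count_regex call per city over ALL messages
counts <- vapply(patterns,
                 function(p) stri_count_regex(Text, p),
                 integer(length(Text)))
colnames(counts) <- SearchVector

# pick the rank-th most frequent city per row, NA when its count is zero
topCity <- function(row, rank) {
  ord <- order(row, decreasing = TRUE)
  if (row[ord[rank]] > 0) SearchVector[ord[rank]] else NA_character_
}
cityName1 <- apply(counts, 1, topCity, rank = 1)
cityName2 <- apply(counts, 1, topCity, rank = 2)
```

This runs `length(SearchVector)` regex passes instead of `length(Text) * length(SearchVector)` individual calls, and the count matrix could then be split across workers by city rather than by email.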
Any ideas how I can speed this up, for example by using an index or an equality check first? I am really looking forward to getting help on this.
Many thanks, Clemens