r - stri_replace_all_fixed slow on big data set - is there an alternative?

Question

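I am trying to stem about 4000 documents in R using the stri_replace_all_fixed function. However, it is very slow, because my dictionary of stemmed words contains approximately 300k words. I am doing this because the documents are in Danish, so the Porter stemmer algorithm is not useful (it is too aggressive).

I have posted the code below. Does anyone know an alternative way of doing this?

Logic: look at each word in each document -> if word = a word from the voc table, then replace it with the corresponding tran word.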

## Read in the dictionary (keep the columns as character vectors, not factors)
voc <- read.table("danish.csv", header = TRUE, sep = ";", stringsAsFactors = FALSE)
# 'tm' provides tm_map(); 'stringi' does the replacement
library(tm)
library(stringi)
# Take the word and stem columns as plain character vectors
word <- voc$Word
tran <- voc$Stem
# Use stri_replace_all_fixed to stem words ('docs' is the existing tm corpus)
## !! NOTE THAT THE FOLLOWING STEP MIGHT TAKE A FEW MINUTES DEPENDING ON THE SIZE !! ##
docs <- tm_map(docs, content_transformer(function(x) stri_replace_all_fixed(x, word, tran, vectorize_all = FALSE)))

Structure of the "voc" data frame:

       Word           Stem
1      abandonnere    abandonner
2      abandonnerede  abandonner
3      abandonnerende abandonner
...
313273 åsyns          åsyn

score 0 · Accepted Answer

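To make a dictionary lookup fast, you need to implement a suitable data structure such as a prefix tree. Running 300,000 separate search-and-replace passes simply does not scale.

I don't think this will be efficient in R; you will probably need to write a C or C++ extension. You have many tiny operations there, and the overhead of the R interpreter will kill you if you try to do this in pure R.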
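
Before writing an extension, it may be worth trying the same idea (one lookup per word instead of 300,000 scans per document) with R's vectorized match(), which hashes the lookup table internally. The sketch below is not a prefix tree and is not from the original post: stem_by_lookup is a hypothetical helper, the toy voc table only reuses three rows shown in the question, and whether it is fast enough for the full data set is untested here. It replaces whole tokens only and the comparison is case-sensitive.

```r
library(stringi)

# Hypothetical helper: stem all documents with a single dictionary lookup.
# texts: character vector of documents; word/tran: dictionary columns.
stem_by_lookup <- function(texts, word, tran) {
  # Split at word boundaries. Whitespace and punctuation come back as
  # their own tokens, so pasting the tokens together restores the text.
  tokens <- stri_split_boundaries(texts, type = "word")
  doc_id <- rep(seq_along(tokens), lengths(tokens))
  flat <- unlist(tokens)

  # One vectorized lookup for every token of every document.
  idx <- match(flat, word)      # NA when the token is not in the dictionary
  hit <- !is.na(idx)
  flat[hit] <- tran[idx[hit]]

  # Regroup the tokens by document and glue each document back together.
  groups <- split(flat, factor(doc_id, levels = seq_along(tokens)))
  unname(vapply(groups, paste, character(1), collapse = ""))
}

# Toy dictionary built from rows shown in the question.
voc <- data.frame(
  Word = c("abandonnere", "abandonnerede", "abandonnerende"),
  Stem = c("abandonner", "abandonner", "abandonner"),
  stringsAsFactors = FALSE
)

stem_by_lookup("de abandonnerede planen", voc$Word, voc$Stem)
# expected: "de abandonner planen"
```

With a tm corpus, the helper could presumably be applied through tm_map(docs, content_transformer(function(x) stem_by_lookup(x, voc$Word, voc$Stem))). Unlike stri_replace_all_fixed, this does not replace dictionary entries that occur inside longer words, which matches the word-by-word logic described in the question.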
