r - R: text2vec DTM's number of documents does not match the number of original documents

Question

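I am a student who uses text2vec regularly.

Until last year, I used the package without any problems.

But today, when I built a DTM using the parallel functionality, the number of documents in the DTM did not match the number of original documents.

The number of documents in the DTM equals the number of original documents divided by the number of registered cores. So I suspect the per-worker results are not being merged after parallel processing.

Here is the code I tested.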

library(stringr)
library(text2vec)
library(data.table)
library (parallel)
library (doParallel)

N <- detectCores()
cl <- makeCluster (N)
registerDoParallel (cl)

data("movie_review")

setDT(movie_review)
setkey(movie_review, id)

## the number of documents is 5000
IT <- itoken_parallel (movie_review$review,
                       ids          = movie_review$id,
                       tokenizer    = word_tokenizer,
                       progressbar  = F)


VOCAB <- create_vocabulary (
    IT, 
    ngram = c(1, 1)) %>%
    prune_vocabulary (term_count_min = 3)

VoCAB.order <- VOCAB[order((VOCAB$term_count), decreasing = T),]

VECTORIZER <- vocab_vectorizer (VOCAB)

DTM <- create_dtm (IT,              
                   VECTORIZER,      
                   distributed = F)

## The DTM does not have 5000 rows. It has 5000 / 4 (number of cores) = 1250
dim(DTM)

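I checked the text2vec itoken function in the vignette. I found an example that tests parallel processing with itoken, and it ran fine without errors.

In this process, how can I use stop words and a minimum-frequency threshold?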

N_WORKERS = 1 # change 1 to number of cores in parallel backend
if(require(doParallel)) registerDoParallel(N_WORKERS)
data("movie_review")
it = itoken_parallel(movie_review$review[1:100], n_chunks = N_WORKERS)
system.time(dtm <- create_dtm(it, hash_vectorizer(2**16), type = 'dgTMatrix'))

I look forward to a sincere answer.

Thank you for your attention.

score 1 · Accepted Answer

Hi, please remove distributed = F. This is a bug (here distributed = F gets captured by the ellipsis argument). I will fix it. Thanks for reporting!
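As a rough illustration of the suggested fix, the sketch below rebuilds the DTM with the distributed argument simply left out. It assumes a text2vec version that provides itoken_parallel; the worker count, the lowercasing preprocessor, and the short stop-word vector are illustrative choices, not part of the original question. The stopwords argument of create_vocabulary and term_count_min in prune_vocabulary cover stop words and minimum frequency for the vocabulary-based workflow.

```r
library(text2vec)

# Illustrative worker count (assumption); registering a backend is optional
# and mirrors the pattern used in the vignette snippet from the question.
N_WORKERS <- 2
if (require(doParallel)) registerDoParallel(N_WORKERS)

data("movie_review")

# Parallel iterator over all 5000 reviews
it <- itoken_parallel(movie_review$review,
                      preprocessor = tolower,
                      tokenizer    = word_tokenizer,
                      n_chunks     = N_WORKERS)

# Hypothetical stop-word list, only for illustration
stop_words <- c("the", "a", "an", "and", "of")

vocab <- create_vocabulary(it, stopwords = stop_words)
vocab <- prune_vocabulary(vocab, term_count_min = 3)

# Note: no `distributed` argument here
dtm <- create_dtm(it, vocab_vectorizer(vocab))

# The row count should equal the number of input documents
dim(dtm)
```

If the row count still comes out as the document count divided by the number of workers, the installed text2vec is probably older than the fix mentioned in this answer.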

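Regarding the second question: there is no good solution. You can manually compute frequent/infrequent words (actually hashes) using the colSums function, but I don't recommend it.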
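For completeness, here is a minimal sketch of what that manual colSums approach could look like. It uses a plain (non-parallel) itoken to stay small. The threshold min_count, the stop-word vector, and the tokenizer wrapper are assumptions added for illustration; filtering stop words inside the tokenizer is not something the answer proposes. Because several terms can hash to the same column, the counts are per hash bucket rather than per word, which is one reason this is discouraged.

```r
library(text2vec)
library(Matrix)

data("movie_review")

# Hypothetical stop-word list and a tokenizer wrapper that drops those tokens
stop_words <- c("the", "a", "an", "and", "of")
tok_no_stop <- function(x) {
  lapply(word_tokenizer(x), function(tokens) tokens[!tokens %in% stop_words])
}

it <- itoken(movie_review$review[1:100],
             preprocessor = tolower,
             tokenizer    = tok_no_stop,
             progressbar  = FALSE)

# Hashed DTM: columns are hash buckets, not named terms
dtm <- create_dtm(it, hash_vectorizer(hash_size = 2^16))

# Total count per hash bucket across all documents
bucket_counts <- Matrix::colSums(dtm)

# Keep only buckets seen at least min_count times (illustrative threshold)
min_count <- 3
dtm_pruned <- dtm[, bucket_counts >= min_count, drop = FALSE]

dim(dtm_pruned)
```

The same column filter has to be applied to any new data hashed with the same hash_size, so the logical index bucket_counts >= min_count would need to be stored alongside the model.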

UPD: this is now fixed.
