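I have exactly the same problem as in the post below, except that I am using quanteda to build a dfm for an SVM model (I need the dfms to have identical features for cross-validated prediction): How to recreate same DocumentTermMatrix with new (test) data
However, my training set (trainingtfidf, the counterpart of crude1.dtm in that post) has 170,000+ documents and my test set (testtfidf, the counterpart of crude2.dtm) has 670,000+, so I cannot convert my new test set to a matrix or data frame: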
>testtfidf <- as.data.frame(testtfidf)
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
So I tried to do the same thing directly on the dfm:
# Keep the column names in test set which are shared with training set
testtfidf1 <- testtfidf[, intersect(colnames(testtfidf), colnames(trainingtfidf))]
# Extracting column names in training set but not in testset
namevactor <- colnames(protocoltfidf)[which(!colnames(protocoltfidf) %in% colnames(testtfidf1)==TRUE)]
# Add the columns back to test set and set the elements as NA since the terms do not exist in the test set
testtfidf1[,namevactor] <- NA
But the last line gives me this error:
Error in intI(i, n = di[margin], dn = dn[[margin]], give.dn = FALSE) :
invalid character indexing
Can anyone help me solve this? I have been struggling with it for two days and I am very close to getting this done. Thanks!
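
One direction that may avoid the dense conversion entirely: the failing line tries to assign into column names that do not yet exist in a sparse matrix, which the Matrix indexing code appears to reject. Binding an all-zero sparse block for the missing terms and then reordering the columns should give the same result without that assignment. The sketch below is only an illustration under stated assumptions: it uses plain Matrix objects with made-up toy data rather than real quanteda dfms, `align_columns` is a hypothetical helper name, and it pads with zeros instead of NA because a term that is absent from a document normally has weight 0. If the installed quanteda version provides `dfm_match()`, calling it with the training feature names may achieve the same thing in one step, but that has not been checked against data of this size.

```r
library(Matrix)

# Toy stand-ins for the two dfms (hypothetical data, not the real tf-idf matrices)
train <- sparseMatrix(i = c(1, 2, 2), j = c(1, 2, 3), x = c(1, 2, 3),
                      dims = c(2, 3),
                      dimnames = list(c("d1", "d2"), c("apple", "bank", "cat")))
test <- sparseMatrix(i = c(1, 1), j = c(1, 2), x = c(4, 5),
                     dims = c(1, 2),
                     dimnames = list("t1", c("bank", "dog")))

# Hypothetical helper: give `test` exactly the columns of `train`, staying sparse
align_columns <- function(test, train) {
  shared  <- intersect(colnames(test), colnames(train))
  missing <- setdiff(colnames(train), colnames(test))
  # All-zero sparse block for the terms that occur only in the training set
  pad <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
                      dims = c(nrow(test), length(missing)),
                      dimnames = list(rownames(test), missing))
  # Keep the shared terms, append the zero block, then use the training order
  out <- cbind(test[, shared, drop = FALSE], pad)
  out[, colnames(train), drop = FALSE]
}

aligned <- align_columns(test, train)
print(aligned)
```

Terms that occur only in the test set ("dog" in the toy data) are dropped, and terms that occur only in the training set come back as zero columns, so the result has the same column names in the same order as the training matrix.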