r - R: Recreate the same document-term matrix with new data

Question

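I have exactly the same problem as in the post linked below, except that I use #quanteda to generate a dfm for an SVM model (because I need exactly the same dfms for cross-validation prediction): How to recreate same DocumentTermMatrix with new (test) data

However, my training set (trainingtfidf, like crude1.dtm in that post) has 170000+ documents and my test set (testtfidf, like crude2.dtm in that post) has 670000+, so I cannot convert my new test set into a matrix or a data frame: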

>testtfidf <- as.data.frame(testtfidf)
Error in asMethod(object) : 
      Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

So I tried to do it directly on the dfm:

# Keep the column names in test set which are shared with training set
testtfidf1 <- testtfidf[, intersect(colnames(testtfidf), colnames(trainingtfidf))]
# Extracting column names in training set but not in testset
namevactor <- colnames(protocoltfidf)[which(!colnames(protocoltfidf) %in% colnames(testtfidf1)==TRUE)]
# Add the columns back to test set and set the elements as NA since the terms do not exist in the test set
testtfidf1[,namevactor] <- NA

But it gives me an error on the last line:

Error in intI(i, n = di[margin], dn = dn[[margin]], give.dn = FALSE) : 
invalid character indexing

Can anyone help me with this? I have been struggling for two days and I am so close to getting this done! Thanks!

score 0 · Accepted Answer

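This answer is still a bit rough, but I think this may be where the problem lies. It looks like the error is raised from Tsparse.R, a source file of the Matrix package, which defines a function intI(). That function is reproduced at the bottom of this post, and it is where your error occurs. However, you can avoid running into it altogether. Consider the following: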

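It appears that protocoltfidf is the original data set. The second statement of your snippet extracts the column names of protocoltfidf that are not in the test data set. So namevactor is a character vector, none of whose elements is a column name in testtfidf1.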

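This may oversimplify things, but your problem may simply be that you are trying to assign NA values to columns that do not exist in testtfidf1. Remember that namevactor contains the column names that are absent from testtfidf1, so the line testtfidf1[,namevactor] <- NA refers to columns that are not there. That is probably why the lookup of those columns fails.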
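To make the failure concrete, here is a minimal sketch on a tiny sparse matrix from the Matrix package. The matrix `m` and the term names are made up for illustration, and the exact wording of the error may differ between Matrix versions, so the sketch only checks that an error is raised.

```r
library(Matrix)

# Hypothetical 2 x 2 sparse matrix standing in for testtfidf1.
m <- sparseMatrix(i = c(1, 2), j = c(1, 2), x = c(1, 1),
                  dimnames = list(c("doc1", "doc2"), c("apple", "banana")))

# match() returns NA for any name that is not among the column names;
# this is the condition the character branch of intI() tests with anyNA().
idx <- match(c("apple", "cherry"), colnames(m))

# Assigning into a column that does not exist fails; the matrix is not grown.
res <- tryCatch({ m[, "cherry"] <- NA; "no error" }, error = function(e) e)
```

Here `idx` holds an NA in the position of "cherry", and `res` is an error condition rather than the string "no error".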

Maybe try creating new columns in testtfidf1 whose names are the strings in namevactor, and set the values in those columns to NA.
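One way to create those columns without converting to a dense matrix is sketched below, using only the Matrix package. The helper `align_columns()` and the toy data are hypothetical, not part of quanteda or Matrix. The sketch also departs from the suggestion above in one respect: it fills the new columns with zeros rather than NA, on the assumption that a term absent from the test documents should carry a weight of 0 and that the matrix should stay sparse. If your quanteda version provides `dfm_match()`, something like `dfm_match(testtfidf, features = featnames(trainingtfidf))` may do the same job directly; check the documentation for your version.

```r
library(Matrix)

# Hypothetical helper: return `test` with exactly the columns named in
# `train_features`, in that order. Columns unknown to training are dropped.
align_columns <- function(test, train_features) {
  shared  <- intersect(train_features, colnames(test))
  missing <- setdiff(train_features, colnames(test))
  # All-zero sparse block for terms that never occur in the test documents.
  pad <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
                      dims = c(nrow(test), length(missing)),
                      dimnames = list(rownames(test), missing))
  out <- cbind(test[, shared, drop = FALSE], pad)
  # Every requested name now exists, so character indexing is safe.
  out[, train_features, drop = FALSE]
}

# Toy data: "durian" is only in the test set, "apple" and "cherry" only in training.
train <- sparseMatrix(i = c(1, 2, 2), j = c(1, 2, 3), x = c(1, 2, 3),
                      dimnames = list(c("d1", "d2"),
                                      c("apple", "banana", "cherry")))
test <- sparseMatrix(i = c(1, 1), j = c(1, 2), x = c(5, 7),
                     dimnames = list("t1", c("banana", "durian")))

aligned <- align_columns(test, colnames(train))
```

After this, `aligned` has the same column names as `train`, in the same order, so a model fitted on the training matrix sees features in the positions it expects.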

intI <- function(i, n, dn, give.dn = TRUE)
{
    ## Purpose: translate numeric | logical | character index
    ##      into 0-based integer
    ## ----------------------------------------------------------------------
    ## Arguments: i: index vector (numeric | logical | character)
    ##        n: array extent           { ==  dim(.) [margin] }
    ##       dn: character col/rownames or NULL { == dimnames(.)[[margin]] }
    ## ----------------------------------------------------------------------
    ## Author: Martin Maechler, Date: 23 Apr 2007

    has.dn <- !is.null.DN(dn)
    DN <- has.dn && give.dn
    if(is(i, "numeric")) {
        storage.mode(i) <- "integer"
        if(anyNA(i))
            stop("'NA' indices are not (yet?) supported for sparse Matrices")
        if(any(i < 0L)) {
            if(any(i > 0L))
                stop("you cannot mix negative and positive indices")
            i0 <- (0:(n - 1L))[i]
        } else {
            if(length(i) && max(i, na.rm=TRUE) > n)
                stop(gettextf("index larger than maximal %d", n), domain=NA)
            if(any(z <- i == 0)) i <- i[!z]
            i0 <- i - 1L        # transform to 0-indexing
        }
        if(DN) dn <- dn[i]
    }
    else if (is(i, "logical")) {
        if(length(i) > n)
            stop(gettextf("logical subscript too long (%d, should be %d)",
                          length(i), n), domain=NA)
        i0 <- (0:(n - 1L))[i]
        if(DN) dn <- dn[i]
    } else { ## character
        if(!has.dn)
            stop("no 'dimnames[[.]]': cannot use character indexing")
        i0 <- match(i, dn)
        if(anyNA(i0)) stop("invalid character indexing")
        if(DN) dn <- dn[i0]
        i0 <- i0 - 1L
    }
    if(!give.dn) i0 else list(i0 = i0, dn = dn)
} ## {intI}
