r - R: How to map test data into an LSA space created from training data

Question

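I am trying to use LSA for text analysis. I have read many other posts about LSA on StackOverflow, but I have not found one similar to mine. If you know of one, please point me to it. Much appreciated!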

Here is reproducible code for the sample data I created:

Create the sample training and test sets

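library(data.table); library(tm); library(SnowballC)  # data.table(), Corpus()/TermDocumentMatrix(), wordStem()
library(lsa); library(caret)                          # lsa()/fold_in(), train()
sentiment = c(1,1,0,1,0,1,0,0,1,0)
length(sentiment) #10
text = c('im happy', 'this is good', 'what a bummer X(', 'today is kinda okay day for me', 'i somehow messed up big time', 
         'guess not being promoted is not too bad :]', 'stayhing home is boring :(', 'kids wont stop crying QQ', 'warriors are legendary!', 'stop reading my tweets!!!')
# Name the columns explicitly so the label column is called "sentiment" rather than V1
train_data = data.table(sentiment = as.factor(sentiment), text = text)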
> train_data
    sentiment                                text
 1:  1                                   im happy
 2:  1                               this is good
 3:  0                           what a bummer X(
 4:  1             today is kinda okay day for me
 5:  0               i somehow messed up big time
 6:  1 guess not being promoted is not too bad :]
 7:  0                 stayhing home is boring :(
 8:  0                   kids wont stop crying QQ
 9:  1                    warriors are legendary!
10:  0                  stop reading my tweets!!!

sentiment = c(0,1,0,0)
text = c('running out of things to say...', 'if you are still reading, good for you!', 'nothing ended on a good note today', 'seriously sleep deprived!! >__<')
test_data = data.table(sentiment = as.factor(sentiment), text = text)
> test_data
   sentiment                                    text
1:         0         running out of things to say...
2:         1 if you are still reading, good for you!
3:         0      nothing ended on a good note today
4:         0         seriously sleep deprived!! >__<

Preprocessing the training data set

corpus.train = Corpus(VectorSource(train_data$text))

Create a term-document matrix for the training set

tdm.train = TermDocumentMatrix(
  corpus.train,
  control = list(
    removePunctuation = TRUE,
    stopwords = stopwords(kind = "en"),
    stemming = function(word) wordStem(word, language = "english"),
    removeNumbers = TRUE, 
    tolower = TRUE,
    weighting = weightTfIdf)
)

Convert to a matrix (for later use)

train_matrix = as.matrix(tdm.train)

Create the LSA space from the training data

lsa.train = lsa(tdm.train, dimcalc_share())

Set the number of dimensions (I picked one arbitrarily here, because the data is too small to produce an elbow)

k = 6

Project the training matrix into the new LSA space

projected.train = fold_in(docvecs = train_matrix, LSAspace = lsa.train)[1:k,]

Convert the projected data above to a matrix

projected.train.matrix = matrix(projected.train, 
                                nrow = dim(projected.train)[1],
                                ncol = dim(projected.train)[2])

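Train the random forest model (somehow this step no longer works on this small sample data... but that's fine, it is not a big deal for this question; still, if you can also help me fix that error, that would be great! I tried googling the error, but that did not fix it...)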

trcontrol_rf = trainControl(method = "boot", p = .75, trim = T)
model_train_caret = train(x = t(projected.train.matrix), y = train_data$sentiment, method = "rf", trControl = trcontrol_rf)

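Preprocessing the test data set

Basically I am repeating everything I did for the training data set, except that I do not use the test set to create its own LSA space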

corpus.test = Corpus(VectorSource(test_data$text))

Create a term-document matrix for the test set

tdm.test = TermDocumentMatrix(
  corpus.test,
  control = list(
    removePunctuation = TRUE,
    stopwords = stopwords(kind = "en"),
    stemming = function(word) wordStem(word, language = "english"),
    removeNumbers = TRUE, 
    tolower = TRUE,
    weighting = weightTfIdf)
)

Convert to a matrix (for later use)

test_matrix = as.matrix(tdm.test)

Project the test matrix into the trained LSA space (this is where the problem is)

projected.test = fold_in(docvecs = test_matrix, LSAspace = lsa.train)

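But I get an error: Error in crossprod(docvecs, LSAspace$tk) : non-conformable arguments

I haven't found any useful Google results for this error... (there was only one page of search results on Google QQ) Any help is greatly appreciated! Thanks!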

score 2 · Accepted Answer

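When you build the LSA model, you are using the vocabulary of the training data. But when you build the TermDocumentMatrix for the test data, you are using the vocabulary of the test data. The LSA model only knows how to handle documents tabulated against the training data's vocabulary.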
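
To see why the mismatch surfaces as a crossprod error, here is a minimal base R sketch. The matrices are made-up toy data, not your actual TDMs: tk stands in for LSAspace$tk (training terms in rows), and test_tdm stands in for a test matrix built from its own vocabulary.

```r
# Toy stand-in for LSAspace$tk: 5 training terms x 2 latent dimensions.
train_terms <- c("bad", "good", "happi", "home", "stop")
tk <- matrix(seq_len(10) / 10, nrow = 5, ncol = 2,
             dimnames = list(train_terms, NULL))

# Toy test term-document matrix built from the test vocabulary:
# 3 terms x 2 documents, so its rows do not line up with the rows of tk.
test_tdm <- matrix(c(1, 0, 0, 1, 1, 1), nrow = 3, ncol = 2,
                   dimnames = list(c("good", "sleep", "today"), c("d1", "d2")))

# crossprod(x, y) computes t(x) %*% y, so x and y must have the same
# number of rows. Here 3 != 5, so the call fails; tryCatch captures the
# error message instead of stopping the script.
result <- tryCatch(crossprod(test_tdm, tk),
                   error = function(e) conditionMessage(e))
print(result)
```

The call fails purely because the row counts differ, which is the situation you get when the test TDM has a different set of terms than the training TDM.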

One way to fix this is to create the test TDM with its dictionary set to the vocabulary of the training data:

tdm.test = TermDocumentMatrix(
    corpus.test,
    control = list(
        removeNumbers = TRUE, 
        tolower = TRUE,
        stopwords = stopwords("en"),
        stemming = TRUE,
        removePunctuation = TRUE,
        weighting = weightTfIdf,
        dictionary=rownames(tdm.train)
    )
)
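
If it helps to picture what fixing the vocabulary achieves, the effect can be imitated by hand in base R. This is only a sketch on toy data: align_to_vocab is a hypothetical helper written for illustration (it is not part of tm or lsa), and tk is a made-up stand-in for LSAspace$tk. It re-indexes a test matrix onto the training terms, filling unseen training terms with zeros and dropping terms the training data never saw.

```r
# Hypothetical helper (not from tm or lsa): re-index a term-document
# matrix onto a fixed vocabulary. Terms missing from tdm become zero
# rows; terms in tdm that are not in vocab are dropped.
align_to_vocab <- function(tdm, vocab) {
  aligned <- matrix(0, nrow = length(vocab), ncol = ncol(tdm),
                    dimnames = list(vocab, colnames(tdm)))
  shared <- intersect(rownames(tdm), vocab)
  aligned[shared, ] <- tdm[shared, , drop = FALSE]
  aligned
}

# Toy training vocabulary and a stand-in for LSAspace$tk (5 terms x 2 dims).
train_terms <- c("bad", "good", "happi", "home", "stop")
tk <- matrix(seq_len(10) / 10, nrow = 5, ncol = 2,
             dimnames = list(train_terms, NULL))

# Toy test matrix over the test vocabulary (3 terms x 2 documents).
test_tdm <- matrix(c(1, 0, 0, 1, 1, 1), nrow = 3, ncol = 2,
                   dimnames = list(c("good", "sleep", "today"), c("d1", "d2")))

aligned <- align_to_vocab(test_tdm, train_terms)  # now 5 terms x 2 documents
projected <- crossprod(aligned, tk)               # 2 documents x 2 dimensions
print(dim(aligned))
print(projected)
```

Once the rows match the training terms, the same crossprod call that appears in your error message goes through. In the real pipeline the dictionary option does this alignment for you, so the helper is not needed there.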
