r - How to generate a document-term matrix in text2vec from only a stored list of words

Question

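What is the syntax in text2vec for vectorizing text and building a dtm using only a given list of words?

How can I vectorize and produce a document-term matrix on specified features only? If a feature does not appear in the text, its column should stay empty.

I need to produce a document-term matrix with exactly the same columns as the dtm the model was trained on, otherwise I cannot use the random forest model on new documents.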

score 2 · Accepted Answer

You can create a document-term matrix from only a specific set of features:

library(text2vec)
# `it` is assumed to be an itoken() iterator over the new documents
v = create_vocabulary(c("word1", "word2"))
vectorizer = vocab_vectorizer(v)
dtm_test = create_dtm(it, vectorizer)
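For a fuller picture, here is a minimal end-to-end sketch of the same idea. The word list `stored_terms` and the texts in `new_docs` are made up for illustration and are not from the original post. It assumes that `create_vocabulary()` accepts a character vector of user-defined terms, as the snippet above relies on.

```r
library(text2vec)

# Hypothetical inputs, for illustration only
stored_terms <- c("apple", "banana", "cherry")          # word list saved at training time
new_docs     <- c("apple durian", "cherry cherry banana")

# Build the vocabulary directly from the stored terms
v          <- create_vocabulary(stored_terms)
vectorizer <- vocab_vectorizer(v)

# Tokenize the new documents and project them onto the fixed vocabulary
it_new  <- itoken(new_docs, tokenizer = word_tokenizer, progressbar = FALSE)
dtm_new <- create_dtm(it_new, vectorizer)

# One column per stored term; words outside the list ("durian") are dropped
dim(dtm_new)
colnames(dtm_new)
```

Terms from the list that never occur in the new documents still get a column, filled with zeros, which is what a model trained on the original dtm needs. The column order should be checked against the training dtm before predicting.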

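However, I would not recommend 1) using random forest on such sparse data, since it will not perform well, or 2) doing feature selection the way you describe, since you will probably overfit.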

score 2 · Accepted Answer

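I need to produce a document-term matrix with exactly the same columns as the dtm the model was trained on, otherwise I cannot use the random forest model on new documents.

In quanteda, you can use dfm_select(). For example, to make dfm1 below have the same features as dfm2: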

txts <- c("a b c d", "a a b b", "b c c d e f")

(dfm1 <- dfm(txts[1:2]))
## Document-feature matrix of: 2 documents, 4 features (25% sparse).
## 2 x 4 sparse Matrix of class "dfmSparse"
##        features
## docs    a b c d
##   text1 1 1 1 1
##   text2 2 2 0 0
(dfm2 <- dfm(txts[2:3]))
## Document-feature matrix of: 2 documents, 6 features (41.7% sparse).
## 2 x 6 sparse Matrix of class "dfmSparse"
##        features
## docs    a b c d e f
##   text1 2 2 0 0 0 0
##   text2 0 1 2 1 1 1

dfm_select(dfm1, dfm2, valuetype = "fixed", verbose = TRUE)
## kept 4 features, padded 2 features
## Document-feature matrix of: 2 documents, 6 features (50% sparse).
## 2 x 6 sparse Matrix of class "dfmSparse"
##        features
## docs    a b c d e f
##   text1 1 1 1 1 0 0
##   text2 2 2 0 0 0 0
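The output above comes from an older quanteda release. As a hedged sketch for more recent versions, where to my knowledge `dfm()` expects a tokens object and `dfm_match()` is the function intended for aligning features, the same padding can be written as follows. The name `dfm1_matched` is introduced here for illustration only.

```r
library(quanteda)

txts <- c("a b c d", "a a b b", "b c c d e f")

# Tokenize first, then build the document-feature matrices
dfm1 <- dfm(tokens(txts[1:2]))
dfm2 <- dfm(tokens(txts[2:3]))

# Keep the features dfm1 shares with dfm2 and pad the missing ones with zeros
dfm1_matched <- dfm_match(dfm1, features = featnames(dfm2))

featnames(dfm1_matched)
```

The result has the features of dfm2, in the same order, so it can be passed to a model fitted on dfm2.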

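However, for feature-context matrices (what text2vec needs as input), this will not work, because the co-occurrences (at least those computed with a window rather than a document context) are interdependent across features, so you cannot simply add and remove features in the same way.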
