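You should read ?dictionary carefully, because it is not designed for feature selection (although it can be used for that) but for creating equivalence classes among the values assigned to dictionary keys.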
If your impVariables is a character vector of features, then you should be able to use these commands to perform the selection you want:
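toks <-
    tokens(mycorpus, remove_punct = TRUE) %>%        # tokenize, dropping punctuation
    tokens_select(impVariables, padding = TRUE) %>%  # keep only the selected features
    tokens_wordstem() %>%                            # stem the remaining tokens
    tokens_ngrams(n = 1:2)                           # form unigrams and bigrams
dfm(toks)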
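The last command produces a document-feature matrix containing only the stemmed ngram features selected from the top features of your random forest model. Note that padding = TRUE prevents the formation of ngrams that were never adjacent in the original text. If you don't care about that, set it to FALSE (the default).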
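To make the effect of padding concrete, here is a minimal base-R sketch of the idea. It is an emulation for illustration only, not quanteda's implementation: select_tokens() and bigrams() are hypothetical helpers, and the empty string stands in for a pad.

```r
# Base-R emulation of the padding idea; NOT quanteda's implementation.
# select_tokens() and bigrams() are hypothetical helpers for illustration.
select_tokens <- function(toks, keep, padding = FALSE) {
  if (padding) {
    ifelse(toks %in% keep, toks, "")  # leave an empty slot where a token was removed
  } else {
    toks[toks %in% keep]              # drop removed tokens, closing the gaps
  }
}

bigrams <- function(toks) {
  if (length(toks) < 2) return(character(0))
  left  <- toks[-length(toks)]
  right <- toks[-1]
  ok <- left != "" & right != ""      # never join across an empty slot
  paste(left[ok], right[ok], sep = "_")
}

toks <- c("a", "b", "c", "d", "e", "f")
keep <- c("a", "c", "d", "f")

bigrams(select_tokens(toks, keep, padding = FALSE))
# [1] "a_c" "c_d" "d_f"
bigrams(select_tokens(toks, keep, padding = TRUE))
# [1] "c_d"
```

Without padding, "a" and "c" end up next to each other and form a bigram even though "b" separated them in the original text; with padding, only the genuinely adjacent pair "c" "d" survives.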
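Added:
To select the columns of a dfm based on a character vector of selection words, we can use either of the following two methods.
We will use these sample objects: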
# two sample texts and their dfm representations
txt1 <- c(d1 = "a b c f g h",
          d2 = "a a c c d f f f")
txt2 <- c(d1 = "c c d f g h",
          d2 = "b b d i j")
(dfm1 <- dfm(txt1))
# Document-feature matrix of: 2 documents, 7 features (28.6% sparse).
# 2 x 7 sparse Matrix of class "dfmSparse"
# features
# docs a b c f g h d
# d1 1 1 1 1 1 1 0
# d2 2 0 2 3 0 0 1
(dfm2 <- dfm(txt2))
# Document-feature matrix of: 2 documents, 8 features (43.8% sparse).
# 2 x 8 sparse Matrix of class "dfmSparse"
# features
# docs c d f g h b i j
# d1 2 1 1 1 1 0 0 0
# d2 0 1 0 0 0 2 1 1
impVariables <- c("a", "c", "e", "z")
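Method 1: create a dfm and use dfm_select()
Here, we create a dfm from your character vector of features so that they are registered as features, because that is how dfm_select() works when the selection object is a dfm.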
impVariablesDfm <- dfm(paste(impVariables, collapse = " "))
dfm_select(dfm1, impVariablesDfm)
# Document-feature matrix of: 2 documents, 4 features (50% sparse).
# 2 x 4 sparse Matrix of class "dfmSparse"
# features
# docs a c e z
# d1 1 1 0 0
# d2 2 2 0 0
dfm_select(dfm2, impVariablesDfm)
# Document-feature matrix of: 2 documents, 4 features (87.5% sparse).
# 2 x 4 sparse Matrix of class "dfmSparse"
# features
# docs a c e z
# d1 0 2 0 0
# d2 0 0 0 0
Method 2: create a dictionary and use dfm_lookup()
Let's create a helper function that builds a dictionary from a character vector:
# make a dictionary where each key = its value
char2dictionary <- function(x) {
    result <- as.list(x)  # make the vector into a list
    names(result) <- x
    dictionary(result)
}
Now, using dfm_lookup(), we get only the keys, including those that were never observed:
dfm_lookup(dfm1, dictionary = char2dictionary(impVariables))
# Document-feature matrix of: 2 documents, 4 features (50% sparse).
# 2 x 4 sparse Matrix of class "dfmSparse"
# features
# docs a c e z
# d1 1 1 0 0
# d2 2 2 0 0
dfm_lookup(dfm2, dictionary = char2dictionary(impVariables))
# Document-feature matrix of: 2 documents, 4 features (87.5% sparse).
# 2 x 4 sparse Matrix of class "dfmSparse"
# features
# docs a c e z
# d1 0 2 0 0
# d2 0 0 0 0
Note: the quanteda output above was produced with the version below (but the first method works at least as of v0.9.9.65):
packageVersion("quanteda")
# [1] ‘0.9.9.85’
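As a rough picture of what both methods above return, the following base-R sketch conforms an ordinary count matrix to a fixed feature set, filling unobserved features with zeros. It is only an analogy on a dense matrix, not how quanteda works internally; conform_columns() is a hypothetical helper, and m2 is a hand-typed copy of the dfm2 counts shown earlier.

```r
# conform_columns() is a hypothetical helper, not a quanteda function.
conform_columns <- function(m, features) {
  # start from an all-zero matrix with exactly the requested columns
  out <- matrix(0, nrow = nrow(m), ncol = length(features),
                dimnames = list(rownames(m), features))
  shared <- intersect(features, colnames(m))
  if (length(shared) > 0) {
    out[, shared] <- m[, shared, drop = FALSE]  # copy counts for observed features
  }
  out
}

# dense copy of the dfm2 counts shown above
m2 <- matrix(c(2, 1, 1, 1, 1, 0, 0, 0,
               0, 1, 0, 0, 0, 2, 1, 1),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("d1", "d2"),
                             c("c", "d", "f", "g", "h", "b", "i", "j")))
impVariables <- c("a", "c", "e", "z")

res <- conform_columns(m2, impVariables)
res
#    a c e z
# d1 0 2 0 0
# d2 0 0 0 0
```

The result has the same columns and counts as the dfm_select() and dfm_lookup() output for dfm2: every requested feature gets a column, whether or not it occurs in the texts.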