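I am new to text analysis and am currently trying out the quanteda package in R for my needs. I want to assign different numeric weights to certain specific features and then test the accuracy of the model. I tried the approach from another thread here, "Assigning weights to different features in R", which keeps the dfm class, but I could not get the correct output. Any help would be appreciated.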
Here is what I tried:
##install.packages("quanteda")
require(quanteda)
str <- c("apple is better than banana", "banana banana apple much better",
         "much much better new banana")
weights <- c(apple = 5, banana = 3, much = 0.5)
myDfm <- dfm(str, remove = stopwords("english"), verbose = FALSE)
## output
## Document-feature matrix of: 3 documents, 5 features.
## 3 x 5 sparse Matrix of class "dfmSparse"
##        features
## docs    apple better banana much new
##   text1     1      1      1    0   0
##   text2     1      1      2    1   0
##   text3     0      1      1    2   1
newweights <- weights[featnames(myDfm)]
# reassign 1 to non-matched NAs
newweights[is.na(newweights)] <- 1
# this does not work for me - see the output
myDfm * newweights
## output
## Document-feature matrix of: 3 documents, 5 features.
## 3 x 5 sparse Matrix of class "dfmSparse"
##        features
## docs    apple better banana much new
##   text1     5    0.5    1.0    0   0
##   text2     1    1.0    6.0    5   0
##   text3     0    5.0    0.5    2   1
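For reference, the numbers in this output look consistent with R recycling the weight vector down the columns (element by element in column-major order) rather than applying one weight per feature column. The following is only a minimal sketch of that reading, assuming a plain base-R matrix as a stand-in for the dfm; `counts`, `recycled`, and `weighted` are names made up for illustration, and whether `sweep()` behaves the same way on a dfm object is not checked here.

```r
# Plain matrix holding the same counts as myDfm (a stand-in, not a dfm)
counts <- matrix(
  c(1, 1, 1, 0, 0,
    1, 1, 2, 1, 0,
    0, 1, 1, 2, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(docs = c("text1", "text2", "text3"),
                  features = c("apple", "better", "banana", "much", "new"))
)

weights <- c(apple = 5, banana = 3, much = 0.5)

# Same lookup as in the question: unmatched features become NA, then 1
newweights <- weights[colnames(counts)]
newweights[is.na(newweights)] <- 1
names(newweights) <- colnames(counts)

# Element-wise product: the 5 weights are recycled down each column,
# which reproduces the unexpected output shown above
recycled <- counts * newweights

# Column-wise product: column j is multiplied by newweights[j],
# which is the result I am actually after
weighted <- sweep(counts, 2, newweights, `*`)

print(recycled)
print(weighted)
```

With the plain matrix, `t(t(counts) * newweights)` gives the same values as the `sweep()` call, since transposing first makes the recycling run along features instead of documents.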
Environment details
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 2.2
year 2015
month 08
day 14
svn rev 69053
language R
version.string R version 3.2.2 (2015-08-14)
nickname Fire Safety