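I used quanteda::textmodel_NB to create a model that classifies text into one of two categories. I fit the model on a training dataset from last summer.
Now I am trying to use it this summer to classify new text we receive at work. When I try, I get the following error: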
Error in predict.textmodel_NB_fitted(model, test_dfm) :
feature set in newdata different from that in training set
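The code that raises this error is on lines 157 to 165 of the function's source.
I believe this happens because the words in the training dataset do not exactly match the words used in the test dataset. But why does this raise an error? To be useful on real-world data, it seems to me the model should be able to handle datasets that contain different features, since that will almost always be the case in practice.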
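To make that intuition concrete, here is the multinomial Naive Bayes decision rule as I understand it from the textbook formulation. The notation is my own, and I have not checked that quanteda implements exactly this. I assume $V$ is the training vocabulary, $n_w(d)$ is the count of feature $w$ in document $d$, $N_{wc}$ is the count of $w$ in training documents of class $c$, and $\alpha$ is a smoothing constant:

```latex
\hat{c}(d) = \arg\max_{c} \Big[ \log P(c) + \sum_{w \in V} n_w(d)\, \log P(w \mid c) \Big],
\qquad
P(w \mid c) = \frac{N_{wc} + \alpha}{\sum_{w' \in V} N_{w'c} + \alpha\,|V|}
```

Under this formulation the sum runs only over features seen in training, so a feature that appears only in the new data has no estimated $P(w \mid c)$ and would drop out of the score, while a training feature missing from a new document contributes a zero count.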
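So my first question is:
1. Is this error a property of the Naive Bayes algorithm, or was it a choice made by the author of the function?
Which leads to my second question:
2. How can I work around it?
For the second question, here is reproducible code (the last line produces the error above):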
library(quanteda)
library(magrittr)
library(data.table)
# Training texts: five statistics questions followed by five English-usage questions
train_text <- c("Can random effects apply only to categorical variables?",
                "ANOVA expectation identity",
                "Statistical test for significance in ranking positions",
                "Is Fisher Sharp Null Hypothesis testable?",
                "List major reasons for different results from survival analysis among different studies",
                "How do the tenses and aspects in English correspond temporally to one another?",
                "Is there a correct gender-neutral singular pronoun (“his” vs. “her” vs. “their”)?",
                "Are collective nouns always plural, or are certain ones singular?",
                "What’s the rule for using “who” and “whom” correctly?",
                "When is a gerund supposed to be preceded by a possessive adjective/determiner?")
# Class labels: 0 = statistics, 1 = English usage
train_class <- factor(c(rep(0,5), rep(1,5)))
# Build the training document-feature matrix and fit the model
train_dfm <- train_text %>%
  dfm(tolower=TRUE, stem=TRUE, remove=stopwords("english"))
model <- textmodel_NB(train_dfm, train_class)
# New texts, preprocessed the same way as the training texts
test_text <- c("Weighted Linear Regression with Proportional Standard Deviations in R",
               "What do significance tests for adjusted means tell us?",
               "How should I punctuate around quotes?",
               "Should I put a comma before the last item in a list?")
test_dfm <- test_text %>%
  dfm(tolower=TRUE, stem=TRUE, remove=stopwords("english"))
predict(model, test_dfm) # this line produces the error
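The only thing I have thought to do is make the features the same by hand (I assume this would fill in 0 for features that are not present in an object), but this produces a new error. The code for the example above is: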
model_features <- model$data$x@Dimnames$features # gets the features of the training data
test_features <- test_dfm@Dimnames$features # gets the features of the test data
all_features <- c(model_features, test_features) %>% # combining the two sets of features...
  subset(!duplicated(.)) # ...and getting rid of duplicate features
model$data$x@Dimnames$features <- test_dfm@Dimnames$features <- all_features # replacing features of model and test_dfm with all_features
predict(model, dfm) # new error?
However, this code generates a new error:
Error in if (ncol(object$PcGw) != ncol(newdata)) stop("feature set in newdata different from that in training set") :
argument is of length zero
How can I apply this Naive Bayes model to a new dataset with different features?
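To illustrate what I mean by making the features the same, here is a minimal base-R sketch on plain count matrices rather than dfm objects. The helper `align_features()` and the toy feature names and counts are hypothetical, made up for this illustration. They are not part of quanteda, and I do not know whether the package offers an equivalent.

```r
# Hypothetical helper (not a quanteda function): conform a test count matrix
# to the training feature set. Training features that are absent from the
# test data are padded with zeros; test-only features are dropped.
align_features <- function(test_mat, train_features) {
  aligned <- matrix(0,
                    nrow = nrow(test_mat),
                    ncol = length(train_features),
                    dimnames = list(rownames(test_mat), train_features))
  # Copy counts only for features that both sets have in common
  shared <- intersect(train_features, colnames(test_mat))
  aligned[, shared] <- test_mat[, shared, drop = FALSE]
  aligned
}

# Toy data: made-up stemmed features and counts, for illustration only
train_features <- c("test", "signific", "rank", "noun", "plural")
test_mat <- matrix(c(1, 0, 1,
                     0, 1, 0),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("doc1", "doc2"),
                                   c("signific", "comma", "test")))

aligned <- align_features(test_mat, train_features)
print(aligned)
```

The result has exactly the training columns in the training order, which is the shape I would expect `predict()` to accept. What I am missing is the supported way to do the same thing for a dfm.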