I'm classifying new Reddit data with an SVM trained via the e1071 library. My process so far has been:
- Label data with three categories (positive, neutral, negative)
- Train an SVM with the e1071 library.
- Pull new data from Reddit the next day and attempt to classify it with predict()
# Model trained on the labeled data
svm_model <- e1071::svm(
  x = train_dfm,
  y = train_label,
  type = "C-classification",
  kernel = "linear"
)
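Side note on what the model stores: by default e1071::svm() centers and scales every column of x and records a logical flag per training column in the fitted object's $scaled component; predict() later uses that vector to index newdata, which is exactly the subscript that fails in the traceback further down. A minimal sketch with made-up toy data (toy_x / toy_y are hypothetical names, not from my script):

```r
library(e1071)
# svm() scales columns by default and records, per training column,
# whether it was scaled; predict() indexes newdata with this vector.
toy_x <- matrix(rnorm(40), ncol = 4,
                dimnames = list(NULL, paste0("f", 1:4)))
toy_y <- factor(rep(c("pos", "neg"), each = 5))
fit <- e1071::svm(x = toy_x, y = toy_y,
                  type = "C-classification", kernel = "linear")
length(fit$scaled)   # one logical entry per training column
```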
# Grab new data from Reddit and read it in
new_data <- read.csv("todays_reddit_data.csv",
                     stringsAsFactors = FALSE)
# Create a dfm (quanteda >= 3 builds a dfm from tokens)
new_corp <- corpus(new_data, text_field = "text")
new_dfm <- as.matrix(dfm(tokens(new_corp)))
# Error here
pred <- predict(svm_model, new_dfm)
Error in newdata[, object$scaled, drop = FALSE]: (subscript) logical subscript too long
Traceback:
1. predict(svm_model, new_dfm)
2. predict.svm(svm_model, new_dfm)
3. scale_data_frame(newdata[, object$scaled, drop = FALSE], center = object$x.scale$"scaled:center",
. scale = object$x.scale$"scaled:scale")
4. is.data.frame(x)
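For anyone puzzled by the message itself, here is a minimal base-R reproduction of the failing subscript, assuming (per the traceback) that object$scaled is a logical vector with one entry per training column:

```r
# predict.svm indexes newdata with object$scaled, a logical vector whose
# length equals the number of training columns. When newdata has fewer
# columns than that, base R raises the same error:
new_mat <- matrix(1:6, nrow = 2)        # 3 columns, fewer than "training"
scaled  <- rep(TRUE, 5)                 # pretend training had 5 features
try(new_mat[, scaled, drop = FALSE])    # "(subscript) logical subscript too long"
```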
I get this error from predict(). I understand it is due to the new Reddit data containing tokens/features the model never saw, but I don't understand how to remedy it. I tried dropping the unseen features like this, but got the same error:
to_keep <- intersect(colnames(new_dfm), colnames(train_dfm))
subset_new_data <- new_dfm[, to_keep]
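My current suspicion is that predict.svm() needs newdata to have exactly the training columns, in the training order; intersect() also drops training-only features, so object$scaled no longer lines up with the columns. A sketch of the alignment that I believe should work, using quanteda's dfm_match() (assuming colnames(train_dfm) holds the training features, and reusing new_corp from above):

```r
library(quanteda)
# dfm_match() forces the new dfm onto the training feature set:
# shared features are kept, training features absent from the new data
# are zero-filled, unseen tokens are dropped, and the column order
# matches the training matrix.
new_dfm_q <- dfm(tokens(new_corp))
matched   <- dfm_match(new_dfm_q, features = colnames(train_dfm))
pred      <- predict(svm_model, as.matrix(matched))
```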