
I'm working on classifying new Reddit data with an SVM model trained using the e1071 library. My process so far has been:

  1. Label data with 3 categories (positive, neutral, negative)
  2. Train an SVM using the e1071 library (training dfm construction sketched below).
  3. Pull new data from Reddit the next day and attempt to classify it with predict()
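
For context, the train_dfm and train_label objects used below were built roughly like this (a simplified sketch; the file name and the 'label' column are placeholders for my actual data):

library(quanteda)

# Labeled training data; 'labeled_reddit_data.csv' and the 'label' column
# are placeholder names for my real file/columns
train <- read.csv('labeled_reddit_data.csv', stringsAsFactors = FALSE)
train_corp <- corpus(train, text_field = "text")
train_dfm <- as.matrix(dfm(train_corp))
train_label <- factor(train$label)
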
# Model trained on the labeled data
svm_model <- e1071::svm(
  x = train_dfm,
  y = train_label,
  type = 'C-classification',
  kernel = 'linear'
)

# Grab new data from Reddit and read it in
new_data <- read.csv('todays_reddit_data.csv',
                     stringsAsFactors = FALSE)

# Create a dfm from the new data (corpus() and dfm() come from quanteda)
new_corp <- corpus(new_data, text_field = "text")
new_dfm <- as.matrix(dfm(new_corp))

# Error here
pred <- predict(svm_model, new_dfm)

Error in newdata[, object$scaled, drop = FALSE]: (subscript) logical subscript too long
Traceback:

1. predict(svm_model, new_dfm)
2. predict.svm(svm_model, new_dfm)
3. scale_data_frame(newdata[, object$scaled, drop = FALSE], center = object$x.scale$"scaled:center", 
 .     scale = object$x.scale$"scaled:scale")
4. is.data.frame(x)


The predict() call above is where I get the error. I understand that it is caused by the new Reddit data containing tokens/features that weren't in the training data, but I really don't understand how to remedy it. I tried dropping the unseen features like this, but passing subset_new_data to predict() gives the same error:

# Keep only the features that also appear in the training dfm
to_keep <- intersect(colnames(new_dfm), colnames(train_dfm))
subset_new_data <- new_dfm[, to_keep]
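
What I think predict() actually needs is a matrix with exactly the same features as train_dfm, in the same column order, with zeros for any feature the new data doesn't contain. Here's a rough, untested sketch of that idea (aligned and common are just names I'm using for illustration):

# Build a zero matrix with exactly the training features, in the training
# column order, then copy over the counts for the features the new data shares
aligned <- matrix(0,
                  nrow = nrow(new_dfm),
                  ncol = ncol(train_dfm),
                  dimnames = list(rownames(new_dfm), colnames(train_dfm)))
common <- intersect(colnames(new_dfm), colnames(train_dfm))
aligned[, common] <- new_dfm[, common]

pred <- predict(svm_model, aligned)

Is that the right approach, or is there a quanteda function (dfm_match()?) that is meant for this?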