
I'm working on a multi-class text classification project and I need to build the document/term matrices and train and test the model in R.

My datasets exceed the limited dimensionality of R's base matrix class, so I need to build big sparse matrices to be able to classify, for example, 100k tweets. I am using the quanteda package, since it has so far been more useful and reliable than tm, where creating a DocumentTermMatrix with a dictionary makes the process incredibly memory hungry even with small datasets. Currently, as I said, I use quanteda to build the equivalent document-term matrix container, which I later transform into a data.frame to perform the training.
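For concreteness, a minimal sketch of that workflow with the current quanteda API, assuming a hypothetical data frame `tweets` with `text` and `label` columns (the names are illustrative, not from my actual data):

    library(quanteda)

    # hypothetical input: a data.frame `tweets` with columns `text` and `label`
    corp <- corpus(tweets, text_field = "text")

    # tokenize and build a sparse document-feature matrix (dfm)
    toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
    toks <- tokens_remove(toks, stopwords("en"))
    dtm  <- dfm(toks)
    dtm  <- dfm_trim(dtm, min_docfreq = 5)  # drop rare terms to keep the matrix small

    dim(dtm)  # documents x features, stored sparsely

    # the step that hurts: converting to a data.frame materialises every zero,
    # which is where memory blows up on ~100k documents
    df <- convert(dtm, to = "data.frame")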

I want to know if there is a way to build such big matrices. I have been reading about the bigmemory package, which provides this kind of container, but I am not sure it will work with caret for the later classification. Overall, I want to understand the problem and build a workaround so I can work with bigger datasets. RAM is not a (big) problem (32 GB), but I feel completely lost about how to approach this.


1 Answer


At what point did you hit RAM constraints?

quanteda is a good package for NLP work on medium-sized datasets. But I also suggest trying my text2vec package. It is generally quite memory friendly and doesn't require loading all the raw text into RAM (for example, it can create a DTM for a Wikipedia dump on a 16 GB laptop).
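A minimal text2vec sketch of building a sparse DTM, assuming a hypothetical character vector `texts` (the name is illustrative); it follows the package's standard itoken/vectorizer pipeline:

    library(text2vec)

    # hypothetical input: a character vector `texts`, one document per element
    it <- itoken(texts,
                 preprocessor = tolower,
                 tokenizer = word_tokenizer,
                 progressbar = FALSE)

    vocab <- create_vocabulary(it)
    vocab <- prune_vocabulary(vocab, term_count_min = 5)  # drop very rare terms

    vectorizer <- vocab_vectorizer(vocab)
    dtm <- create_dtm(it, vectorizer)  # a sparse dgCMatrix, never densified

    dim(dtm)
    class(dtm)  # "dgCMatrix"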

The second point is that I strongly recommend not converting the data into a data.frame. Try to work with sparseMatrix objects directly.

The following methods will work well for text classification:

  1. Logistic regression with an L1 penalty (see the glmnet package); a sketch follows after this list.
  2. Linear SVM (see LiblineaR, but it is worth searching for alternatives).
  3. It is also worth trying xgboost. I would prefer linear models, so you can try its linear booster.
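Here is a minimal sketch of option 1, assuming `dtm` is a sparse document-term matrix (a dgCMatrix, e.g. from text2vec) and `labels` is a factor of document classes; both names are illustrative:

    library(glmnet)

    # hypothetical objects: `dtm` is a sparse dgCMatrix document-term matrix,
    # `labels` is a factor giving each document's class
    set.seed(1)
    train_idx <- sample(nrow(dtm), floor(0.8 * nrow(dtm)))

    # L1-penalised multinomial logistic regression; glmnet accepts sparse
    # matrices directly, so the DTM never has to be densified
    fit <- cv.glmnet(x = dtm[train_idx, ],
                     y = labels[train_idx],
                     family = "multinomial",
                     alpha = 1,               # pure lasso penalty
                     type.measure = "class",
                     nfolds = 5)

    pred <- predict(fit, dtm[-train_idx, ], s = "lambda.min", type = "class")
    mean(pred == labels[-train_idx])          # held-out accuracy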
answered 2016-08-04T08:06:50.937