I would like to train the word2vec model on my own corpus using the rword2vec
package in R.
The word2vec function that is used to train the model requires a train_file argument. The package's documentation simply notes that this is the training text data, but doesn't specify how that file should be created.
The training data used in the example on GitHub can be downloaded here: http://mattmahoney.net/dc/text8.zip. I can't figure out what type of file it is.
I've looked through the README file on the rword2vec GitHub page and checked out the official word2vec page on Google Code.
My corpus is a .csv file with about 68,000 documents; the file is roughly 300 MB. I realize that training the model on a corpus of this size might take a long time (or be infeasible), but I'm willing to train it on a subset of the corpus. I just don't know how to create the train_file required by the function.
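My best guess is that train_file is just a path to a plain-text file of whitespace-separated words (the unzipped text8 file looks like one long line of lowercase text). Here is what I was planning to try; the file name "corpus.csv" and the column name "text" are placeholders for my own data, and I'm not sure the cleanup step is what the package expects:

```r
# Sketch, assuming train_file is plain text with space-separated words.
library(rword2vec)

# Read the corpus; "corpus.csv" / column "text" are placeholders
docs <- read.csv("corpus.csv", stringsAsFactors = FALSE)

# Rough cleanup modeled on text8: lowercase, strip non-letters,
# and optionally subset to keep training time manageable
clean <- tolower(docs$text[1:10000])
clean <- gsub("[^a-z ]", " ", clean)

# One document per line in the training file
writeLines(clean, "train.txt")

# Train on the plain-text file (arguments per the package's README example)
model <- word2vec(train_file = "train.txt", output_file = "vec.bin")
```

Is a plain-text file like this actually what the function expects, or does the text need some other structure or preprocessing first?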