
I would like to train the word2vec model on my own corpus using the rword2vec package in R.

The word2vec function that is used to train the model requires a train_file. The package's documentation in R simply notes that this is the training text data, but doesn't specify how it can be created.

The training data used in the example on GitHub can be downloaded here: http://mattmahoney.net/dc/text8.zip. I can't figure out what type of file it is.

I've looked through the README file on the rword2vec GitHub page and checked out the official word2vec page on Google Code.

My corpus is a .csv file with about 68,000 documents. File size is roughly 300MB. I realize that training the model on a corpus of this size might take a long time (or be infeasible), but I'm willing to train it on a subset of the corpus. I just don't know how to create the train_file required by the function.


1 Answer


After unzipping text8, you can open it with a text editor. You'll see that it's one very long document. You need to decide how many of your 68,000 documents to use for training, and whether to concatenate them together or keep them as separate documents. See https://datascience.stackexchange.com/questions/11077/using-several-documents-with-word2vec
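As a rough sketch, you could build a text8-style train_file from your CSV like this. The column name "text" and the file names are assumptions here; adjust them to your data. The demo CSV at the top is only there to make the snippet self-contained — replace it with your actual file.

```r
# For illustration only: create a tiny demo CSV standing in for your corpus.
# (Replace this block with your real 300MB file.)
demo <- data.frame(text = c("First document, with punctuation!",
                            "Second document follows."),
                   stringsAsFactors = FALSE)
write.csv(demo, "corpus.csv", row.names = FALSE)

# Read the corpus; "text" is the assumed name of the document column.
corpus <- read.csv("corpus.csv", stringsAsFactors = FALSE)

# Optionally subset to keep training time manageable, e.g. the first n rows.
docs <- head(corpus$text, 10000)

# Light cleanup mirroring text8: lowercase, punctuation replaced by spaces.
docs <- gsub("[[:punct:]]", " ", tolower(docs))

# Concatenate everything into one long document (as text8 is) and
# write it out as plain text; this file can be passed as train_file.
writeLines(paste(docs, collapse = " "), "train_file.txt")
```

You would then point `word2vec(train_file = "train_file.txt", ...)` at the result; if you'd rather keep documents separate, write one document per line instead of collapsing them.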

answered 2020-06-30T15:59:22.377