nlp - Command-line parameters in word2vec

Question

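I want to use word2vec to build my own word vectors from the current version of the English Wikipedia, but I can't find an explanation of the program's command-line parameters. The demo script contains the following:
(text8 is an old Wikipedia corpus from 2006)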

make
if [ ! -e text8 ]; then
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
fi
time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
./distance vectors.bin

What do these command-line parameters mean?
vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15

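What are the most suitable values when I have a Wikipedia text corpus of about 20 GB (a .txt file)? I have read that for larger corpora a vector size of 300 or 500 would be better.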

score 2 · Accepted Answer

You can check main() in word2vec.c, which prints an explanation of each option, as follows:

printf("WORD VECTOR estimation toolkit v 0.1c\n\n");
printf("Options:\n");
printf("Parameters for training:\n");
printf("\t-train <file>\n");
printf("\t\tUse text data from <file> to train the model\n");
...

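As for the most suitable values, I'm sorry, I don't know the answer, but you can find some hints in the "Performance" paragraph of the source site (Word2Vec - Google Code). It says: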

 - architecture: skip-gram (slower, better for infrequent words) vs CBOW (fast)
 - the training algorithm: hierarchical softmax (better for infrequent words) vs negative sampling (better for frequent words, better with low dimensional vectors)
 - sub-sampling of frequent words: can improve both accuracy and speed for large data sets (useful values are in range 1e-3 to 1e-5)
 - dimensionality of the word vectors: usually more is better, but not always
 - context (window) size: for skip-gram usually around 10, for CBOW around 5
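As a rough illustration of how the hints above could translate into a command for a larger corpus, here is a minimal Python sketch that assembles the argument list. The file name `enwiki.txt` and every default value below (skip-gram, 300 dimensions, window 10, sub-sampling at 1e-5, and so on) are assumptions picked from the ranges quoted above and from the question, not tested recommendations; the flags themselves are the ones used in the demo script.

```python
import shlex


def build_word2vec_command(train_file, output_file, *, cbow=0, size=300,
                           window=10, negative=10, hs=0, sample=1e-5,
                           threads=20, binary=1, iterations=5):
    """Assemble an argv list for the ./word2vec binary used in the demo script.

    The defaults are illustrative starting points only (assumptions),
    loosely following the hints: skip-gram with a window of about 10,
    negative sampling, and sub-sampling within the 1e-3 to 1e-5 range.
    """
    options = {
        "-train": train_file,
        "-output": output_file,
        "-cbow": cbow,          # 0 selects skip-gram, 1 selects CBOW
        "-size": size,          # dimensionality of the word vectors
        "-window": window,      # context window size
        "-negative": negative,  # number of negative samples
        "-hs": hs,              # 1 enables hierarchical softmax
        "-sample": sample,      # sub-sampling threshold for frequent words
        "-threads": threads,
        "-binary": binary,      # 1 writes the vectors in binary format
        "-iter": iterations,
    }
    argv = ["./word2vec"]
    for flag, value in options.items():
        argv += [flag, str(value)]
    return argv


# "enwiki.txt" is a hypothetical name for the 20 GB plain-text dump.
cmd = build_word2vec_command("enwiki.txt", "vectors.bin")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd)`. Trying a few settings on a small slice of the corpus first is probably cheaper than committing to one configuration for the full 20 GB.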

score 2 · Accepted Answer

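Meaning of the parameters:

-train text8: the corpus on which the model will be trained

-output vectors.bin: the file the learned vectors are saved to so they can be loaded and used later (in binary format here, because of -binary 1)

-cbow 1: enables the "continuous bag of words" option

-size 200: each word's vector will be represented by 200 values

If you are new to word2vec, you can use gensim, a Python implementation of it.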
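For readers going the gensim route, the sketch below shows one way the demo script's flags might map onto gensim keyword arguments. It assumes gensim 4.x, where the relevant parameters are named `vector_size`, `window`, `sg`, `negative`, `hs`, `sample`, `workers` and `epochs` (older releases used `size` and `iter`). The helper names `gensim_kwargs` and `train` are hypothetical, introduced only for this illustration, and `train` is defined but not called here.

```python
# Flags of the C tool and the gensim 4.x keyword each one corresponds to.
CLI_TO_GENSIM = {
    "-size": ("vector_size", int),
    "-window": ("window", int),
    "-negative": ("negative", int),
    "-hs": ("hs", int),
    "-sample": ("sample", float),
    "-threads": ("workers", int),
    "-iter": ("epochs", int),
}


def gensim_kwargs(cli_args):
    """Translate a '-flag value' string into gensim Word2Vec keyword arguments.

    Flags with no training keyword (-train, -output, -binary) are skipped;
    they are handled separately when loading and saving.
    """
    tokens = cli_args.split()
    pairs = dict(zip(tokens[0::2], tokens[1::2]))
    kwargs = {}
    for flag, value in pairs.items():
        if flag == "-cbow":
            # gensim expresses the same choice inverted: sg=1 means skip-gram.
            kwargs["sg"] = 0 if int(value) == 1 else 1
        elif flag in CLI_TO_GENSIM:
            name, cast = CLI_TO_GENSIM[flag]
            kwargs[name] = cast(value)
    return kwargs


def train(corpus_path, output_path, cli_args, binary=True):
    """Hypothetical helper: train with gensim and save in word2vec format."""
    # Imported here so the mapping above can be used without gensim installed.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    model = Word2Vec(sentences=LineSentence(corpus_path),
                     **gensim_kwargs(cli_args))
    model.wv.save_word2vec_format(output_path, binary=binary)
    return model


demo_args = ("-cbow 1 -size 200 -window 8 -negative 25 -hs 0 "
             "-sample 1e-4 -threads 20 -binary 1 -iter 15")
kwargs = gensim_kwargs(demo_args)
print(kwargs)
```

A call such as `train("text8", "vectors.bin", demo_args)` would then roughly mirror the demo script, though results are not guaranteed to match the C tool exactly.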
