Questions tagged [word2vec]
python - 'file' object has no attribute 'rfind'
I am trying to save a word2vec model to a file.
I get the following error in genericpath.py
Where am I going wrong?
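A minimal sketch of the usual cause, assuming gensim's Word2Vec is being used: save() expects a filename string, and passing an open file object makes the os.path string helpers (which call rfind) fail inside genericpath.py. The corpus and file name below are purely illustrative.

    from gensim.models import Word2Vec

    # Toy corpus: a list of tokenized sentences.
    sentences = [["hello", "world"], ["word2vec", "example"]]
    model = Word2Vec(sentences, vector_size=50, min_count=1)

    # Pass a path string, not an open file object.
    model_path = "my_word2vec.model"   # hypothetical path
    model.save(model_path)

    # This pattern is what raises "'file' object has no attribute 'rfind'":
    # model.save(open(model_path, "wb"))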
nlp - word2vec lemmatization of corpus before training
Word2vec seems to be mostly trained on raw corpus data. However, lemmatization is a standard preprocessing for many semantic similarity tasks. I was wondering if anybody had experience in lemmatizing the corpus before training word2vec and if this is a useful preprocessing step to do.
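A small sketch of what that preprocessing step might look like, assuming NLTK's WordNetLemmatizer (with its wordnet data downloaded) and gensim; the corpus and parameters are illustrative only.

    from nltk.stem import WordNetLemmatizer
    from gensim.models import Word2Vec

    lemmatizer = WordNetLemmatizer()

    raw_sentences = [
        ["the", "cats", "were", "chasing", "mice"],
        ["a", "cat", "chased", "a", "mouse"],
    ]

    # Lemmatize every token (treated as a noun by default) so that surface
    # variants like "cats"/"cat" and "mice"/"mouse" map to a single vector.
    lemmatized = [[lemmatizer.lemmatize(tok) for tok in sent]
                  for sent in raw_sentences]

    model = Word2Vec(lemmatized, vector_size=50, min_count=1)
    print(model.wv.most_similar("cat", topn=3))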
python - Reading word vectors trained by Mikolov's word2vec in Python
I am trying to read a binary file here. The file contains word representations trained by Mikolov's word2vec program, in the following format:
The first 12 bytes contain the string: "3000000 300\n"
Subsequent bytes: "<1st variable word string>[space]<4*300 bytes to form 300 dimension float vector> [May be something there] <2nd word>....<3000000th word>[space]<4*300 bytes>"
Using this C code:
I can read each word into buff and the corresponding vector into M. But when I try the same strategy in Python with this test code:
It produces this result:
3000000 300
</s>
True
in True
for True
; False
which is obviously wrong, because the third word should be that. I can't figure out what I'm doing wrong here!
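For reference, here is a minimal sketch of reading that binary layout in Python, assuming numpy is available; the file name is the usual GoogleNews download and is only illustrative. The key detail is reading the word byte by byte up to the separating space before pulling the 4*300-byte float32 vector.

    import numpy as np

    def read_word2vec_bin(path, max_words=5):
        """Print the first few (word, vector) entries of a Mikolov-format .bin file."""
        with open(path, "rb") as f:
            vocab_size, dim = map(int, f.readline().split())  # header: b"3000000 300\n"
            vec_bytes = 4 * dim                                # float32 vector payload
            for _ in range(min(max_words, vocab_size)):
                chars = []
                while True:                                    # word runs up to a space
                    ch = f.read(1)
                    if ch == b" ":
                        break
                    if ch == b"":                              # unexpected end of file
                        return
                    if ch != b"\n":                            # tolerate stray newlines
                        chars.append(ch)
                word = b"".join(chars).decode("utf-8", errors="replace")
                vec = np.frombuffer(f.read(vec_bytes), dtype=np.float32)
                print(word, vec[:3])

    read_word2vec_bin("GoogleNews-vectors-negative300.bin")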
python - python word2vec not installing
I have been trying to install word2vec on my Windows 7 machine with my Python 2.7 interpreter: https://github.com/danielfrg/word2vec
I tried downloading the zip and running python setup.py install from the unzipped directory, and also running pip install. In both cases it returns the following error:
There seems to be a problem with the call to subprocess.call(), so after some Googling I managed to add shell=True to that line in word2vec's setup.py, which then throws this error:
Honestly, I'm not even sure where to go from here. I also tried installing make and pointing my path variable at the .exe files from that install. Any suggestions would be greatly appreciated, thanks.
Update:
Although the word2vec module won't run, a package called gensim seems to work fine, and it also has some other great NLP features: http://radimrehurek.com/gensim/
python - Keeping objects loaded so they can be used by another Python program
I am using word2vec to compute the similarity between two words, and for that I use the GoogleNews model. The model is huge, so it takes a long time to load.
I would like to load it once and keep it in a variable/object, so that whenever I run a Python program I can just call it.
How can this be done? Any ideas?
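A hedged sketch of one common workaround, assuming gensim: convert the GoogleNews file to gensim's native format once, then memory-map it on every later run, so reloads are fast and the mapped arrays can be shared between processes. File names are illustrative.

    from gensim.models import KeyedVectors

    # One-time conversion: the slow load of the original binary file, followed
    # by a save in gensim's native format.
    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)
    vectors.save("googlenews.kv")

    # In every later run: a memory-mapped load is far faster than re-parsing
    # the binary file, and the OS can share the mapped pages across processes.
    vectors = KeyedVectors.load("googlenews.kv", mmap="r")
    print(vectors.similarity("king", "queen"))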
python - MemoryError when reading a word2vec data file into Python
I am trying to use word2vec on Windows 7. I have 24 GB of RAM and an i7 processor, and I am using 64-bit Python. I am trying to follow Radim's tutorial, and I want to access the vectors in the 3-billion-word Google file provided on the original word2vec page. When I run the line:
I get the following error:
I don't know how to fix this, since the file is only 1.3 GB and I have plenty of free memory.
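A hedged sketch of one way to keep memory under control, assuming the file is being loaded with gensim as in Radim's tutorial: the limit argument loads only the most frequent entries (the full 3,000,000 x 300 float32 matrix alone needs roughly 3.4 GB once expanded in memory). The file name and limit value are illustrative.

    from gensim.models import KeyedVectors

    # Load only the 500,000 most frequent vectors; drop `limit` once the full
    # matrix fits comfortably in memory.
    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True, limit=500000)
    print(vectors["computer"][:5])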
nlp - Why is the similarity between two bags of words computed this way in gensim.word2vec?
Here is a piece of code I excerpted from gensim.word2vec. I understand that the similarity between two words can be computed with cosine distance, but what about two sets of words? The code seems to take the mean of the word vectors in each set and then compute the cosine distance between the two mean vectors. I know little about word2vec; is there some basis for such a procedure?
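A toy sketch of the procedure described above (average each bag of vectors, then take the cosine of the two averages), using made-up numpy vectors rather than a real model:

    import numpy as np

    def bag_similarity(vecs_a, vecs_b):
        """Cosine similarity between the mean vectors of two bags of word vectors."""
        mean_a = np.mean(vecs_a, axis=0)
        mean_b = np.mean(vecs_b, axis=0)
        return float(np.dot(mean_a, mean_b) /
                     (np.linalg.norm(mean_a) * np.linalg.norm(mean_b)))

    # Toy 3-dimensional "word vectors" standing in for real embeddings.
    bag1 = [np.array([1.0, 0.0, 0.0]), np.array([0.9, 0.1, 0.0])]
    bag2 = [np.array([0.8, 0.2, 0.0])]
    print(bag_similarity(bag1, bag2))

Averaging treats a bag of words as a single point in the embedding space, which is why a single cosine between the two averages can serve as a set-to-set similarity.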
machine-learning - Word2Vec: number of dimensions
I am using Word2Vec with a dataset of roughly 11,000,000 tokens and want to use it for word similarity (as part of synonym extraction for a downstream task), but I don't know how many dimensions I should use with Word2Vec. Does anyone have a good heuristic for the range of dimensions to consider, given the number of tokens/sentences?
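If it helps to see where that knob lives, here is a small sketch, assuming gensim, that trains a few models over a hypothetical range of sizes so they can be compared on a similarity or synonym-extraction evaluation; the corpus and values shown are placeholders.

    from gensim.models import Word2Vec

    # Placeholder corpus; in practice this would be the tokenized ~11M-token dataset.
    sentences = [["synonym", "extraction", "needs", "good", "vectors"],
                 ["word", "similarity", "is", "the", "downstream", "task"]]

    # Try a handful of sizes and keep the one that scores best on the evaluation.
    for dim in (100, 200, 300):
        model = Word2Vec(sentences, vector_size=dim, window=5, min_count=1, workers=4)
        model.save("w2v_{}.model".format(dim))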
text - How can vector representations of words (obtained from Word2Vec etc.) be used as features for a classifier?
I am familiar with using BOW features for text classification, where we first find the size of the corpus vocabulary, which becomes the length of our feature vector. For each sentence/document, and all of its constituent words, we then put a 0/1 depending on the absence/presence of that word in the sentence/document.
However, now that I am trying to use a vector representation of each word, is creating a global vocabulary still essential?
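A minimal sketch of one common approach, assuming gensim and scikit-learn: represent each document by the average of its word vectors, so no global 0/1 vocabulary vector is needed. The toy corpus, labels, and parameters are illustrative.

    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.linear_model import LogisticRegression

    # Toy documents and labels; a real task would use its own corpus or
    # pretrained vectors.
    docs = [["good", "movie"], ["bad", "movie"], ["great", "film"], ["awful", "film"]]
    labels = [1, 0, 1, 0]

    w2v = Word2Vec(docs, vector_size=50, min_count=1)

    def doc_vector(tokens, model):
        """Average the vectors of the tokens the model knows about."""
        vecs = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

    X = np.vstack([doc_vector(d, w2v) for d in docs])
    clf = LogisticRegression().fit(X, labels)
    print(clf.predict(X))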
python - Load PreComputed Vectors Gensim
I am using the Gensim Python package to learn a neural language model, and I know that you can provide a training corpus to learn the model. However, there already exist many precomputed word vectors available in text format (e.g. http://www-nlp.stanford.edu/projects/glove/). Is there some way to initialize a Gensim Word2Vec model that just makes use of some precomputed vectors, rather than having to learn the vectors from scratch?
Thanks!
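A hedged sketch, assuming the precomputed vectors are only needed for lookups and similarity queries rather than further training: gensim can read them into a KeyedVectors object directly. File names are illustrative, and GloVe files lack the word2vec header line, so they need either the no_header option of recent gensim releases or a one-off glove2word2vec conversion.

    from gensim.models import KeyedVectors

    # Vectors already in word2vec text format (first line: "<vocab_size> <dim>").
    kv = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

    # GloVe text files have no header line; recent gensim versions read them with
    # no_header=True (older versions need gensim.scripts.glove2word2vec first).
    glove = KeyedVectors.load_word2vec_format("glove.6B.100d.txt",
                                              binary=False, no_header=True)
    print(glove.most_similar("king", topn=3))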