68

From the word2vec site I can download GoogleNews-vectors-negative300.bin.gz. The .bin file (about 3.4GB) is a binary format not useful to me. Tomas Mikolov assures us that "It should be fairly straightforward to convert the binary format to text format (though that will take more disk space). Check the code in the distance tool, it's rather trivial to read the binary file." Unfortunately, I don't know enough C to understand http://word2vec.googlecode.com/svn/trunk/distance.c.

Supposedly gensim can do this also, but all the tutorials I've found seem to be about converting from text, not the other way.

Can someone suggest modifications to the C code or instructions for gensim to emit text?

4

10 回答 10

93

我使用此代码加载二进制模型,然后将模型保存到文本文件,

from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)

参考:APInullege

笔记:

以上代码适用于版本的 gensim。对于以前的版本,我使用了以下代码

from gensim.models import word2vec

model = word2vec.Word2Vec.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)
于 2015-10-17T06:36:16.697 回答
20

在 word2vec-toolkit 邮件列表中,Thomas Mensink 以一个小型 C 程序的形式提供了一个答案,该程序会将 .bin 文件转换为文本。这是对 distance.c 文件的修改。我用下面的 Thomas 代码替换了原始的 distance.c 并重建了 word2vec(make clean; make),并将编译后的距离重命名为 readbin。然后./readbin vector.bin将创建一个文本版本的vector.bin。

//  Copyright 2013 Google Inc. All Rights Reserved.
//
//  Licensed under the Apache License, Version 2.0 (the "License");
//  you may not use this file except in compliance with the License.
//  You may obtain a copy of the License at
//
//      http://www.apache.org/licenses/LICENSE-2.0
//
//  Unless required by applicable law or agreed to in writing, software
//  distributed under the License is distributed on an "AS IS" BASIS,
//  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
//  See the License for the specific language governing permissions and
//  limitations under the License.

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <malloc.h>

const long long max_size = 2000;         // max length of strings
const long long N = 40;                  // number of closest words that will be shown
const long long max_w = 50;              // max length of vocabulary entries

int main(int argc, char **argv) {
  FILE *f;
  char file_name[max_size];
  float len;
  long long words, size, a, b;
  char ch;
  float *M;
  char *vocab;
  if (argc < 2) {
    printf("Usage: ./distance <FILE>\nwhere FILE contains word projections in the BINARY FORMAT\n");
    return 0;
  }
  strcpy(file_name, argv[1]);
  f = fopen(file_name, "rb");
  if (f == NULL) {
    printf("Input file not found\n");
    return -1;
  }
  fscanf(f, "%lld", &words);
  fscanf(f, "%lld", &size);
  vocab = (char *)malloc((long long)words * max_w * sizeof(char));
  M = (float *)malloc((long long)words * (long long)size * sizeof(float));
  if (M == NULL) {
    printf("Cannot allocate memory: %lld MB    %lld  %lld\n", (long long)words * size * sizeof(float) / 1048576, words, size);
    return -1;
  }
  for (b = 0; b < words; b++) {
    fscanf(f, "%s%c", &vocab[b * max_w], &ch);
    for (a = 0; a < size; a++) fread(&M[a + b * size], sizeof(float), 1, f);
    len = 0;
    for (a = 0; a < size; a++) len += M[a + b * size] * M[a + b * size];
    len = sqrt(len);
    for (a = 0; a < size; a++) M[a + b * size] /= len;
  }
  fclose(f);
  //Code added by Thomas Mensink
  //output the vectors of the binary format in text
  printf("%lld %lld #File: %s\n",words,size,file_name);
  for (a = 0; a < words; a++){
    printf("%s ",&vocab[a * max_w]);
    for (b = 0; b< size; b++){ printf("%f ",M[a*size + b]); }
    printf("\b\b\n");
  }  

  return 0;
}

我从printf.

顺便说一句,生成的文本文件仍然包含文本单词和一些不必要的空格,我不希望这些空格用于一些数值计算。我使用 bash 命令从每行中删除了初始文本列和尾随空格。

cut --complement -d ' ' -f 1 GoogleNews-vectors-negative300.txt > GoogleNews-vectors-negative300_tuples-only.txt
sed 's/ $//' GoogleNews-vectors-negative300_tuples-only.txt
于 2014-12-06T06:49:04.930 回答
8

格式是 IEEE 754 单精度二进制浮点格式: binary32 http://en.wikipedia.org/wiki/Single-precision_floating-point_format 他们使用 little-endian。

举个例子:

  • 第一行是字符串格式:"3000000 300\n" (vocabSize & vecSize, getByte until byte=='\n')
  • 下一行首先包含词汇,然后是(300*4 字节的浮点值,每个维度 4 字节):

    getByte till byte==32 (space). (60 47 115 62 32 => <\s>[space])
    
  • 那么接下来的每个 4 字节将代表一个浮点数

    下一个 4 字节:0 0 -108 58 => 0.001129150390625。

您可以查看 wikipedia 链接以了解如何操作,让我以这个为例:

(little-endian -> 逆序) 00111010 10010100 00000000 00000000

  • 首先是符号位 => 符号 = 1 (else = -1)
  • 接下来 8 位 => 117 => exp = 2^(117-127)
  • 接下来的 23 位 => pre = 0*2^(-1) + 0*2^(-2) + 1*2^(-3) + 1*2^(-5)

值 = 符号 * exp * pre

于 2015-03-25T09:34:26.250 回答
8

您可以在 word2vec 中加载二进制文件,然后像这样保存文本版本:

from gensim.models import word2vec
 model = word2vec.Word2Vec.load_word2vec_format('Path/to/GoogleNews-vectors-negative300.bin', binary=True)
 model.save("file.txt")

`

于 2015-05-12T21:56:25.427 回答
7

我正在使用 gensim 与 GoogleNews-vectors-negative300.bin 一起工作,并且binary = True在加载模型时包含一个标志。

from gensim import word2vec

model = word2vec.Word2Vec.load_word2vec_format('Path/to/GoogleNews-vectors-negative300.bin', binary=True) 

似乎工作正常。

于 2015-02-06T07:39:14.770 回答
4

如果您收到错误:

ImportError: No module named models.word2vec

那是因为有一个 API 更新。这将起作用:

from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('./GoogleNews-vectors-negative300.txt', binary=False)
于 2017-04-02T00:13:37.520 回答
2

我有一个类似的问题,我想将 bin/non-bin(gensim) 模型输出为 CSV。

这是在 python 上执行此操作的代码,它假设您已安装 gensim:

https://gist.github.com/dav009/10a742de43246210f3ba

于 2015-02-19T11:32:34.357 回答
2

这是我使用的代码:

import codecs
from gensim.models import Word2Vec

def main():
    path_to_model = 'GoogleNews-vectors-negative300.bin'
    output_file = 'GoogleNews-vectors-negative300_test.txt'
    export_to_file(path_to_model, output_file)


def export_to_file(path_to_model, output_file):
    output = codecs.open(output_file, 'w' , 'utf-8')
    model = Word2Vec.load_word2vec_format(path_to_model, binary=True)
    print('done loading Word2Vec')
    vocab = model.vocab
    for mid in vocab:
        #print(model[mid])
        #print(mid)
        vector = list()
        for dimension in model[mid]:
            vector.append(str(dimension))
        #line = { "mid": mid, "vector": vector  }
        vector_str = ",".join(vector)
        line = mid + "\t"  + vector_str
        #line = json.dumps(line)
        output.write(line + "\n")
    output.close()

if __name__ == "__main__":
    main()
    #cProfile.run('main()') # if you want to do some profiling
于 2015-09-01T18:34:37.383 回答
2

convertvec是一个用于在 word2vec 库的不同格式之间转换向量的小工具。

将向量从二进制转换为纯文本:

./convertvec bin2txt input.bin output.txt

将向量从纯文本转换为二进制:

./convertvec txt2bin input.txt output.bin

于 2016-09-19T06:22:44.760 回答
0

只是一个快速更新,因为现在有更简单的方法。

如果您word2vechttps://github.com/dav/word2vec使用,还有一个名为-binarywhich accept的附加选项1可以生成二进制文件或0生成文本文件。这个例子来自demo-word.shrepo:

time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 0 -iter 15

于 2016-10-19T09:31:57.110 回答