
I need to calculate the word vectors for each word of a sentence that is tokenized as follows:

['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin']. 

If I use the pretrained [fastText][1] embeddings cc.en.300.bin.gz from Facebook, I can get vectors for OOV words. However, when I use Google's word2vec from GoogleNews-vectors-negative300.bin, it raises a KeyError. My question is: how do we calculate word vectors for OOV words in that case? I searched online but could not find anything. Of course, one way is to remove every sentence that contains a word not listed in Google's word2vec; however, I noticed only 5550 out of 16134 sentences have all of their words in the embedding.

I also tried:

model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/My Drive/Colab Notebooks/GoogleNews-vectors-negative300.bin', binary=True) 
model.train(sentences_with_OOV_words)

However, this raises an error under TensorFlow 2.

Any help would be greatly appreciated.


3 Answers


If a word is not found in the vocabulary, initialize it with a zero vector of the same size (Google's word2vec vectors have 300 dimensions):

import numpy as np

try:
    word_vector = model.get_vector('your_word_here')
except KeyError:
    word_vector = np.zeros((300,))
answered 2019-09-16T05:12:49.830

The GoogleNews vector set is a plain mapping of words to vectors. There's no facility in it (or the algorithms that created it) for synthesizing vectors for unknown words.

(Similarly, if you load a plain vector-set into gensim as a KeyedVectors, there's no opportunity to run train() on the resulting object, as you show in your question code. It's not a full trainable model, just a collection of vectors.)

You can check whether a word is available using the `in` keyword. As other answers have noted, you can then choose to substitute some placeholder value (such as an all-zeros vector) for such words.

But it's often better to just ignore such words entirely – pretend they're not even in your text. (Using a zero-vector instead, then feeding that zero-vector into other parts of your system, can make those unknown-words essentially dilute the influence of other nearby word-vectors – which often isn't what you want.)
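The ignore-OOV approach can be sketched as a simple membership filter. A toy dict stands in for the real GoogleNews KeyedVectors here; with the actual gensim model, the membership test works the same way (`word in model`):

```python
import numpy as np

# Toy stand-in for the GoogleNews KeyedVectors: a dict of word -> 300-d vector.
# With the real model, `'my' in model` performs the same membership test.
vectors = {w: np.full(300, float(i)) for i, w in
           enumerate(['my', 'aunt', 'a', 'teddy'])}

sentence = ['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin']

# Drop OOV tokens entirely instead of substituting zero vectors
known = [w for w in sentence if w in vectors]
embedded = np.stack([vectors[w] for w in known])  # shape: (4, 300)
```

This way the unknown words never enter the downstream computation, so they cannot dilute the influence of the surrounding word vectors.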

answered 2019-09-16T18:41:30.933

Awesome! Thank you very much.

import numpy as np

def get_vectorOOV(s):
    try:
        return np.array(model.get_vector(s))
    except KeyError:
        return np.zeros((300,))
answered 2019-09-16T14:05:43.137