
I need to calculate the word vectors for each word of a sentence that is tokenized as follows:

['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin']. 

If I use the pretrained [fastText][1] embeddings cc.en.300.bin.gz from Facebook, I can get vectors for OOV words. However, when I use Google's word2vec from GoogleNews-vectors-negative300.bin, it raises a KeyError. My question is: how do we calculate word vectors for OOV words in that case? I searched online but could not find anything. Of course, one way is to remove every sentence that contains a word not listed in Google's word2vec; however, I noticed only 5550 out of 16134 sentences have all of their words in the embedding.

I also tried:

model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/My Drive/Colab Notebooks/GoogleNews-vectors-negative300.bin', binary=True) 
model.train(sentences_with_OOV_words)

However, this raises an error under TensorFlow 2.

Any help would be greatly appreciated.


3 Answers


If a word is not found in the vocabulary, initialize it with a zero vector of the same size (Google's word2vec vectors have 300 dimensions):

import numpy as np

try:
    word_vector = model.get_vector('your_word_here')
except KeyError:
    word_vector = np.zeros((300,))
answered 2019-09-16T05:12:49.830

The GoogleNews vector set is a plain mapping of words to vectors. There's no facility in it (or the algorithms that created it) for synthesizing vectors for unknown words.

(Similarly, if you load a plain vector-set into gensim as a KeyedVectors, there's no opportunity to run train() on the resulting object, as you show in your question code. It's not a full trainable model, just a collection of vectors.)

You can check whether a word is available using the `in` keyword. As other answers have noted, you can then choose to substitute some placeholder value (such as an all-zeros vector) for such words.

But it's often better to just ignore such words entirely – pretend they're not even in your text. (Using a zero-vector instead, then feeding that zero-vector into other parts of your system, can make those unknown-words essentially dilute the influence of other nearby word-vectors – which often isn't what you want.)
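The ignore-OOV approach can be sketched as a simple membership filter. A toy dict stands in for the real GoogleNews KeyedVectors here; with the actual gensim model, the membership test works the same way (`word in model`):

```python
import numpy as np

# Toy stand-in for the GoogleNews KeyedVectors: a dict of word -> 300-d vector.
# With the real model, `'my' in model` performs the same membership test.
vectors = {w: np.full(300, float(i)) for i, w in
           enumerate(['my', 'aunt', 'a', 'teddy'])}

sentence = ['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin']

# Drop OOV tokens entirely instead of substituting zero vectors
known = [w for w in sentence if w in vectors]
embedded = np.stack([vectors[w] for w in known])  # shape: (4, 300)
```

This way the unknown words never enter the downstream computation, so they cannot dilute the influence of the surrounding word vectors.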

answered 2019-09-16T18:41:30.933

Awesome! Thank you very much.

import numpy as np

def get_vectorOOV(s):
    try:
        return np.array(model.get_vector(s))
    except KeyError:
        return np.zeros((300,))
answered 2019-09-16T14:05:43.137