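I'm having trouble mapping words to vectors with GloVe. My code seems to work, with one strange exception: looking up the vector for one specific word, 'the', raises an error. I can't figure out why.
This is the code I use to read the GloVe file: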
import numpy as np

def read_glove_vecs(glove_file):
    # Each line of the GloVe file is: <word> <v1> <v2> ... <vN>
    with open(glove_file, 'r', encoding='utf-8', errors='ignore') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
    # Assign 1-based indices to the words in sorted order
    i = 1
    words_to_index = {}
    index_to_words = {}
    for w in sorted(words):
        words_to_index[w] = i
        index_to_words[i] = w
        i = i + 1
    return words_to_index, index_to_words, word_to_vec_map
As you can see, the function above returns the variable "word_to_vec_map", which should map the words in my training set to their GloVe representations.
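To make the intended use concrete, here is a minimal sketch of how a training sentence would be looked up word by word. The helper name `sentence_to_vectors` and the toy dictionary are hypothetical and only stand in for the real `word_to_vec_map`; lowercasing before the lookup is an assumption based on the lowercase keys used in the lookups in this post.

```python
def sentence_to_vectors(sentence, word_to_vec_map):
    """Look up every lowercased token of `sentence`.

    Returns a pair: the vectors that were found, and the words that
    had no entry in the map.
    """
    found, missing = [], []
    for word in sentence.lower().split():
        if word in word_to_vec_map:
            found.append(word_to_vec_map[word])
        else:
            missing.append(word)
    return found, missing


# Toy stand-in for the real GloVe map (hypothetical two-dimensional vectors).
toy_map = {"i": [0.1, 0.2], "love": [0.3, 0.4], "you": [0.5, 0.6]}

vectors, missing = sentence_to_vectors("I love you mum", toy_map)
print(len(vectors), missing)
```

Collecting the missing words instead of raising makes it easy to see which tokens of the training set have no vector.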
Here is a snippet of the training set:
I am proud of your achievements,2,,
Miss you so much,0,, [0]
food is life,4,,
I love you mum,0,,
Stop saying bullshit,3,,
congratulations on your acceptance,2,,
The assignment is too long ,3,,
I want to go play,1,, [3]
Mapping words with word_to_vec_map seems to work:
print(word_to_vec_map['proud'])
[-0.5918 0.27671 -0.46971 -0.54743 1.3504 -0.63907 -0.6819
0.54207 -0.40552 0.11271 0.1564 0.21604 -0.035073 -0.30228
0.15753 -0.10437 0.64561 1.0843 0.28788 -0.24031 -1.2893
0.82949 -0.44547 0.11085 1.1249 -1.5474 -1.3967 0.1393
0.23133 -0.46974 1.5829 0.87095 0.13645 0.047461 -0.37914
-0.45608 0.033173 0.39443 -0.67186 -0.92765 -0.19048 -0.59441
-0.046391 0.14051 0.032863 0.42813 -1.3888 -0.20055 -0.26487
0.57981 ]
print(word_to_vec_map['much'])
[ 0.36999 0.082841 0.16883 -0.50223 0.37935 0.13343
-0.32527 -0.17964 -0.40393 0.58149 -0.14505 0.1399
-0.1566 -0.60951 0.62075 0.5596 0.35677 0.25654
-0.33583 -0.82497 -0.11897 0.21829 0.27755 -0.38194
0.54374 -1.7705 -0.74366 0.40402 0.88709 -0.021368
3.7891 0.39953 0.51627 -0.48584 -0.052367 -0.28135
-0.60422 0.46096 0.11491 -0.49699 -0.34498 0.38645
0.14052 0.43843 -0.33583 0.13546 -0.12158 0.0053184
-0.50853 0.24986 ]
print(word_to_vec_map['miss'])
[-3.2273e-01 5.6182e-01 -6.6363e-01 3.8883e-01 -4.6558e-02 2.2328e-01
-7.5691e-01 7.0853e-01 5.5714e-01 -5.9996e-02 3.1235e-01 1.6741e-01
-5.4568e-01 -3.8765e-01 1.2309e+00 3.4766e-01 -5.0017e-02 -4.9804e-02
-6.6282e-01 2.2854e-01 -7.8443e-01 6.5823e-01 5.6099e-01 3.3218e-01
5.3049e-01 -1.3611e+00 -4.9452e-01 2.7711e-01 -2.2982e-01 -1.1492e+00
1.5028e+00 1.0916e+00 -9.8464e-02 3.9349e-04 2.5753e-01 -1.5470e-01
2.7595e-01 6.4750e-01 -5.6537e-02 -1.3046e+00 -5.8200e-01 1.2838e-01
-1.1416e-01 -8.0836e-01 -8.3921e-01 2.5609e-01 1.5629e-01 -9.7299e-01
1.1130e-01 4.4500e-01]
But then:
print(word_to_vec_map['the'])
KeyError Traceback (most recent call last)
<ipython-input-24-ebc9756c0cc8> in <module>
----> 1 print(word_to_vec_map['the'])
KeyError: 'the'
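One check that might narrow this down is to look for keys that only appear to be missing because an invisible character is attached to them. This is a sketch under an assumption, not a confirmed diagnosis: if 'the' sits on the first line of the GloVe file and the file starts with a byte order mark (BOM), then reading it with encoding='utf-8' would store the key as '\ufeffthe' rather than 'the'. The helper name `find_disguised_keys` and the toy dictionary are hypothetical.

```python
import unicodedata


def find_disguised_keys(mapping, target):
    """Return keys that equal `target` only after invisible characters
    (Unicode control, format, and separator categories) are removed."""
    hits = []
    for key in mapping:
        cleaned = "".join(
            ch for ch in key if unicodedata.category(ch)[0] not in ("C", "Z")
        )
        if key != target and cleaned == target:
            hits.append(key)
    return hits


# Toy stand-in for word_to_vec_map: the first key carries a BOM (U+FEFF).
toy_map = {"\ufeffthe": [0.1, 0.2], "proud": [0.3, 0.4]}

disguised = find_disguised_keys(toy_map, "the")
# repr() makes the hidden character visible as an escape sequence.
print([repr(k) for k in disguised])
```

If this reports a key with a leading '\ufeff' on the real map, opening the file with encoding='utf-8-sig' instead of 'utf-8' would be worth trying, since that codec skips a leading BOM while decoding.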
Does anyone know why this happens? Why can't I map this particular word?