keras - First row of the Keras word embedding matrix is all zeros

Question

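I am looking at the Keras GloVe word embedding example, and it is not clear to me why the first row of the embedding matrix is filled with zeros.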

First, the embeddings index is created, mapping each word to its array of coefficients.

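import os
import numpy as np

# Map each word to its GloVe vector (GLOVE_DIR is defined elsewhere in the example).
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, 'f', sep=' ')
        embeddings_index[word] = coefs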

Then the embedding matrix is built by looking up each word from the index created by the tokenizer.

# prepare embedding matrix
num_words = min(MAX_NUM_WORDS, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

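Since the loop starts at i=1, the first row will contain only zeros, or random numbers if the matrix is initialized differently. Is there a reason for skipping the first row?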

score 1 · Accepted Answer

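It all starts from the fact that the authors of Tokenizer reserved index 0 for some reason, perhaps for compatibility (some other languages index from 1) or for coding-technique reasons. The Keras documentation describes 0 as a reserved index that is not assigned to any word, and it is also the default padding value in pad_sequences.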
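To make the reserved index concrete, here is a minimal sketch of a frequency-ranked word index that starts at 1, which is how the tokenizer's word_index behaves. The helper build_word_index and the sample sentences are hypothetical and are not the Keras implementation.

```python
from collections import Counter


def build_word_index(texts):
    """Hypothetical helper: rank words by frequency, starting at index 1.

    Index 0 is deliberately left unassigned, mirroring the reserved index.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() orders by descending count; enumerate from 1, not 0.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


word_index = build_word_index(["the cat sat", "the dog sat", "the end"])
print(word_index)
print(min(word_index.values()))  # smallest assigned index
```

Because no word ever receives index 0, any array indexed directly by these values has an unused slot at position 0.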

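However, they use numpy and want simple indexing:

embedding_matrix[i] = embedding_vector

so the row at index [0] stays all zeros. The case described as "random numbers if the matrix is initialized differently" never arises, because the array is initialized with np.zeros. So the first row is not needed at all, but you cannot delete it, because the numpy array's indices would then no longer line up with the tokenizer's indices.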
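The alignment argument can be checked with a small numpy-only sketch. The word_index and embeddings_index below are made-up stand-ins for the tokenizer output and the GloVe vectors, and EMBEDDING_DIM is shrunk to 4 for readability.

```python
import numpy as np

EMBEDDING_DIM = 4  # small illustrative size (the question uses 100)

# Hypothetical tokenizer output: indices start at 1, 0 is never assigned.
word_index = {"the": 1, "cat": 2, "sat": 3}

# Hypothetical pretrained vectors; "sat" is missing on purpose.
embeddings_index = {
    "the": np.full(EMBEDDING_DIM, 0.1),
    "cat": np.full(EMBEDDING_DIM, 0.2),
}

num_words = len(word_index) + 1  # +1 makes room for the reserved row 0
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

# Row 0 was never written, so it keeps its initial zeros.
row0_is_zero = not embedding_matrix[0].any()

# With row 0 in place, index i returns the vector for word i.
aligned = np.array_equal(embedding_matrix[word_index["the"]], embeddings_index["the"])

# Dropping row 0 shifts every row up by one, so the index of "the" now hits "cat".
trimmed = embedding_matrix[1:]
misaligned = np.array_equal(trimmed[word_index["the"]], embeddings_index["cat"])

print(row0_is_zero, aligned, misaligned)
```

All three flags come out True here: row 0 stays zero, lookups line up while row 0 is kept, and they shift by one as soon as it is removed.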
