python - Positional encoding leads to worse convergence, language modeling

Question

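This is a tricky one, but I might as well try. I'm implementing the architecture from this paper, https://arxiv.org/pdf/1503.08895.pdf, for language modeling. See page 2 for a diagram, and the top of page 5 for the section on positional or "temporal" encoding. More information on positional encoding can be found at the bottom of page 5 / top of page 6 of https://arxiv.org/pdf/1706.03762.pdf. (I was directed to that second paper by the authors of the first one.)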
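For reference, here is a minimal sketch of the fixed sinusoidal positional encoding described in the second paper, written in plain Python so it runs without Keras. The function name `sinusoidal_encoding` and the toy sizes are assumptions made for illustration. Note that this is not what `T_A` and `T_C` below do: those are learned tables, as in the first paper.

```python
import math

def sinusoidal_encoding(seq_len, embed_dim):
    """Return a seq_len x embed_dim table of fixed sinusoidal position encodings.

    PE(pos, 2i)   = sin(pos / 10000^(2i / embed_dim))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / embed_dim))
    """
    table = []
    for pos in range(seq_len):
        row = []
        for j in range(embed_dim):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (j // 2)) / embed_dim))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Toy sizes, chosen only for illustration.
pe = sinusoidal_encoding(seq_len=4, embed_dim=6)
```

Because the table is fixed, it adds no trainable parameters; each row would simply be added to the embedding at the matching position.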

In short, here is my Keras implementation:

word_seq = Input(shape = (SEQ_LEN,), dtype = "int32", name = "word_seq")

query = Input(shape = (EMBED_DIM, ), dtype = "float32", name = "q_input")
#the query for lang. modeling is a constant vector filled with 0.1, as described at the bottom of page 7 in the first linked paper

T_A = Added_Weights(input_dim = (SEQ_LEN, EMBED_DIM))
#Added_Weights is a custom layer I wrote, which I'll post below
#These are the "positional encoding" components

T_C = Added_Weights(input_dim = (SEQ_LEN, EMBED_DIM))

Emb_A = Embedding(output_dim = EMBED_DIM, input_dim = VOCAB_SIZE, input_length = SEQ_LEN, name = "Emb_A")

Emb_C = Embedding(output_dim = EMBED_DIM, input_dim = VOCAB_SIZE, input_length = SEQ_LEN, name = "Emb_C")

int_state_weights = Dense(units = EMBED_DIM, activation = 'linear',
           kernel_initializer=RandomNormal(mean=0., stddev = 0.05, seed = None))

layer_output = query
#the loop uses the output from the previous layer as the query, but the first layer's query is just that constant vector

for i in range(0, NUM_LAYERS - 1):
    memories = Emb_A(word_seq) #these all re-use the weights instantiated earlier.

    memories = T_A(memories)

    memories = Dropout(DROPOUT_R)(memories)

    content = Emb_C(word_seq)

    content = T_C(content)

    mem_relevance = Dot(axes=[1, 2])([layer_output, memories])

    weighted_internal_state = int_state_weights(mem_relevance)

    mem_relevance = Softmax()(mem_relevance)

    content_relevance = Dot(axes=1)([mem_relevance,
                                content])  # weight each piece of content by its probability of being relevant

    layer_output = Add()([content_relevance, weighted_internal_state])

    layer_output = Dropout(DROPOUT_R)(layer_output)

final_output = Dense(units = VOCAB_SIZE, activation ='relu',
                 kernel_initializer=RandomNormal(mean=0., stddev = 0.05, seed = None))(layer_output)

model = Model(inputs = [word_seq, query], outputs = final_output)
model.compile(optimizer = SGD(lr = 0.01, clipnorm = 50.), loss = 'categorical_crossentropy', metrics = ['accuracy'])
model.fit(x = [td_seqs, td_query], y = [td_labels],
      batch_size = BATCH_SIZE, callbacks = [lr_adjust, lr_termination, for_csv], epochs=200, verbose = 1)
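To make the per-layer computation in the loop above easier to check by hand, here is a plain-Python sketch of a single memory hop. It assumes the layer-wise formulation from the first paper as I read it: p = softmax(u · m_i), o = Σ p_i c_i, and the next query is H u + o. The names `softmax`, `memory_hop`, and `H` and the toy values are hypothetical. One difference worth noting: this sketch applies the linear map `H` to the query, whereas the Keras code above feeds `mem_relevance` into `int_state_weights`.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def memory_hop(u, memories, contents, H):
    """One hop: p = softmax(u . m_i), o = sum_i p_i * c_i, next u = H u + o."""
    dim = len(u)
    # Relevance score of each memory slot against the query.
    scores = [sum(ui * mi for ui, mi in zip(u, m)) for m in memories]
    p = softmax(scores)
    # Weighted sum of the content vectors.
    o = [sum(p[i] * contents[i][d] for i in range(len(contents))) for d in range(dim)]
    # Linear map applied to the query (assumed to follow the paper's H).
    Hu = [sum(H[r][c] * u[c] for c in range(dim)) for r in range(dim)]
    return [Hu[d] + o[d] for d in range(dim)], p

# Toy example: 2 memory slots, embedding size 2, identity H.
u = [0.1, 0.1]                      # constant query, as in the question
memories = [[1.0, 0.0], [0.0, 1.0]]
contents = [[1.0, 2.0], [3.0, 4.0]]
H = [[1.0, 0.0], [0.0, 1.0]]
next_u, p = memory_hop(u, memories, contents, H)
```

With these toy values both slots score the same, so the attention weights are uniform and the output is the average of the two content vectors plus the query.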

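BATCH_SIZE is currently 128. Before I added the T_A and T_C parts, this trained well on roughly 35,000 training samples and finished at 96% accuracy. As soon as I implemented T_A and T_C (the positional encoding), training ended at around 10% accuracy with a training loss of about 5.2. I increased the training data by a factor of 10 and saw no real improvement. Here is my Added_Weights class: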

class Added_Weights(Layer):

    def __init__(self, input_dim, **kwargs):
        super(Added_Weights, self).__init__(**kwargs)
        self.input_dim = input_dim

    def build(self, input_shape):
        # Create a trainable weight variable for this layer.
        self.kernel = self.add_weight(name='kernel',
                                  shape=(self.input_dim[0], self.input_dim[1]),
                                  initializer=RandomNormal(mean=0., stddev=0.05, seed=None),
                                  trainable=True)


        super(Added_Weights, self).build(input_shape)  


    def call(self, x, **kwargs):
        return x + self.kernel

    def compute_output_shape(self, input_shape):
        return input_shape
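As a sanity check on what `x + self.kernel` is expected to do, here is a plain-Python sketch of the same addition. It assumes `x` has shape (batch, SEQ_LEN, EMBED_DIM) and the kernel has shape (SEQ_LEN, EMBED_DIM), so the one table is broadcast across the batch. The helper name `add_position_table` and the toy values are hypothetical.

```python
def add_position_table(batch, table):
    """Add table[i][d] to every sequence in the batch at position i, dimension d."""
    return [
        [[x + t for x, t in zip(vec, table_row)] for vec, table_row in zip(seq, table)]
        for seq in batch
    ]

# Toy shapes: batch of 2, SEQ_LEN = 2, EMBED_DIM = 2.
batch = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
table = [[0.5, 0.0], [0.0, 0.25]]  # stands in for the learned kernel
out = add_position_table(batch, table)
```

Every sequence in the batch receives the same offset at a given position, which is the behavior a learned temporal encoding should have.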

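After reading these two excellent papers, both of which explicitly state that this should work, I'm struggling to understand why it doesn't. If anyone can help figure this out, that would be fantastic.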
