I am trying to run an RNN beam search on a tf.keras.Model in a vectorized way, so that it runs entirely on the GPU. However, even though everything is inside tf.function and as vectorized as I can make it, it runs at exactly the same speed with or without a GPU. Attached is a minimal example with a fake model. In reality, for n=32, k=32, steps=128, which is what I actually want to use, it takes 20 seconds to decode (per n=32 samples), on both CPU and GPU!
Surely I must be missing something. When I train the model, on the GPU a training iteration (128 steps) with batch size 512 takes 100 ms, while on the CPU a training iteration with batch size 32 takes 1 s. The GPU is not saturated at batch size 512. I am aware that I have overhead from running the steps individually and performing blocking operations per step, but in terms of computation, that overhead should be negligible compared to the rest of the model.
I also know that using a tf.keras.Model in this way may not be ideal, but is there another way to wire output tensors back to input tensors via a function, and in particular to rewire the states?
Full working example: https://gist.github.com/meowcat/e3eaa4b8543a7c8444f4a74a9074b9ae
import tensorflow as tf
from tensorflow import math as tm

@tf.function
def decode_beam(states_init, scores_init, y_init, steps, k, n):
    states = states_init
    scores = scores_init
    xstep = embed_y_to_x(y_init)
    # Keep the results in TensorArrays
    y_chain = tf.TensorArray(dtype="int32", size=steps)
    sequences_chain = tf.TensorArray(dtype="int32", size=steps)
    scores_chain = tf.TensorArray(dtype="float32", size=steps)
    for i in range(steps):
        # model_decode is the trained model with 3.5 million trainable params.
        # Run a single step of the RNN model.
        y, states = model_decode([xstep, states])
        # Add the scores of this step to the accumulated scores
        # (I left out the sequence-end killer for this demo)
        scores_y = tf.expand_dims(tf.reshape(scores, y.shape[:-1]), 2) + tm.log(y)
        # Reshape into (n, k*tokens) and find the best k sequences to continue
        # for each of the n candidates
        scores_y = tf.reshape(scores_y, [n, -1])
        top_k = tm.top_k(scores_y, k, sorted=False)
        # Transform the indices. I was using tf.unravel_index, but
        # tf.debugging.set_log_device_placement(True) indicated that it would
        # be placed on the CPU, so I rewrote it:
        top_k_index = tf.reshape(
            top_k[1] + tf.reshape(tf.range(n), (-1, 1)) * scores_y.shape[1], [-1])
        ysequence = top_k_index // y.shape[2]
        ymax = top_k_index % y.shape[2]
        # This gives us two (n*k,) tensors with the parent sequence (ysequence)
        # and the chosen character (ymax) per sequence.
        # For continuation, pick the corresponding states and "return" the scores
        states = tf.gather(states, ysequence)
        scores = tf.reshape(top_k[0], [-1])
        # Write the results into the TensorArrays,
        # and embed for the next step
        xstep = embed_y_to_x(ymax)
        y_chain = y_chain.write(i, ymax)
        sequences_chain = sequences_chain.write(i, ysequence)
        scores_chain = scores_chain.write(i, scores)
    # Done: stack up the results and return them
    sequences_final = sequences_chain.stack()
    y_final = y_chain.stack()
    scores_final = scores_chain.stack()
    return sequences_final, y_final, scores_final
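For reference, the index arithmetic used above in place of tf.unravel_index can be checked in plain Python (a minimal sketch; the sizes and top-k indices below are made up for illustration):

```python
n, k, tokens = 2, 2, 3             # illustrative sizes, not the real n=32, k=32
row_width = k * tokens             # corresponds to scores_y.shape[1] above

# Per-row top-k indices, as top_k[1] would return them (illustrative values):
top_k_rows = [[1, 4], [0, 5]]

# Add the row offset, like top_k[1] + tf.range(n)[:, None] * scores_y.shape[1],
# then flatten:
flat = [idx + row * row_width
        for row, row_idxs in enumerate(top_k_rows)
        for idx in row_idxs]

ysequence = [i // tokens for i in flat]  # parent beam index into the n*k states
ymax = [i % tokens for i in flat]        # chosen token id within the vocabulary

print(flat)       # [1, 4, 6, 11]
print(ysequence)  # [0, 1, 2, 3]
print(ymax)       # [1, 1, 0, 2]
```

The point is that each flat index encodes (parent beam, token) as parent * tokens + token, so integer division and modulo recover both, entirely with ops that stay on the GPU.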