
I am trying to use the TensorFlow LSTM model to make next-word predictions.

As described in this related question (which has no accepted answer), the tutorial example contains pseudocode for extracting next-word probabilities:

lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])

loss = 0.0
for current_batch_of_words in words_in_dataset:
  # The value of state is updated after processing each batch of words.
  output, state = lstm(current_batch_of_words, state)

  # The LSTM output can be used to make next word predictions
  logits = tf.matmul(output, softmax_w) + softmax_b
  probabilities = tf.nn.softmax(logits)
  loss += loss_function(probabilities, target_words)

I am confused about how to interpret the probabilities vector. I modified the __init__ function of the PTBModel in ptb_word_lm.py to store the probabilities and logits:

class PTBModel(object):
  """The PTB model."""

  def __init__(self, is_training, config):
    # General definition of LSTM (unrolled)
    # identical to tensorflow example ...     
    # omitted for brevity ...


    # computing the logits (also from example code)
    logits = tf.nn.xw_plus_b(output,
                             tf.get_variable("softmax_w", [size, vocab_size]),
                             tf.get_variable("softmax_b", [vocab_size]))
    loss = seq2seq.sequence_loss_by_example([logits],
                                            [tf.reshape(self._targets, [-1])],
                                            [tf.ones([batch_size * num_steps])],
                                            vocab_size)
    self._cost = cost = tf.reduce_sum(loss) / batch_size
    self._final_state = states[-1]

    # my addition: storing the probabilities and logits
    self.probabilities = tf.nn.softmax(logits)
    self.logits = logits

    # more model definition ...

I then printed some information about them in the run_epoch function:

def run_epoch(session, m, data, eval_op, verbose=True):
  """Runs the model on the given data."""
  # first part of function unchanged from example

  for step, (x, y) in enumerate(reader.ptb_iterator(data, m.batch_size,
                                                    m.num_steps)):
    # evaluate probability and logit tensors too:
    cost, state, probs, logits, _ = session.run([m.cost, m.final_state, m.probabilities, m.logits, eval_op],
                                 {m.input_data: x,
                                  m.targets: y,
                                  m.initial_state: state})
    costs += cost
    iters += m.num_steps

    if verbose and step % (epoch_size // 10) == 10:
      print("%.3f perplexity: %.3f speed: %.0f wps, n_iters: %s" %
            (step * 1.0 / epoch_size, np.exp(costs / iters),
             iters * m.batch_size / (time.time() - start_time), iters))
      chosen_word = np.argmax(probs, 1)
      print("Probabilities shape: %s, Logits shape: %s" % 
            (probs.shape, logits.shape) )
      print(chosen_word)
      print("Batch size: %s, Num steps: %s" % (m.batch_size, m.num_steps))

  return np.exp(costs / iters)

This produces output like this:

0.000 perplexity: 741.577 speed: 230 wps, n_iters: 220
(20, 10000) (20, 10000)
[ 14   1   6 589   1   5   0  87   6   5   3   5   2   2   2   2   6   2  6   1]
Batch size: 1, Num steps: 20

I was expecting the probs vector to be an array of probabilities, with one for each word in the vocabulary (e.g. with shape (1, vocab_size)), meaning that I could get the predicted word using np.argmax(probs, 1) as suggested in the other question.

However, the first dimension of the vector is actually equal to the number of steps in the unrolled LSTM (20 if the small config settings are used), which I'm not sure what to do with. To access the predicted word, do I just need to use the last value (because it's the output of the final step)? Or is there something else that I'm missing?

I tried to understand how the predictions are made and evaluated by looking at the implementation of seq2seq.sequence_loss_by_example, which must perform this evaluation, but this ends up calling gen_nn_ops._sparse_softmax_cross_entropy_with_logits, which doesn't seem to be included in the GitHub repo, so I'm not sure where else to look.

I'm quite new to both tensorflow and LSTMs, so any help is appreciated!


2 Answers


The output tensor contains the concatenation of the LSTM cell outputs for each timestep (see its definition here). Therefore you can find the prediction for the next word by taking chosen_word[-1] (or chosen_word[sequence_length - 1] if the sequence has been padded to match the unrolled LSTM).
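
For example, with the shapes from the question (batch_size = 1, num_steps = 20), a minimal sketch of reading off the final-timestep prediction from the probs array returned by session.run could look like this (illustrative only):

import numpy as np

# probs has shape [num_steps, vocab_size] when batch_size == 1, so the
# last row is the distribution over the word that follows the final
# input word of the sequence.
last_step_probs = probs[-1]
next_word_id = int(np.argmax(last_step_probs))
print(next_word_id, last_step_probs[next_word_id])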

The tf.nn.sparse_softmax_cross_entropy_with_logits() op is documented in the public API under a different name. For technical reasons, it calls a generated wrapper function that does not appear in the GitHub repository. The implementation of the op is in C++, here.
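
As a rough sketch of using that public op directly on the question's tensors (the positional signature shown here matches the TensorFlow releases of that era; newer releases require the keyword form labels=/logits=):

# logits: [batch_size * num_steps, vocab_size]
# targets: [batch_size * num_steps] of word ids
targets = tf.reshape(self._targets, [-1])
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits, targets)
cost = tf.reduce_sum(losses) / batch_size  # same reduction as the example code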

answered 2016-03-29T16:27:36.570

I am also implementing a seq2seq model.

So let me try to explain it with my understanding:

The output of the LSTM model is a list (of length num_steps) of 2D tensors of size [batch_size, size].

This line of code:

output = tf.reshape(tf.concat(1, outputs), [-1, size])

will produce a new output, which is a 2D tensor of size [batch_size x num_steps, size].

In your case, batch_size = 1 and num_steps = 20, so the output has shape [20, size].
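
A small NumPy sketch of the same shape bookkeeping (with a made-up hidden size):

import numpy as np

batch_size, num_steps, size = 1, 20, 200
# one [batch_size, size] array per timestep, as produced by the unrolled LSTM
outputs = [np.zeros((batch_size, size)) for _ in range(num_steps)]

# equivalent of tf.reshape(tf.concat(1, outputs), [-1, size])
output = np.concatenate(outputs, axis=1).reshape(-1, size)
print(output.shape)  # (20, 200), i.e. [batch_size x num_steps, size]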

This line of code:

logits = tf.nn.xw_plus_b(output, tf.get_variable("softmax_w", [size, vocab_size]), tf.get_variable("softmax_b", [vocab_size]))

<=> output [batch_size x num_steps, size] x softmax_w [size, vocab_size] will produce logits of size [batch_size x num_steps, vocab_size].
In your case that is [20, vocab_size], and the probs tensor has the same size as logits, i.e. [20, vocab_size].
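
To make that concrete: each row of probs is a full distribution over the vocabulary for one timestep, which you can check with another small NumPy sketch (made-up numbers):

import numpy as np

num_steps, vocab_size = 20, 10000
logits = np.random.randn(num_steps, vocab_size)

# row-wise softmax, like tf.nn.softmax over the last dimension
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

print(probs.shape)            # (20, 10000)
print(probs.sum(axis=1)[:3])  # each row sums to 1.0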

This line of code:

chosen_word = np.argmax(probs, 1)

will produce the chosen_word array of shape [20], where each value is the predicted index of the next word at that timestep.
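
Each of those values is a vocabulary index, so to see actual words you have to invert the vocabulary mapping yourself; a sketch, assuming word_to_id is the dict the PTB reader builds from the training data:

# invert the reader's word-to-id mapping once
id_to_word = {v: k for k, v in word_to_id.items()}

predicted_words = [id_to_word[i] for i in chosen_word]
print(predicted_words[-1])  # the prediction that follows the last input word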

This line of code:

loss = seq2seq.sequence_loss_by_example([logits], [tf.reshape(self._targets, [-1])], [tf.ones([batch_size * num_steps])])

computes the softmax cross-entropy loss for the batch of sequences.
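
Ignoring the weights (which are all ones here), that per-example loss is just the negative log-probability of each target word; a NumPy sketch of the idea:

import numpy as np

def sequence_loss_by_example_np(probs, targets):
  # probs: [batch_size * num_steps, vocab_size], targets: [batch_size * num_steps]
  # cross-entropy of each target word under the predicted distribution
  return -np.log(probs[np.arange(len(targets)), targets])

# made-up uniform predictions over a 10000-word vocabulary
probs = np.full((20, 10000), 1.0 / 10000)
targets = np.zeros(20, dtype=int)
losses = sequence_loss_by_example_np(probs, targets)
print(losses.shape, np.exp(losses.mean()))  # (20,), perplexity == vocab_size for uniform probs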

answered 2016-04-01T16:07:35.277