python - Extracting word/sentence probabilities from the lm_1b trained model

Question

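I have successfully downloaded the 1B word language model trained with a CNN-LSTM (https://github.com/tensorflow/models/tree/master/research/lm_1b), and I would like to be able to input sentences or partial sentences and get the probability of each subsequent word in the sentence.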

For example, if I have a sentence like "An animal that says", I would like to know the probability that the next word is "woof" versus "meow".

I understand that running the following command produces the LSTM embeddings:

bazel-bin/lm_1b/lm_1b_eval --mode dump_lstm_emb \
                           --pbtxt data/graph-2016-09-10.pbtxt \
                           --vocab_file data/vocab-2016-09-10.txt \
                           --ckpt 'data/ckpt-*' \
                           --sentence "An animal that says woof" \
                           --save_dir output

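This produces the files lstm_emb_step_*.npy, where each file is the LSTM embedding for one word of the sentence. How can I turn these into probabilities under the trained model, so that I can compare P(woof|An animal that says) with P(meow|An animal that says)?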

Thanks in advance.

score 0 · Accepted Answer

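I wanted to do the same thing, and this is what I came up with, adapted from some of their demo code. I'm not entirely sure it is correct, but it seems to produce reasonable values.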

import numpy as np

# Assumption: same values as the constants in the repo's lm_1b_eval.py demo.
BATCH_SIZE = 1
NUM_TIMESTEPS = 1


def get_probability_of_next_word(sess, t, vocab, prefix_words, query):
  """
  Return the probability of the given word based on the sequence of prefix
  words.

  :param sess: TensorFlow session object
  :param dict t: Maps names to graph tensors (e.g. t['softmax_out'])
  :param vocab: Vocabulary model, maps id <-> string, stores max word char id length
  :param list prefix_words: List of words that appear before this one.
  :param str query: The query word
  """
  targets = np.zeros([BATCH_SIZE, NUM_TIMESTEPS], np.int32)
  weights = np.ones([BATCH_SIZE, NUM_TIMESTEPS], np.float32)

  # Prepend the sentence-start token on a copy, so the caller's list is
  # not modified.
  if not prefix_words or prefix_words[0] != "<S>":
    prefix_words = ["<S>"] + list(prefix_words)

  prefix = [vocab.word_to_id(w) for w in prefix_words]
  prefix_char_ids = [vocab.word_to_char_ids(w) for w in prefix_words]

  inputs = np.zeros([BATCH_SIZE, NUM_TIMESTEPS], np.int32)
  char_ids_inputs = np.zeros(
    [BATCH_SIZE, NUM_TIMESTEPS, vocab.max_word_length], np.int32)

  # Note: only the first prefix token is fed to the graph here; the
  # remaining entries of prefix and prefix_char_ids are never used.
  inputs[0, 0] = prefix[0]
  char_ids_inputs[0, 0, :] = prefix_char_ids[0]
  softmax = sess.run(t['softmax_out'],
                     feed_dict={t['char_inputs_in']: char_ids_inputs,
                                t['inputs_in']: inputs,
                                t['targets_in']: targets,
                                t['target_weights_in']: weights})

  # Row 0 is the single batch entry; index it by the query word's id.
  return softmax[0, vocab.word_to_id(query)]

Example usage

vocab = CharsVocabulary(vocab_path, MAX_WORD_LEN)
sess, t = LoadModel(model_path, ckptdir + "/ckpt-*")
result = get_probability_of_next_word(sess, t, vocab, ["Hello", "my", "friend"], "for")

This gives the result 8.811023e-05. Note that CharsVocabulary and LoadModel are slightly different from what is in the repo.
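One thing worth checking: the function feeds only the first token (`prefix[0]`) to the graph, so the value it returns does not depend on the later prefix words. Below is a minimal sketch of feeding the whole prefix one token at a time. It assumes, as the repo's demo sampling code appears to, that the LSTM state persists between `sess.run` calls. `step_fn` and `reset_fn` are hypothetical wrappers, not part of lm_1b: `step_fn(word)` would fill `inputs[0, 0]` and `char_ids_inputs[0, 0, :]` for one word and return `softmax[0]`, and `reset_fn()` would run whatever state-reset op the loaded graph exposes.

```python
def next_word_distribution(step_fn, reset_fn, prefix_words):
    """Feed every prefix word in order and return the last softmax row.

    step_fn(word): hypothetical wrapper around one sess.run call; returns
        the probabilities over the vocabulary after consuming `word`.
    reset_fn(): hypothetical wrapper that resets the recurrent state.
    """
    # Work on a copy so the caller's list is not modified.
    words = list(prefix_words)
    if not words or words[0] != "<S>":
        words.insert(0, "<S>")

    # Start from a clean state so earlier queries do not leak in.
    reset_fn()

    distribution = None
    for word in words:
        # Each call advances the recurrent state by one token; only the
        # distribution produced after the final prefix word is kept.
        distribution = step_fn(word)
    return distribution
```

The returned row can then be indexed with `vocab.word_to_id(query)`, as in the function above.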

Also note that this function is very slow. Maybe someone knows how to improve it.
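One possible saving, not measured here: the softmax row is indexed by vocabulary id, so a single pass already contains the probability of every candidate word. Candidates such as "woof" and "meow" can therefore be read from one row instead of re-running the model per query. The sketch below assumes `distribution` is that row and `word_to_id` stands for `vocab.word_to_id`. The helper names and the toy numbers are made up for illustration and are not model output.

```python
import math


def score_candidates(distribution, word_to_id, candidates):
    """Look up several candidate next words in one softmax row."""
    return {word: float(distribution[word_to_id(word)]) for word in candidates}


def log_odds(scores, a, b):
    """Natural-log ratio of two candidate probabilities (> 0 favours a)."""
    return math.log(scores[a]) - math.log(scores[b])


# Toy stand-ins for the vocabulary and one softmax row (made-up numbers).
toy_vocab = {"woof": 0, "meow": 1, "moo": 2}
toy_row = [0.6, 0.3, 0.1]

scores = score_candidates(toy_row, toy_vocab.__getitem__, ["woof", "meow"])
print(scores)
print(log_odds(scores, "woof", "meow"))
```

Comparing log probabilities rather than raw values also avoids working with very small numbers such as the 8.811023e-05 above.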
