python - 给定输入大小为 2 个单词，三元组预测下一个单词的行为应该是什么？

Question

当输入中给出一个单词时，我的二元语言模型工作正常，但是当我给我的三元模型提供两个单词时，它的行为很奇怪，并预测下一个单词是“未知”。 我的代码：

def get_unigram_probability(word):
  if word not in unigram:
      return 0
  return unigram[word] / total_words
    
def get_bigram_probability(words):
  if words not in bigram:
      return 0
  return bigram[words] / unigram[words[0]]
    
V = len(vocabulary)

def get_trigram_probability(words):
  if words not in trigram:
      return 0
  return trigram[words] + 1 / bigram[words[:2]] + V

对于 bi-gram 下一个词预测：

def find_next_word_bigram(words):
  candidate_list = []

  # Calculate probability for each word by looping through them
  for word in vocabulary:
    p2 = get_bigram_probability((words[-1], word))
    candidate_list.append((word, p2))
    
  # sort the list with words with often occurence in the beginning
  candidate_list.sort(key=lambda x: x[1], reverse=True)
  # print(candidate_list)
  return candidate_list[0]

对于三元组：

def find_next_word_trigram(words):
  candidate_list = []

  # Calculate probability for each word by looping through them
  for word in vocabulary:
    p3 = get_trigram_probability((words[-2], words[-1], word)) if len(words) >= 3 else 0
    candidate_list.append((word, p3))
    
  # sort the list with words with often occurence in the beginning
  candidate_list.sort(key=lambda x: x[1], reverse=True)
  # print(candidate_list)
  return candidate_list[0]

我只想知道我应该在代码中的哪个位置进行更改，以便三元组可以预测给定输入大小为 2 个单词的下一个单词。

score 0 · Accepted Answer

构建三元组时，请使用特殊的 BOS（句子开头）标记，以便处理短序列。基本上每句前加两次BOS，像这样：

I like cheese
BOS BOS I like cheese

这样，当您从用户那里获取输入时，您可以预先BOS BOS添加它，并且能够完成甚至很短的序列。

python - 给定输入大小为 2 个单词，三元组预测下一个单词的行为应该是什么？

1 回答 1

Related

Reference