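My bigram language model works fine when a single word is given as input, but when I give my trigram model two words it behaves strangely and predicts "unknown" as the next word. My code: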

def get_unigram_probability(word):
  if word not in unigram:
      return 0
  return unigram[word] / total_words
    
def get_bigram_probability(words):
  if words not in bigram:
      return 0
  return bigram[words] / unigram[words[0]]
    
V = len(vocabulary)

def get_trigram_probability(words):
  if words not in trigram:
      return 0
  # Add-one smoothing: (trigram count + 1) / (context bigram count + V)
  return (trigram[words] + 1) / (bigram[words[:2]] + V)
  

For bigram next-word prediction:

def find_next_word_bigram(words):
  candidate_list = []

  # Calculate probability for each word by looping through them
  for word in vocabulary:
    p2 = get_bigram_probability((words[-1], word))
    candidate_list.append((word, p2))
    
  # Sort so that the most probable words come first
  candidate_list.sort(key=lambda x: x[1], reverse=True)
  # print(candidate_list)
  return candidate_list[0]

For the trigram:

def find_next_word_trigram(words):
  candidate_list = []

  # Calculate probability for each word by looping through them
  for word in vocabulary:
    p3 = get_trigram_probability((words[-2], words[-1], word)) if len(words) >= 3 else 0
    candidate_list.append((word, p3))
    
  # Sort so that the most probable words come first
  candidate_list.sort(key=lambda x: x[1], reverse=True)
  # print(candidate_list)
  return candidate_list[0]

I just want to know where in my code I should make changes so that the trigram model can predict the next word when the input is 2 words long.

1 Answer

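When building the trigrams, use a special BOS (beginning of sentence) token so that short sequences can be handled. Basically, prepend BOS twice to every sentence, like this: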

I like cheese
BOS BOS I like cheese

This way, when you take input from the user, you can prepend BOS BOS to it and complete even very short sequences.
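
The padding idea can be sketched as follows. This is a minimal, self-contained illustration rather than the asker's actual pipeline: the toy corpus, the `Counter`-based tables, and the helper names `pad`, `count_trigrams`, and `predict_next` are all assumptions made for the example. It is also worth noting that in the question's `find_next_word_trigram` the guard `len(words) >= 3` is false for a two-word input, so every candidate gets probability 0 and the first vocabulary entry wins; that would explain the constant prediction if that entry happens to be "unknown". With two BOS tokens prepended, a two-word input has four tokens and passes that guard.

```python
from collections import Counter

BOS = "BOS"  # assumed sentinel; any token that never occurs in real text works

def pad(tokens, n=3):
    """Prepend n-1 BOS tokens so every position has a full (n-1)-word context."""
    return [BOS] * (n - 1) + list(tokens)

def count_trigrams(sentences):
    """Count trigrams and their two-word contexts over BOS-padded sentences."""
    trigram, context = Counter(), Counter()
    for sentence in sentences:
        tokens = pad(sentence)
        for i in range(len(tokens) - 2):
            trigram[tuple(tokens[i:i + 3])] += 1
            context[tuple(tokens[i:i + 2])] += 1
    return trigram, context

def predict_next(words, trigram, context, vocabulary):
    """Return the (word, probability) pair with the highest add-one smoothed score."""
    # Padding the user input guarantees at least two context tokens.
    w1, w2 = pad(words)[-2:]
    V = len(vocabulary)

    def prob(word):
        return (trigram[(w1, w2, word)] + 1) / (context[(w1, w2)] + V)

    return max(((w, prob(w)) for w in vocabulary), key=lambda pair: pair[1])

# Hypothetical toy corpus, only for demonstration.
corpus = [["I", "like", "cheese"], ["I", "like", "bread"], ["I", "like", "cheese"]]
vocabulary = sorted({w for s in corpus for w in s})
trigram, context = count_trigrams(corpus)

print(predict_next(["I", "like"], trigram, context, vocabulary)[0])  # cheese
print(predict_next(["I"], trigram, context, vocabulary)[0])          # like
print(predict_next([], trigram, context, vocabulary)[0])             # I
```

Because the same padding is applied when counting and when predicting, inputs of zero, one, or two words all map to a context that was actually seen during training.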

Answered 2020-11-09T04:03:50.773