I've recently been studying speech recognition, and I learned that the idea behind prefix beam search is to merge paths that share the same prefix, e.g. [1,1,_] and [_,1,_] (here _ denotes the blank token).
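For concreteness, here is a minimal sketch of the collapsing rule as I understand it (the helper name `collapse` and the toy labels are my own):

```python
def collapse(path, blank='_'):
    # collapse repeats and drop blanks, e.g. ['1','1','_'] -> ('1',)
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return tuple(out)

# Both alignments collapse to the same prefix ('1',), so prefix beam
# search sums their probabilities into a single beam entry.
assert collapse(['1', '1', '_']) == collapse(['_', '1', '_']) == ('1',)
```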
Based on this understanding, I implemented my own version, which can be simplified into the following pseudocode:
```python
import numpy as np
from collections import defaultdict

ninf = float('-inf')

def logsumexp(*args):
    # numerically stable log(exp(a1) + exp(a2) + ...)
    a_max = max(args)
    if a_max == ninf:
        return ninf
    return a_max + np.log(sum(np.exp(a - a_max) for a in args))

def prefix_beam_search(y, beam_size, blank):
    seq_len, n_class = y.shape
    log_y = np.log(y)
    beam = [(tuple(), 0)]  # each entry: (prefix, log-probability)
    for t in range(seq_len):
        buff = []
        for prefix, p in beam:
            for i in range(n_class):
                new_prefix = prefix + (i,)
                new_p = p + log_y[t, i]
                buff.append((new_prefix, new_p))
        # merge the paths that collapse to the same prefix
        new_beam = defaultdict(lambda: ninf)
        for prefix, p in buff:
            # 'norm_prefix' collapses the path, (1,1,_,2) ==> (1,2);
            # however, a trailing 'blank' is retained, (1,1,_) ==> (1,_)
            prefix = norm_prefix(prefix, blank)
            new_beam[prefix] = logsumexp(new_beam[prefix], p)
        # keep the best beam_size prefixes
        new_beam = sorted(new_beam.items(), key=lambda x: x[1], reverse=True)
        beam = new_beam[:beam_size]
    return beam
```
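(`norm_prefix` is not shown above; a plausible implementation matching the comments, plus a toy invocation on random data of my own, would be:)

```python
def norm_prefix(path, blank):
    # collapse repeats and interior blanks, keeping a trailing blank
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    if path and path[-1] == blank:
        out.append(blank)  # (1,1,_) ==> (1,_)
    return tuple(out)

# Toy call: 5 frames, 3 classes, with class 2 used as the blank.
rng = np.random.default_rng(0)
y = rng.random((5, 3))
y /= y.sum(axis=1, keepdims=True)  # make each row a distribution
print(prefix_beam_search(y, beam_size=3, blank=2))
```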
However, most of the versions I found online (which follow the paper) look like this:
```python
def _prefix_beam_decode(y, beam_size, blank):
    T, V = y.shape
    log_y = np.log(y)
    # each entry: (prefix, (p_b, p_nb)), the log-probabilities of the
    # prefix over paths ending in blank / in a non-blank, respectively
    beam = [(tuple(), (0, ninf))]
    for t in range(T):
        new_beam = defaultdict(lambda: (ninf, ninf))
        for prefix, (p_b, p_nb) in beam:
            for i in range(V):
                p = log_y[t, i]
                if i == blank:
                    # a blank never extends the prefix; it only feeds p_b
                    new_p_b, new_p_nb = new_beam[prefix]
                    new_p_b = logsumexp(new_p_b, p_b + p, p_nb + p)
                    new_beam[prefix] = (new_p_b, new_p_nb)
                    continue
                end_t = prefix[-1] if prefix else None
                new_prefix = prefix + (i,)
                new_p_b, new_p_nb = new_beam[new_prefix]
                if i != end_t:
                    new_p_nb = logsumexp(new_p_nb, p_b + p, p_nb + p)
                else:
                    # a repeated character extends the prefix only via
                    # paths that end in a blank
                    new_p_nb = logsumexp(new_p_nb, p_b + p)
                new_beam[new_prefix] = (new_p_b, new_p_nb)
                if i == end_t:
                    # otherwise the repeat merges into the unchanged prefix
                    new_p_b, new_p_nb = new_beam[prefix]
                    new_p_nb = logsumexp(new_p_nb, p_nb + p)
                    new_beam[prefix] = (new_p_b, new_p_nb)
        # rank prefixes by their total probability p_b + p_nb
        beam = sorted(new_beam.items(), key=lambda x: logsumexp(*x[1]), reverse=True)
        beam = beam[:beam_size]
    return beam
```
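Both can be exercised on the same toy input (the random matrix is just my own example):

```python
rng = np.random.default_rng(42)
y = rng.random((20, 4))
y /= y.sum(axis=1, keepdims=True)  # 20 frames, 4 classes, class 3 as blank
print(prefix_beam_search(y, beam_size=3, blank=3)[0])
print(_prefix_beam_decode(y, beam_size=3, blank=3)[0])
```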
The two give different results: in particular, my version tends to return longer strings. There are two main points I don't understand:

- Is there some detail that my version fails to account for?
- The standard version always generates the new prefix `new_prefix = prefix + (i,)`, regardless of whether the last symbol of the old prefix equals the incoming character `s`. For example, when the old prefix is [a,a,b] and a new character s = b arrives, both [a,a,b] and [a,a,b,b] are kept (see the sketch after this list). What is the purpose of doing this? Doesn't it lead to double counting?
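To make the second question concrete, here is a hand-computed single step with toy numbers of my own, showing the two table entries the standard version updates when `i == end_t`:

```python
import numpy as np

# Scores of prefix ('a','a','b') carried over from frame t-1 (toy numbers).
p_b, p_nb = np.log(0.4), np.log(0.2)
p = np.log(0.5)    # log P(character 'b' at frame t)

extend = p_b + p   # feeds p_nb of ('a','a','b','b'): the new 'b' follows a blank
merge  = p_nb + p  # feeds p_nb of ('a','a','b'): the new 'b' repeats the last one
print(np.exp(extend), np.exp(merge))  # 0.2 and 0.1
```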
Looking forward to your answers; thanks in advance!