I'm new to machine learning and computational probability. Here is an example from LingPipe that uses training data to syllabify words.
We are given a source model p(h) for hyphenated words, and a channel model p(w|h) defined so that p(w|h) = 1 if w is equal to h with the hyphens removed and 0 otherwise. We then seek to find the most likely source message h to have produced message w by:
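$$
\begin{aligned}
\operatorname*{argmax}_{h}\; p(h \mid w) &= \operatorname*{argmax}_{h}\; \frac{p(w \mid h)\, p(h)}{p(w)} \\
&= \operatorname*{argmax}_{h}\; p(w \mid h)\, p(h) \\
&= \operatorname*{argmax}_{h \,:\, \mathrm{strip}(h) = w}\; p(h)
\end{aligned}
$$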
where we use strip(h) = w to mean that w is equal to h with the hyphenations stripped out (in Java terms, h.replaceAll(" ","").equals(w)). Thus with a deterministic channel, we wind up looking for the most likely hyphenation h according to p(h), restricting our search to h that produce w when the hyphens are stripped out.
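As far as I can tell, the last line of the derivation amounts to a search like the one below. This is only my own minimal sketch, not LingPipe code: the class name `NoisyChannelDecoder`, the methods `candidates` and `decode`, and the toy scores in `main` are all made up for illustration. It enumerates every h with strip(h) = w by putting a space, or no space, into each gap between characters, then keeps the h with the highest log p(h). Enumerating all 2^(n-1) candidates is only feasible for short words.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

class NoisyChannelDecoder {

    // All strings h with strip(h) equal to w: each of the w.length() - 1 gaps
    // between characters either receives a space (a break) or does not.
    // Brute force, so only suitable for short words (fewer than 32 characters).
    public static List<String> candidates(String w) {
        List<String> out = new ArrayList<>();
        int gaps = Math.max(0, w.length() - 1);
        for (int mask = 0; mask < (1 << gaps); mask++) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < w.length(); i++) {
                sb.append(w.charAt(i));
                if (i < gaps && (mask & (1 << i)) != 0) {
                    sb.append(' ');
                }
            }
            out.add(sb.toString());
        }
        return out;
    }

    // argmax of log p(h) over the candidates; the channel term p(w|h) is 1
    // for every candidate, so it drops out of the comparison.
    public static String decode(String w, ToDoubleFunction<String> logP) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String h : candidates(w)) {
            double score = logP.applyAsDouble(h);
            if (best == null || score > bestScore) {
                best = h;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy source model with made-up log-probabilities (assumption, not real data).
        Map<String, Double> toy = new HashMap<>();
        toy.put("a bide", -1.0);
        toy.put("abide", -2.0);
        toy.put("ab ide", -3.0);
        ToDoubleFunction<String> logP = h -> toy.getOrDefault(h, -100.0);
        System.out.println(decode("abide", logP)); // prints "a bide"
    }
}
```

The open part for me is where a real p(h) comes from, since here it is just a lookup table I invented.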
I don't understand how to use this to build a syllabification model.
Suppose the training set contains:
a bid jan
a bide
a bie
a bil i ty
a bim e lech
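To make the data concrete: each line above is one hyphenation h, with spaces marking the syllable breaks, and the corresponding word w is h with the spaces removed. My guess is that p(h) could be estimated from such lines with a character-level language model. The sketch below assumes a character bigram model with add-one smoothing, which is my own simplification and not necessarily what LingPipe does; the class name `BigramSourceModel` and the boundary markers `^` and `$` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class BigramSourceModel {
    // Hypothetical markers for the start and end of an entry.
    private static final char START = '^';
    private static final char END = '$';

    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private final Map<Character, Integer> contextCounts = new HashMap<>();
    private final Set<Character> alphabet = new HashSet<>();

    // Count character bigrams over each hyphenated training entry.
    // The space (syllable break) is treated as an ordinary character.
    public BigramSourceModel(List<String> hyphenations) {
        alphabet.add(END);
        for (String h : hyphenations) {
            String padded = START + h + END;
            for (int i = 0; i + 1 < padded.length(); i++) {
                char prev = padded.charAt(i);
                char next = padded.charAt(i + 1);
                bigramCounts.merge("" + prev + next, 1, Integer::sum);
                contextCounts.merge(prev, 1, Integer::sum);
                alphabet.add(next);
            }
        }
    }

    // log p(h) as the sum of log p(next | prev) with add-one smoothing.
    // Characters never seen in training still get a finite score, although
    // the smoothing only normalizes over the characters seen in training.
    public double logProb(String h) {
        String padded = START + h + END;
        double total = 0.0;
        for (int i = 0; i + 1 < padded.length(); i++) {
            char prev = padded.charAt(i);
            char next = padded.charAt(i + 1);
            int joint = bigramCounts.getOrDefault("" + prev + next, 0);
            int context = contextCounts.getOrDefault(prev, 0);
            total += Math.log((joint + 1.0) / (context + alphabet.size()));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> training = Arrays.asList(
                "a bid jan", "a bide", "a bie", "a bil i ty", "a bim e lech");
        BigramSourceModel model = new BigramSourceModel(training);
        // Compare two ways of hyphenating the same word.
        System.out.println("a bide: " + model.logProb("a bide"));
        System.out.println("ab ide: " + model.logProb("ab ide"));
    }
}
```

With these five entries, "a bide" scores higher than "ab ide" because the bigrams "a ", " b" and "bi" occur in every training line, while "ab" and "b " never occur.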
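How do I get from this to a model that can syllabify words? In other words, what has to be computed in order to find the likely syllable breaks of a new word?
What is computed first, and what comes next? Could you give a concrete example?
Thank you very much.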