If you really want a character-based model, you won't get natural-looking text as output, but it is certainly possible, and such a model is fundamentally able to deal with sequences of whitespace characters as well. There is no need to remove them from the input if you consider them a natural part of the text.
What is important is that a Markov model does not always fall back to predicting the single character that has the highest probability at any given stage. Instead, it must look at the entire probability distribution of possible characters and pick one at random.
Here, "at random" means that it picks a character that is not predetermined by the programmer. Still, the random distribution is not a uniform distribution, i.e. not all characters are equally likely: it must take into account the relative probabilities of the various possible characters. One way to do this is to generate a cumulative probability distribution of the characters. For example, if the probabilities are
p('a') == 0.2
p('b') == 0.4
p('c') == 0.4
we represent them as
p('a') == 0.2
p('b') == p('a') + 0.4 == 0.6
p('c') == p('a') + p('b') == 1.0
Then, to generate one random character, we first generate a uniformly distributed random number N between 0 and 1, and pick the first character whose cumulative probability is not less than N.
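To make that concrete, here is a minimal, self-contained sketch of just this sampling step. It is not part of the full implementation further down; the hard-coded cumulative values are simply the a/b/c example above:
#include <algorithm>
#include <iostream>
#include <random>
#include <utility>
#include <vector>

int main()
{
  /* Cumulative distribution from the example above:
     p('a') == 0.2, p('b') == 0.6, p('c') == 1.0 */
  std::vector<std::pair<char,float>> cumulative {
    { 'a', 0.2f }, { 'b', 0.6f }, { 'c', 1.0f }
  };

  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_real_distribution<float> dist { 0, 1 };

  for (unsigned i = 0 ; i < 20 ; ++i) {
    float n = dist(gen);
    /* Pick the first character whose cumulative probability is not
       smaller than n. */
    auto it = std::find_if(std::begin(cumulative),std::end(cumulative),
                           [n](const std::pair<char,float> &p){
                             return p.second >= n;
                           });
    std::cout << it->first;
  }
  std::cout << std::endl;
}
Running this prints 20 characters in which 'a' appears roughly 20% of the time and 'b' and 'c' roughly 40% each.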
I have implemented this in the example code below. The train() procedure generates, for every character of the training input, a cumulative probability distribution of the characters that follow it. The predict() procedure applies it to generate random text.
For a complete implementation, this still lacks:
- A representation of the probability distribution of initial characters. As you can see in the main() function, my output always starts with 't'. (A possible sketch for this follows right after this list.)
- A representation of the output string length, or of terminal characters. main() always generates a string of length 100.
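Regarding the first missing piece, one crude option would be to draw the initial character from the overall character frequencies of the training text. The helper below is only a sketch of that idea under my own naming (initial_distribution is not part of the code further down); the vector it returns could be sampled with the same lower_bound lookup used in predict():
#include <map>
#include <utility>
#include <vector>

/* Sketch: build a cumulative distribution over all characters of the
   training text. The initial character of the output could then be drawn
   from it instead of being hard-coded. */
template <typename Char, typename Iterator>
std::vector<std::pair<Char,float>> initial_distribution(Iterator from, Iterator to)
{
  std::map<Char,unsigned> freq {};
  unsigned total = 0;
  for ( ; from != to; ++from) {
    freq[*from] += 1;
    ++total;
  }

  std::vector<std::pair<Char,float>> dist {};
  float probability { 0.0 };
  for (const auto &p : freq) {
    /* '+=' turns the per-character probability into a cumulative one. */
    probability += static_cast<float>(p.second) / total;
    dist.push_back(std::make_pair(p.first,probability));
  }
  return dist;
}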
The code was tested with GCC 4.7.0 (C++11 options) on Linux. Example output is shown below the code.
#include <iostream>
#include <string>
#include <vector>
#include <utility>
#include <map>
#include <numeric>
#include <algorithm>
#include <random>

template <typename Char>
class Markov
{
public:
  /* Data type used to count the frequencies (integer!) of
     characters. */
  typedef std::map<Char,unsigned> CharDistributionMap;

  /* Data type used to represent a cumulative probability (float!)
     distribution. */
  typedef std::vector<std::pair<Char,float>> CharDistribution;

  /* Data type used to represent the Markov model. Each character is
     mapped to a probability distribution of the characters that follow
     it. */
  typedef std::map<Char,CharDistribution> MarkovModel;

  /* The model. */
  MarkovModel _model;

  /* Training procedure. */
  template <typename Iterator>
  void train(Iterator from, Iterator to)
  {
    _model = {};
    if (from == to)
      return;

    std::map<Char,CharDistributionMap> proto_model {};

    /* Count frequencies. */
    Char current = *from;
    while (true) {
      ++from;
      if (from == to)
        break;
      Char next = *from;
      proto_model[current][next] += 1;
      current = next;
    }

    /* Transform into probability distribution. */
    for (const auto &entry : proto_model) {
      const Char current = entry.first;
      const CharDistributionMap &freq = entry.second;

      /* Calculate total frequency of current character. */
      unsigned total =
        std::accumulate(std::begin(freq),std::end(freq),0,
                        [](unsigned res,const std::pair<Char,unsigned> &p){
                          return res += p.second;
                        });

      /* Determine the probability distribution of characters that
         follow the current character. This is calculated as a cumulative
         probability. */
      CharDistribution dist {};
      float probability { 0.0 };
      std::for_each(std::begin(freq),std::end(freq),
                    [total,&probability,&dist](const std::pair<Char,unsigned> &p){
                      // using '+=' to get cumulative probability:
                      probability += static_cast<float>(p.second) / total;
                      dist.push_back(std::make_pair(p.first,probability));
                    });

      /* Add probability distribution for current character to the model. */
      _model[current] = dist;
    }
  }

  /* Predict the next character, assuming that training has been
     performed. */
  template <typename RandomNumberGenerator>
  Char predict(RandomNumberGenerator &gen, const Char current)
  {
    static std::uniform_real_distribution<float> generator_dist { 0, 1 };

    /* Assume that the current character is known to the model. Otherwise,
       an std::out_of_range exception will be thrown. */
    const CharDistribution &dist { _model.at(current) };

    /* Generate random number between 0 and 1. */
    float random { generator_dist(gen) };

    /* Pick the first character whose cumulative probability is not
       smaller than the random number generated. */
    auto res =
      std::lower_bound(std::begin(dist),std::end(dist),
                       std::make_pair(Char(),random),
                       [](const std::pair<Char,float> &p1, const std::pair<Char,float> &p2) {
                         return (p1.second < p2.second);
                       });
    if (res == std::end(dist))
      throw "Empty probability distribution. This should not happen.";
    return res->first;
  }
};

int main()
{
  /* Initialize random-number generator. */
  std::random_device rd;
  std::mt19937 gen(rd());

  std::string input { "this is some input text with many spaces." };

  if (input.empty())
    return 1;

  /* We append the first character to the end, to ensure that even the
     last character of the text gets a non-empty probability
     distribution. A more proper way of dealing with characters that
     have empty distributions would be _smoothing_. */
  input += input[0];

  Markov<char> markov {};
  markov.train(std::begin(input),std::end(input));

  /* We set the initial character. In a real stochastic model, there
     would have to be a separate probability distribution for initial
     characters and we would choose the initial character randomly,
     too. */
  char current_char { 't' };

  for (unsigned i = 0 ; i < 100 ; ++i) {
    std::cout << current_char;
    current_char = markov.predict(gen,current_char);
  }
  std::cout << current_char << std::endl;
}
Some example output produced by this program:
t mext s.t th winy iny somaces sputhis inpacexthispace te iny me mext mexthis
tes is manputhis.th is wis.th with it is is.t s t winy it mext is ispany
this maces somany t s it this winy sputhisomacext manput somanputes macexte iso
t wispanpaces maces tesomacexte s s mes.th isput t wit t somanputes s withit sput ma
As you can see, the distribution of whitespace characters naturally follows the distribution found in the input text.