
I am writing a program that generates random text based on a Markov model. I have run into a problem: some files have many spaces between words, and the initial seed is taken to be a space. The problem is that every following character is then also treated as a space, so the generated random text is just a blank document, because nextChosenChar is always a space.

Can anyone suggest a way to solve this problem?

I tried to come up with a solution, shown in the second half of the code below, but to no avail.

char ChooseNextChar(string seed, int order, string fileName){
    Map<string, Vector<char> > nextCharMap;
    ifstream inputStream;
    inputStream.open(fileName.c_str());
    int offset = 0;
    Vector<char> charsFollingSeedVector;
    inputStream.clear();
    char* buffer = new char [order + 1];
    char charFollowingSeed;
    static int consecutiveSpaces = 0;
    while (!inputStream.eof()) {    
        inputStream.seekg(offset);
        inputStream.read(buffer, order + 1);
        string key(buffer, order);
        if (equalsIgnoreCase(key, seed)) {
            //only insert key if not present otherwise overwriting old info 
            if (!nextCharMap.containsKey(seed)) {
                nextCharMap.put(seed, charsFollingSeedVector);
            }
            //read the char directly following seed
            charFollowingSeed = buffer[order];
            nextCharMap[seed].push_back(charFollowingSeed);
        }
        offset++;
    }
    //case where no chars following seed
    if (nextCharMap[seed].isEmpty()) {
        return EOF;
    }
    //determine which is the most frequent following char
    char nextChosenChar = MostFequentCharInVector(seed, nextCharMap);

    //TRYING TO FIX PROBLEM OF ONLY OUTPUTTING SPACES**********
     if (nextChosenChar == ' ') {
        consecutiveSpaces++;
        if (consecutiveSpaces >= 1) {
            nextChosenChar = nextCharMap[seed].get(randomInteger(0, nextCharMap[seed].size()-1));
            consecutiveSpaces = 0;
        }
    }
    return nextChosenChar;
}

2 Answers


If you really want a character-based model, you won't get natural-looking text as output, but it is definitely possible, and that model is fundamentally able to deal with sequences of space characters, too. There is no need to remove them from the input if you consider them a natural part of the text.

What is important is that a Markov model does not always fall back to predicting the one character that has the highest probability at any given stage. Instead, it must look at the entire probability distribution of possible characters and pick one at random.

Here, random means that it picks a character not predetermined by the programmer. Nevertheless, the random distribution is not a uniform distribution, i.e. not all characters are equally likely. It must take the relative probabilities of the various possible characters into account. One way to do this is to generate a cumulative probability distribution of the characters. For example, if the probabilities are

p('a') == 0.2
p('b') == 0.4
p('c') == 0.4

we represent them cumulatively as

p('a') == 0.2
p('b') == p('a') + 0.4 == 0.6
p('c') == p('a') + p('b') == 1.0

Then, to generate a random character, we first generate a uniformly distributed random number N between 0 and 1, and then pick the first character whose cumulative probability is not less than N.

I have implemented this in the example code below. The train() procedure generates, for each character of the training input, a cumulative probability distribution of the characters that follow it. The predict() procedure applies it to generate random text.

For a complete implementation, this still lacks:

  • A representation of the probability distribution of the initial character. As you can see in the main() function, my output always starts with 't'.
  • A representation of the length of the output string, or of a final character. main() always generates a string of length 100.

The code was tested with GCC 4.7.0 (C++11 options) on Linux. Example output below.

#include <iostream>
#include <string>
#include <vector>
#include <utility>
#include <map>
#include <numeric>
#include <algorithm>
#include <random>

template <typename Char>
class Markov
{
public:
  /* Data type used to count the frequencies (integer!) of
     characters. */
  typedef std::map<Char,unsigned>            CharDistributionMap;

  /* Data type used to represent a cumulative probability (float!)
     distribution. */
  typedef std::vector<std::pair<Char,float>> CharDistribution;

  /* Data type used to represent the Markov model. Each character is
     mapped to a probability distribution of the characters that follow
     it. */
  typedef std::map<Char,CharDistribution>    MarkovModel;


  /* The model. */
  MarkovModel  _model;

  /* Training procedure. */
  template <typename Iterator>
  void train(Iterator from, Iterator to)
  {
    _model = {};
    if (from == to)
      return;

    std::map<Char,CharDistributionMap> proto_model {};

    /* Count frequencies. */
    Char current = *from;
    while (true) {
      ++from;
      if (from == to)
        break;
      Char next = *from;
      proto_model[current][next] += 1;
      current = next;
    }

    /* Transform into probability distribution. */
    for (const auto &entry : proto_model) {
      const Char current              = entry.first;
      const CharDistributionMap &freq = entry.second;

      /* Calculate total frequency of current character. */
      unsigned total =
         std::accumulate(std::begin(freq),std::end(freq),0,
           [](unsigned res,const std::pair<Char,unsigned> &p){
                   return res += p.second;
               });

      /* Determine the probability distribution of characters that
         follow the current character. This is calculated as a cumulative
         probability. */
      CharDistribution dist {};
      float probability { 0.0 };
      std::for_each(std::begin(freq),std::end(freq),
             [total,&probability,&dist](const std::pair<Char,unsigned> &p){
                   // using '+=' to get cumulative probability:
                   probability += static_cast<float>(p.second) / total; 
                   dist.push_back(std::make_pair(p.first,probability));
             });

      /* Add probability distribution for current character to the model. */
      _model[current] = dist;
    }
  }


  /* Predict the next character, assuming that training has been
     performed. */
  template <typename RandomNumberGenerator>
  Char predict(RandomNumberGenerator &gen, const Char current)
  {
    static std::uniform_real_distribution<float> generator_dist { 0, 1 };

    /* Assume that the current character is known to the model. Otherwise,
       an std::out_of_range exception will be thrown. */
    const CharDistribution &dist { _model.at(current) };

    /* Generate random number between 0 and 1. */
    float random { generator_dist(gen) };

    /* Identify the first character whose cumulative probability is
       not less than the random number generated. */
    auto res =
         std::lower_bound(std::begin(dist),std::end(dist),
                          std::make_pair(Char(),random),
             [](const std::pair<Char,float> &p1, const std::pair<Char,float> &p2) {
                    return (p1.second < p2.second);
             });
    if (res == std::end(dist))
      throw "Empty probability distribution. This should not happen.";
    return res->first;
  }

};

int main()
{
  /* Initialize random-number generator. */
  std::random_device rd;
  std::mt19937 gen(rd());


  std::string input { "this   is    some   input text   with   many spaces." };

  if (input.empty())
    return 1;

  /* We append the first character to the end, to ensure that even the
     last character of the text gets a non-empty probability
     distribution. A more proper way of dealing with characters that
     have empty distributions would be _smoothing_. */
  input += input[0];

  Markov<char> markov {};
  markov.train(std::begin(input),std::end(input));

  /* We set the initial character. In a real stochastic model, there
     would have to be a separate probability distribution for the
     initial character, and we would choose the initial character
     randomly, too. */
  char current_char { 't' };

  for (unsigned i = 0 ; i < 100 ; ++i) {
    std::cout << current_char;
    current_char = markov.predict(gen,current_char);
  }
  std::cout << current_char << std::endl;
}

Some example output generated by this program:

t  mext s.t th   winy  iny  somaces      sputhis inpacexthispace te  iny            me   mext mexthis

tes    is  manputhis.th is  wis.th with it    is  is.t  s   t   winy    it mext    is        ispany

this  maces      somany  t    s        it this  winy sputhisomacext manput    somanputes  macexte iso

t   wispanpaces maces  tesomacexte s  s  mes.th     isput t wit   t   somanputes   s  withit  sput ma

As you can see, the distribution of space characters naturally follows the distribution in the input text.

answered 2012-08-22T04:05:19

One solution is to stream characters from the file one at a time, so that your reading loop looks more like this:

char* buffer = new char[order];
inputStream.read(buffer, order);

char next_char;
while ( inputStream.get(next_char) )
{
   string key(buffer, order);
   if (equalsIgnoreCase(key, seed)) {
      // only insert key if not present, otherwise overwriting old info
      if (!nextCharMap.containsKey(seed)) {
         nextCharMap.put(seed, Vector<char>());
      }
      nextCharMap[seed].push_back(next_char);
   }
   // Update the buffer: shift left and append the new character.
   for (unsigned int i = 1; i < order; ++i) buffer[i-1] = buffer[i];
   buffer[order-1] = next_char;
}
delete[] buffer;

Then you can discard the extra spaces like this:

....
while ( inputStream.get(next_char) )
{
   // Remove multiple spaces from the input.
   if (next_char == ' ' && buffer[order-1] == ' ') continue;

   string key(buffer, order);
   ....
answered 2012-08-22T02:36:21