c++ - Why does my finite state machine take so long to execute?

Question

I'm working on a state machine which is supposed to extract function calls of the form

/* I am a comment */
//I am a comment
pref("this.is.a.string.which\"can have QUOTES\"", 123456);

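The extracted data would be pref("this.is.a.string.which\"can have QUOTES\"", 123456); from wherever it occurs in the file. Currently, processing a 41 KB file takes nearly a minute and a half. Is there something I'm seriously misunderstanding about this finite state machine?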

#include <algorithm>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
std::vector<std::string> Foo()
{
    std::string fileData;
    //Fill filedata with the contents of a file
    std::vector<std::string> results;
    std::string::iterator begin = fileData.begin();
    std::string::iterator end = fileData.end();
    std::string::iterator stateZeroFoundLocation = fileData.begin();
    std::size_t state = 0;
    for(; begin < end; begin++)
    {
        switch (state)
        {
        case 0:
            if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {
                stateZeroFoundLocation = begin;
                begin += 4;
                state = 2;
            } else if (*begin == '/')
                state = 1;
            break;
        case 1:
            state = 0;
            switch (*begin)
            {
            case '*':
                begin = boost::find_first(boost::make_iterator_range(begin, end), "*/").end();
                break;
            case '/':
                begin = std::find(begin, end, '\n'); // narrow char literal: fileData is a std::string
            }
            break;
        case 2:
            if (*begin == '"')
                state = 3;
            break;
        case 3:
            switch(*begin)
            {
            case '\\':
                state = 4;
                break;
            case '"':
                state = 5;
            }
            break;
        case 4:
            state = 3;
            break;
        case 5:
            if (*begin == ',')
                state = 6;
            break;
        case 6:
            if (*begin != ' ')
                state = 7;
            break;
        case 7:
            switch(*begin)
            {
            case '"':
                state = 8;
                break;
            default:
                state = 10;
                break;
            }
            break;
        case 8:
            switch(*begin)
            {
            case '\\':
                state = 9;
                break;
            case '"':
                state = 10;
            }
            break;
        case 9:
            state = 8;
            break;
        case 10:
            if (*begin == ')')
                state = 11;
            break;
        case 11:
            if (*begin == ';')
                state = 12;
            break;
        case 12:
            state = 0;
            results.push_back(std::string(stateZeroFoundLocation, begin));
        };
    }
    return results;
}

Billy3

EDIT: Well, this is one of the strangest things I've ever seen. I just rebuilt the project and it runs at a reasonable speed again. Odd.

score 3 · Accepted Answer

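Unless your 41 KB file is mostly comments or prefs, it is going to spend most of its time in state 0. And for each character in state 0, you make at least two function calls.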

if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

You can speed that up by pre-testing whether the current character is 'p':

if (*begin == 'p' && boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

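If the character isn't 'p', there is no need to make any function calls. In particular, no iterator range is created, which is probably where the time is being spent.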
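As a rough sketch of that pre-test in isolation, assuming the file contents sit in a std::string: the helper name atPref is hypothetical and not from the question, and it uses std::string::compare instead of Boost only so that it compiles on its own. The cheap character comparison rejects most positions before any further work happens.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the question): true when "pref(" begins at
// position pos of data. The single-character test comes first, so compare()
// only runs at positions that actually hold a 'p'.
bool atPref(const std::string& data, std::size_t pos)
{
    static const std::string token = "pref(";
    return pos < data.size()
        && data[pos] == 'p'
        && data.compare(pos, token.size(), token) == 0;
}
```

In the question's loop, a check like this would replace the starts_with call in case 0, with pos computed as begin - fileData.begin().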

score 1 · Accepted Answer

I don't know if this is part of the problem, but you have a typo in case 0: "perf" is misspelled as "pref".

score 1 · Accepted Answer

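Well, it's hard to say just by looking at it... but I'd guess the find algorithms are responsible. Why are you searching inside an FSM? By definition you are supposed to feed it one character at a time... add more states instead. Also try making results a list rather than a vector. A lot of copying is going on with

vector<string>

But mainly: profile it!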
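To illustrate the "one character at a time, add more states" idea, here is a minimal sketch that handles comments purely with states and no searching. The function name stripComments and the state names are assumptions made for this example; it only removes comments and is not a rewrite of the question's full extractor.

```cpp
#include <string>

// Sketch: drop // and /* */ comments by consuming exactly one character per
// step. String literals get their own states so that "//" inside quotes is
// left untouched.
std::string stripComments(const std::string& input)
{
    enum State { Code, Slash, LineComment, BlockComment, BlockStar,
                 InString, StringEscape };
    State state = Code;
    std::string out;
    out.reserve(input.size());
    for (char c : input) {
        switch (state) {
        case Code:
            if (c == '/') state = Slash;              // possible comment start
            else { out += c; if (c == '"') state = InString; }
            break;
        case Slash:
            if (c == '/') state = LineComment;
            else if (c == '*') state = BlockComment;
            else {                                    // it was a plain '/'
                out += '/';
                out += c;
                state = (c == '"') ? InString : Code;
            }
            break;
        case LineComment:
            if (c == '\n') { out += c; state = Code; } // keep the newline
            break;
        case BlockComment:
            if (c == '*') state = BlockStar;
            break;
        case BlockStar:
            if (c == '/') state = Code;
            else if (c != '*') state = BlockComment;   // stay here on "**"
            break;
        case InString:
            out += c;
            if (c == '\\') state = StringEscape;
            else if (c == '"') state = Code;
            break;
        case StringEscape:
            out += c;                                  // escaped char, e.g. \"
            state = InString;
            break;
        }
    }
    if (state == Slash) out += '/';                    // trailing lone slash
    return out;
}
```

Each input character is examined once, so the running time grows linearly with the file size regardless of how many comments the file contains.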

score 0 · Accepted Answer

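Finite state machines are a workable solution, but for text processing it is best to use a highly optimized finite state machine generator. In this case, that means a regular expression. Here it is as a Perl regex: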

# first clean the comments
$source =~ s|//.*$||gm;      # replace every "// till end of line" with nothing
$source =~ s|/\*.*?\*/||gs;  # replace every "/* any text until */" with nothing
                             # depending on your data, you may need a few other
                             # rules here to avoid blanking data, you could replace
                             # the comments with a unique identifier, and then
                             # expand any identifiers that the regex below returns

# then find your data
while ($source =~ /pref\(\s*"(.+?)",\s*(\d+)\s*\);/g) {
   # matches your function signature and moves along source
   # do something with the captured groups, in this case $1 and $2
}

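Since most regex libraries are Perl-compatible, translating the syntax shouldn't be hard. If your search gets more complicated, a parser would be in order.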
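As one possible translation into C++, the sketch below uses std::regex with its default ECMAScript grammar. The function name extractPrefs is made up for this example, the pattern looks for pref( as in the question's sample data, and it inherits the caveat from the Perl comments: the naive comment removal assumes that string arguments never contain "//" or "/*".

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Sketch: returns (string argument, numeric argument) pairs for every
// pref("...", 123); call found in source.
// Assumption: no "//" or "/*" occurs inside the quoted arguments, because
// the comment removal below does not understand string literals.
std::vector<std::pair<std::string, std::string>> extractPrefs(std::string source)
{
    static const std::regex lineComment(R"(//.*)");           // '.' stops at newline
    static const std::regex blockComment(R"(/\*[\s\S]*?\*/)"); // may span lines
    static const std::regex call(R"rx(pref\(\s*"(.+?)",\s*(\d+)\s*\);)rx");

    // First clean the comments, in the same order as the Perl version.
    source = std::regex_replace(source, lineComment, "");
    source = std::regex_replace(source, blockComment, "");

    // Then walk over every match and collect the two captured groups.
    std::vector<std::pair<std::string, std::string>> results;
    for (std::sregex_iterator it(source.begin(), source.end(), call), last;
         it != last; ++it) {
        results.emplace_back((*it)[1].str(), (*it)[2].str());
    }
    return results;
}
```

The captured string keeps its backslash escapes exactly as they appear in the file; unescaping them would be a separate step.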

score 0 · Accepted Answer

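If you are parsing, why not use a parser library?

Boost.Spirit.Qi typically comes to mind.

You express your grammar with EBNF-like expressions, which certainly makes it easier to maintain.
It's a header-only library, so you shouldn't have any trouble integrating it into your build.

While I can appreciate the minimalist approach, I'm afraid that your idea of coding a finite state machine on your own is not that wise. It works well with a toy example, but as the requirements pile up you'll end up with one monstrous switch, and it will become harder and harder to understand what is going on.

And please, don't tell me you know it's not going to evolve: I don't believe in oracles ;)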
