c++ - Different behavior in C regex vs. C++11 regex

Question

I need code that splits a permutation written in mathematical notation into its elements. Suppose we have this permutation:

[permutation]

The permutation string will be:

"(1,2,5)(3,4)" or "(3,4)(1,2,5)" or "(3,4)(5,1,2)"

The patterns I tried are these:

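([0-9]+[ ]*,[ ]*)*[0-9]+ for each permutation cycle. This splits the string "(1,2,5)(3,4)" into two strings, "1,2,5" and "3,4".
([0-9]+) for each element within a cycle. This splits each cycle into its individual numbers.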

When I tried these patterns on this page, they worked well. I have also used them with the C++11 regex library, with good results:

#include <iostream>
#include <string>

#include <regex>

void elements(const std::string &input)
{
    const std::regex ElementRegEx("[0-9]+");

    for (std::sregex_iterator Element(input.begin(), input.end(), ElementRegEx); Element != std::sregex_iterator(); ++Element)
    {
        const std::string CurrentElement(*Element->begin());
        std::cout << '\t' << CurrentElement << '\n';
    }
}

void cycles(const std::string &input)
{
    const std::regex CycleRegEx("([0-9]+[ ]*,[ ]*)*[0-9]+");

    for (std::sregex_iterator Cycle(input.begin(), input.end(), CycleRegEx); Cycle != std::sregex_iterator(); ++Cycle)
    {
        const std::string CurrentCycle(*Cycle->begin());
        std::cout << CurrentCycle << '\n';

        elements(CurrentCycle);
    }
}

int main(int argc, char **argv)
{
    std::string input("(1,2,5)(3,4)");

    std::cout << "input: " << input << "\n\n";

    cycles(input);
    return 0;
}

Output when compiled with Visual Studio 2010 (10.0):

input: (1,2,5)(3,4)

1,2,5
    1
    2
    5
3,4
    3
    4

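Unfortunately, I cannot use C++11 facilities in my project: it will run on a Linux platform and must be compiled with gcc 4.2.3, so I am forced to use the C regex library from the regex.h header. With the same patterns but a different library, I get different results.

Here is the test code: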

#include <iostream>
#include <string>

#include <regex.h>

// Assumed value: the original post does not show the definition of MAX_MATCHES.
#define MAX_MATCHES 10

void elements(const std::string &input)
{
    regex_t ElementRegEx;
    regcomp(&ElementRegEx, "([0-9]+)", REG_EXTENDED);

    regmatch_t ElementMatches[MAX_MATCHES];
    if (!regexec(&ElementRegEx, input.c_str(), MAX_MATCHES, ElementMatches, 0))
    {
        int Element = 0;

        while ((ElementMatches[Element].rm_so != -1) && (ElementMatches[Element].rm_eo != -1))
        {
            regmatch_t &ElementMatch = ElementMatches[Element];
            const std::string CurrentElement(input.substr(ElementMatch.rm_so, ElementMatch.rm_eo - ElementMatch.rm_so));
            std::cout << '\t' << CurrentElement << '\n';

            ++Element;
        }
    }

    regfree(&ElementRegEx);
}

void cycles(const std::string &input)
{
    regex_t CycleRegEx;
    regcomp(&CycleRegEx, "([0-9]+[ ]*,[ ]*)*[0-9]+", REG_EXTENDED);

    regmatch_t CycleMatches[MAX_MATCHES];
    if (!regexec(&CycleRegEx, input.c_str(), MAX_MATCHES, CycleMatches, 0))
    {
        int Cycle = 0;

        while ((CycleMatches[Cycle].rm_so != -1) && (CycleMatches[Cycle].rm_eo != -1))
        {
            regmatch_t &CycleMatch = CycleMatches[Cycle];
            const std::string CurrentCycle(input.substr(CycleMatch.rm_so, CycleMatch.rm_eo - CycleMatch.rm_so));
            std::cout << CurrentCycle << '\n';

            elements(CurrentCycle);
            ++Cycle;
        }
    }

    regfree(&CycleRegEx);
}

int main(int argc, char **argv)
{
    std::string input("(1,2,5)(3,4)");

    std::cout << "input: " << input << "\n\n";

    cycles(input);
    return 0;
}

The expected output is the same as with the C++11 regex library, but the actual output is:

input: (1,2,5)(3,4)

1,2,5
    1
    1
2,
    2
    2

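Finally, the questions are:

Can someone give me a hint about where I am misunderstanding the C regex engine?
Why does the C regex behave differently from the C++ regex?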

score 2 · Accepted Answer

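You are misunderstanding what regexec returns. The pmatch buffer (after pmatch[0]) is filled with the sub-matches of the regular expression, not with successive matches in the string.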

For example, if your regex is [a-z]([+ ])([0-9]) and it matches x+5, then pmatch[0] will refer to x+5 (the whole match), while pmatch[1] and pmatch[2] will refer to + and 5 respectively.
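To make the layout of the pmatch buffer concrete, here is a minimal sketch that runs a single regexec call and collects every slot as a string. The helper name submatches is hypothetical (it does not appear in the original post). The sketch assumes a POSIX regex.h and sizes the buffer from re_nsub, the group count that regcomp stores in regex_t.

```cpp
#include <regex.h>

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): runs ONE regexec() call
// and returns the text of every pmatch slot. Slot 0 is the whole match,
// slots 1..n are the parenthesised groups of the pattern.
// Returns an empty vector if the pattern fails to compile or does not match.
std::vector<std::string> submatches(const char *pattern, const std::string &input)
{
    std::vector<std::string> result;

    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return result;

    // re_nsub is the number of parenthesised groups; +1 for the whole match.
    std::vector<regmatch_t> pmatch(re.re_nsub + 1);

    if (regexec(&re, input.c_str(), pmatch.size(), &pmatch[0], 0) == 0)
    {
        for (std::size_t i = 0; i < pmatch.size(); ++i)
        {
            if (pmatch[i].rm_so == -1)  // this group did not take part in the match
                result.push_back("");
            else
                result.push_back(input.substr(pmatch[i].rm_so,
                                              pmatch[i].rm_eo - pmatch[i].rm_so));
        }
    }

    regfree(&re);
    return result;
}
```

Calling submatches("[a-z]([+ ])([0-9])", "x+5") should yield the three strings x+5, + and 5. With the cycle pattern from the question and the input "(1,2,5)(3,4)", slot 0 is 1,2,5 and slot 1 holds the text captured by the repeated group, which is consistent with the stray "2," line in the question's output. The second cycle, 3,4, never appears, because one regexec call reports only one match.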

You need to call regexec repeatedly in a loop, each time starting from the end of the previous match:

int start = 0;
while (!regexec(&ElementRegEx, input.c_str() + start, MAX_MATCHES, ElementMatches, 0))
{
    regmatch_t &ElementMatch = ElementMatches[0];
    std::string CurrentElement(input.substr(start + ElementMatch.rm_so, ElementMatch.rm_eo - ElementMatch.rm_so));
    std::cout << '\t' << CurrentElement << '\n';
    start += ElementMatch.rm_eo;
}
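The loop above can be wrapped into a reusable helper so that both cycles and elements get the same "iterate over all matches" behavior that std::sregex_iterator provides. The following is only a sketch under some assumptions: the names find_all and split_permutation are hypothetical, and the REG_NOTBOL flag and the empty-match guard are additions not present in the answer. Neither is needed for the patterns in the question, but they keep the helper safe for patterns that use ^ or can match an empty string.

```cpp
#include <regex.h>

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): returns every successive,
// non-overlapping match of `pattern` in `input`, in order of appearance.
std::vector<std::string> find_all(const char *pattern, const std::string &input)
{
    std::vector<std::string> matches;

    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return matches;  // pattern failed to compile

    std::string::size_type start = 0;
    regmatch_t whole;  // only pmatch[0], the whole match, is needed here

    while (start <= input.size())
    {
        // After the first call the search no longer begins at the start of
        // the string, so REG_NOTBOL keeps '^' from matching there.
        const int flags = (start == 0) ? 0 : REG_NOTBOL;
        if (regexec(&re, input.c_str() + start, 1, &whole, flags) != 0)
            break;

        // Offsets are relative to the pointer passed in, hence "start +".
        matches.push_back(input.substr(start + whole.rm_so, whole.rm_eo - whole.rm_so));

        // Resume after this match; step one extra character past an empty
        // match so the loop always makes progress.
        start += (whole.rm_eo > whole.rm_so) ? whole.rm_eo : whole.rm_eo + 1;
    }

    regfree(&re);
    return matches;
}

// Same two-level split as in the question: cycles first, then elements.
std::vector<std::vector<std::string> > split_permutation(const std::string &input)
{
    std::vector<std::vector<std::string> > result;
    const std::vector<std::string> cycles = find_all("([0-9]+[ ]*,[ ]*)*[0-9]+", input);
    for (std::size_t i = 0; i < cycles.size(); ++i)
        result.push_back(find_all("[0-9]+", cycles[i]));
    return result;
}
```

For the input "(1,2,5)(3,4)", find_all with the cycle pattern should return 1,2,5 and 3,4, and split_permutation should return the elements 1, 2, 5 and 3, 4 grouped per cycle. The code avoids C++11 features so that it should also build with an older compiler such as the gcc 4.2.3 mentioned in the question.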
