c++ - Regex character class subtraction in C++

Question

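I am writing a C++ program that needs to take regular expressions defined in an XML Schema file and use them to validate XML data. The problem is that C++ does not seem to directly support the regex flavor that XML Schema uses.

For example, there are a couple of special character classes, \i and \c, that are not defined by default, and the XML Schema regex language also supports something called "character class subtraction", which does not seem to be supported in C++.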

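Allowing the \i and \c special character classes is fairly simple: I can just look for "\i" or "\c" in the regular expression and replace them with their expanded versions. Getting character class subtraction to work is a much more daunting problem...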
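The textual replacement described above could look roughly like the sketch below. The helper name expand_xml_escapes is hypothetical, and the two expansions are ASCII-only approximations chosen for brevity (an assumption): the real \i and \c classes cover a much larger part of Unicode. The class tracking is also deliberately simple and does not understand nested classes such as those produced by subtraction.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: expands the XML Schema escapes \i and \c into
// ASCII-only approximations. Real XML name characters cover far more of
// Unicode; these sets are an assumption made to keep the sketch short.
std::string expand_xml_escapes(const std::string& pattern)
{
    const std::string initial_chars = "A-Za-z_:";       // approximates \i
    const std::string name_chars = "A-Za-z0-9_:.\\-";   // approximates \c

    std::string out;
    bool in_class = false;  // true while between '[' and ']' (no nesting support)
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char ch = pattern[i];
        if (ch == '\\' && i + 1 < pattern.size()) {
            const char next = pattern[++i];
            if (next == 'i' || next == 'c') {
                const std::string& set = (next == 'i') ? initial_chars : name_chars;
                if (in_class) {
                    out += set;              // splice into the enclosing class
                } else {
                    out += "[" + set + "]";  // stand-alone escape gets its own class
                }
            } else {
                out += ch;                   // any other escape is copied verbatim
                out += next;
            }
            continue;
        }
        if (ch == '[') {
            in_class = true;
        } else if (ch == ']') {
            in_class = false;
        }
        out += ch;
    }
    return out;
}
```

Scanning escapes as pairs means an escaped backslash followed by a literal i (as in a\\i) is left alone, which a plain search-and-replace on "\i" would get wrong.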

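For example, this regular expression, which is valid in an XML Schema definition, throws an exception in C++ saying it has unbalanced square brackets.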

#include <iostream>
#include <regex>

int main()
{
    try
    {
        // Match any lowercase letter that is not a vowel
        std::regex rx("[a-z-[aeiuo]]");
    }
    catch (const std::regex_error& ex)
    {
        std::cout << ex.what() << std::endl;
    }
}

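How can I get C++ to recognize character class subtraction within a regular expression? Or even better, is there a way to use the XML Schema flavor of regular expressions directly in C++?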

score 3 · Accepted Answer

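Character class subtraction and intersection are not available in any of the grammars supported by std::regex, so you will have to rewrite the expression into one of the supported forms.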

The simplest way is to perform the subtraction yourself and pass the resulting set to std::regex, for instance [bcdfghjklmnpqrstvwxyz] for your example.
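For simple ranges, that manual subtraction can be automated. The following is only a sketch: subtract_from_range is a hypothetical helper, and it assumes the range contains plain alphanumeric characters so that nothing in the result needs escaping.

```cpp
#include <string>

// Hypothetical helper: builds a bracket expression containing every
// character in [first, last] that does not appear in `excluded`.
// Assumes plain alphanumeric characters, so nothing needs escaping.
std::string subtract_from_range(char first, char last, const std::string& excluded)
{
    std::string result = "[";
    for (int c = static_cast<unsigned char>(first);
         c <= static_cast<unsigned char>(last); ++c) {
        const char ch = static_cast<char>(c);
        if (excluded.find(ch) == std::string::npos) {
            result += ch;  // keep only characters that were not subtracted
        }
    }
    result += ']';
    return result;
}
```

Calling subtract_from_range('a', 'z', "aeiou") produces the consonant set shown above, which can then be passed to std::regex as an ordinary character class.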

Another solution is to look for a more capable regex engine, or for a dedicated XML library that supports XML Schema and its regex language.

score 2 · Accepted Answer

Starting from the cppreference example:

#include <iostream>
#include <regex>
#include <string>
 
void show_matches(const std::string& in, const std::string& re)
{
    std::smatch m;
    std::regex_search(in, m, std::regex(re));
    if(m.empty()) {
        std::cout << "input=[" << in << "], regex=[" << re << "]: NO MATCH\n";
    } else {
        std::cout << "input=[" << in << "], regex=[" << re << "]: ";
        std::cout << "prefix=[" << m.prefix() << "] ";
        for(std::size_t n = 0; n < m.size(); ++n)
            std::cout << " m[" << n << "]=[" << m[n] << "] ";
        std::cout << "suffix=[" << m.suffix() << "]\n";
    }
}
 
int main()
{
    // greedy match: repeats (?![aeiou])[a-z] between 2 and 4 times
    show_matches("abcdefghi", "(?:(?![aeiou])[a-z]){2,4}");
}

You can test the regular expression and inspect its details here.

I chose a non-capturing group (?: ...) so that it does not change your group numbering if you use it inside a larger regular expression.

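(?![aeiou]) is a negative lookahead: it succeeds, without consuming input, only if the next character is not one of [aeiou]. [a-z] then matches a lowercase letter. Combining these two conditions is equivalent to your character class subtraction.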

The {2,4} is a quantifier meaning from 2 to 4 repetitions; it could also be + for one or more, or * for zero or more.
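As a quick sanity check of that equivalence, one could test the lookahead form against single characters. This is a minimal sketch; the function name is_consonant_by_lookahead is made up for illustration.

```cpp
#include <regex>
#include <string>

// Returns true when `ch` is accepted by the lookahead form of [a-z-[aeiou]].
// The function name is hypothetical; the pattern is the one discussed above,
// without the {2,4} quantifier.
bool is_consonant_by_lookahead(char ch)
{
    static const std::regex rx("(?:(?![aeiou])[a-z])");
    return std::regex_match(std::string(1, ch), rx);
}
```

Every lowercase letter except a, e, i, o and u should be accepted, and anything outside a-z should be rejected, which is the behaviour the XML Schema form [a-z-[aeiou]] describes.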

Edit

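Reading the comments on the other answer, I understand that you want to support XML Schema.

The next program shows how to use an ECMAScript regex to translate character class subtraction into an ECMAScript-compatible form.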

#include <iostream>
#include <regex>
#include <string>
#include <vector>

std::string translated_regex(const std::string &pattern){
    // pattern to identify character class subtraction
    std::regex class_subtraction_re(
       "\\[((?:\\\\[\\[\\]]|[^[\\]])*)-\\[((?:\\\\[\\[\\]]|[^[\\]])*)\\]\\]"
    );
    // translate the regular expression to ECMA compatible
    std::string translated = std::regex_replace(pattern, 
       class_subtraction_re, "(?:(?![$2])[$1])");
    return translated;
}
void show_matches(const std::string& in, const std::string& re)
{
    std::smatch m;
    std::regex_search(in, m, std::regex(re));
    if(m.empty()) {
        std::cout << "input=[" << in << "], regex=[" << re << "]: NO MATCH\n";
    } else {
        std::cout << "input=[" << in << "], regex=[" << re << "]: ";
        std::cout << "prefix=[" << m.prefix() << "] ";
        for(std::size_t n = 0; n < m.size(); ++n)
            std::cout << " m[" << n << "]=[" << m[n] << "] ";
        std::cout << "suffix=[" << m.suffix() << "]\n";
    }
}



int main()
{
    std::vector<std::string> tests = {
        "Some text [0-9-[4]] suffix", 
        "([abcde-[ae]])",
        "[a-z-[aei]]|[A-Z-[OU]] "
    };
    std::string re = translated_regex("[a-z-[aeiou]]{2,4}");
    show_matches("abcdefghi", re);
    
    for(std::string test : tests){
       std::cout << " " << test << '\n' 
        << "   -- " << translated_regex(test) << '\n'; 
    }
    
    return 0;
}

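Edit: recursion and named character classes

The approach above does not work for nested (recursive) character class subtraction, and there is no way to handle recursive substitution using regular expressions alone. This makes the solution considerably less straightforward.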

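The solution has several levels:

A function scans the regular expression looking for [.
When a [ is found, a second function handles the character class, calling itself recursively whenever it finds -[.
The pattern \p{xxxxx} is handled separately to identify named character classes. The named classes are defined in the specialCharClass map; I filled in two examples.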

#include <iostream>
#include <regex>
#include <string>
#include <vector>
#include <map>

std::map<std::string, std::string> specialCharClass = {
    {"IsDigit", "0-9"},
    {"IsBasicLatin", "a-zA-Z"}
    // Feel free to add the character classes you want
};

const std::string getCharClassByName(const std::string &pattern, size_t &pos){
    std::string key;
    while(++pos < pattern.size() && pattern[pos] != '}'){
        key += pattern[pos];
    }
    ++pos;
    return specialCharClass[key];
}

std::string translate_char_class(const std::string &pattern, size_t &pos){
    
    std::string positive;
    std::string negative;
    if(pattern[pos] != '['){
        return "";
    }
    ++pos;
    
    while(pos < pattern.size()){
        if(pattern[pos] == ']'){
            ++pos;
            if(negative.size() != 0){
                return "(?:(?!" + negative + ")[" + positive + "])";
            }else{
                return "[" + positive + "]";
            }
        }else if(pattern[pos] == '\\'){
            if(pos + 3 < pattern.size() && pattern[pos+1] == 'p'){
                positive += getCharClassByName(pattern, pos += 2);
            }else{
                positive += pattern[pos++];
                positive += pattern[pos++];
            }
        }else if(pattern[pos] == '-' && pos + 1 < pattern.size() && pattern[pos+1] == '['){
            // "-[" starts a subtracted class: translate it recursively
            if(negative.size() != 0){
                // separate alternatives inside the negative lookahead
                negative += '|';
            }
            negative += translate_char_class(pattern, ++pos);
        }else{
            positive += pattern[pos++];
        }
    }
    return '[' + positive; // unterminated class: pass it through so std::regex reports the error
}

std::string translate_regex(const std::string &pattern, size_t pos = 0){
    std::string r;
    while(pos < pattern.size()){
        if(pattern[pos] == '\\'){
            r += pattern[pos++];
            r += pattern[pos++];
        }else if(pattern[pos] == '['){
            r += translate_char_class(pattern, pos);
        }else{
            r += pattern[pos++];
        }
    }
    return r;
}

void show_matches(const std::string& in, const std::string& re)
{
    std::smatch m;
    std::regex_search(in, m, std::regex(re));
    if(m.empty()) {
        std::cout << "input=[" << in << "], regex=[" << re << "]: NO MATCH\n";
    } else {
        std::cout << "input=[" << in << "], regex=[" << re << "]: ";
        std::cout << "prefix=[" << m.prefix() << "] ";
        for(std::size_t n = 0; n < m.size(); ++n)
            std::cout << " m[" << n << "]=[" << m[n] << "] ";
        std::cout << "suffix=[" << m.suffix() << "]\n";
    }
}



int main()
{
    std::vector<std::string> tests = {
        "[a]",
        "[a-z]d",
        "[\\p{IsBasicLatin}-[\\p{IsDigit}-[89]]]",
        "[a-z-[aeiou]]{2,4}",
        "[a-z-[aeiou-[e]]]",
        "Some text [0-9-[4]] suffix", 
        "([abcde-[ae]])",
        "[a-z-[aei]]|[A-Z-[OU]] "
    };
    
    for(std::string test : tests){
       std::cout << " " << test << '\n' 
        << "   -- " << translate_regex(test) << '\n'; 
        // Construct a regex (validates the syntax; throws std::regex_error if invalid)
        std::regex(translate_regex(test)); 
    }
    std::string re = translate_regex("[a-z-[aeiou-[e]]]{2,10}");
    show_matches("abcdefghi", re);
    
    return 0;
}

score 1 · Accepted Answer

Try a library function from a library that supports XPath, such as xmlregexp in libxml (a C library). It can process XML regular expressions and apply them directly to XML.

http://www.xmlsoft.org/html/libxml-xmlregexp.html#xmlRegexp

----> http://web.mit.edu/outland/share/doc/libxml2-2.4.30/html/libxml-xmlregexp.html <----

Another possibility is PugiXML (a C++ library; see "What XML parser should I use in C++?"), but I don't think it implements the XML regular expression functionality...

score 0 · Accepted Answer

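OK, after going through the other answers I tried a few different things and ended up using the xmlRegexp functionality from libxml2.

The xmlRegexp-related functions are very poorly documented, so I figured I would post an example here, since others may find it useful: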

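#include <iostream>
#include <libxml/xmlmemory.h>   // xmlFree
#include <libxml/xmlregexp.h>
#include <libxml/xmlstring.h>   // xmlCharStrdup

int main()
{
    LIBXML_TEST_VERSION;

    xmlChar* str = xmlCharStrdup("bcdfg");
    xmlChar* pattern = xmlCharStrdup("[a-z-[aeiou]]+");
    xmlRegexp* regex = xmlRegexpCompile(pattern);

    // xmlRegexpExec returns 1 on a match, 0 on no match, negative on error
    if (regex != nullptr && xmlRegexpExec(regex, str) == 1)
    {
        std::cout << "Match!" << std::endl;
    }

    // Release libxml2 objects with libxml2's own deallocators, not free()
    xmlRegFreeRegexp(regex);
    xmlFree(pattern);
    xmlFree(str);
}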

Output:

Match!

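I also tried XMLString::patternMatch from the Xerces-C++ library, but it did not seem to use an XML Schema compliant regex engine underneath. (Honestly, I have no clue what regex engine it uses underneath; the documentation was pretty abysmal and I couldn't find any examples online, so I just gave up.)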
