I want to allow the two main wildcards, ? and *, for filtering my data.

Here is what I do now (as I have seen on many websites):

public boolean contains(String data, String filter) {
    if(data == null || data.isEmpty()) {
        return false;
    }
    String regex = filter.replace(".", "[.]")
                         .replace("?", ".")
                         .replace("*", ".*");
    return Pattern.matches(regex, data);
}
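
To make the limits of this conversion concrete, here is a small check of the same replacements against filters that contain other metacharacters. The class name NaiveWildcardDemo and the sample inputs are assumptions made for illustration only.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class NaiveWildcardDemo {
    // Same conversion as the method above: only '.', '?' and '*' are handled.
    public static String toRegex(String filter) {
        return filter.replace(".", "[.]")
                     .replace("?", ".")
                     .replace("*", ".*");
    }

    public static boolean matches(String data, String filter) {
        return Pattern.matches(toRegex(filter), data);
    }

    public static void main(String[] args) {
        // '|' is left unescaped, so the filter behaves as an alternation.
        System.out.println(matches("a", "a|b"));   // true
        System.out.println(matches("a|b", "a|b")); // false

        // An unbalanced '(' does not even compile as a regex.
        try {
            matches("f(x", "f(*");
        } catch (PatternSyntaxException e) {
            System.out.println("invalid regex: " + e.getDescription());
        }
    }
}
```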

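But shouldn't we escape all the other regex special characters as well, such as | or ( and so on? Also, maybe we could keep ? and * as literal characters when they are preceded by a \? For example, something like: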

filter.replaceAll("([$|\\[\\]{}(),.+^-])", "\\\\$1") // 1. escape regex special chars, but ?, * and \
      .replaceAll("([^\\\\]|^)\\?", "$1.")           // 2. replace any ? that isn't preceded by a \ by .
      .replaceAll("([^\\\\]|^)\\*", "$1.*")          // 3. replace any * that isn't preceded by a \ by .*
      .replaceAll("\\\\([^?*]|$)", "\\\\\\\\$1");    // 4. replace any \ that isn't followed by a ? or a * (possibly due to step 2 and 3) by \\

What do you think? If you agree, am I missing any other regex special characters?


Edit #1 (after taking dan1111's and m.buettner's advice into account):

// replace any even number of backslashes by a *
regex = regex.replaceAll("(?<!\\\\)(\\\\\\\\)+(?!\\\\)", "*");
// reduce redundant wildcards that aren't preceded by a \
regex = regex.replaceAll("(?<!\\\\)[?]*[*][*?]+", "*");
// escape regexps special chars, but \, ? and *
regex = regex.replaceAll("([|\\[\\]{}(),.^$+-])", "\\\\$1");
// replace ? that aren't preceded by a \ by .
regex = regex.replaceAll("(?<!\\\\)[?]", ".");
// replace * that aren't preceded by a \ by .*
regex = regex.replaceAll("(?<!\\\\)[*]", ".*");

How about this?


Edit #2 (after taking dan1111's advice into account):

// replace any even number of backslashes by a *
regex = regex.replaceAll("(?<!\\\\)(\\\\\\\\)+(?!\\\\)", "*");
// reduce redundant wildcards that aren't preceded by a \
regex = regex.replaceAll("(?<!\\\\)[?]*[*][*?]+", "*");
// escape regexps special chars (if not already escaped by user), but \, ? and *
regex = regex.replaceAll("(?<!\\\\)([|\\[\\]{}(),.^$+-])", "\\\\$1");
// replace ? that aren't preceded by a \ by .
regex = regex.replaceAll("(?<!\\\\)[?]", ".");
// replace * that aren't preceded by a \ by .*
regex = regex.replaceAll("(?<!\\\\)[*]", ".*");

Am I getting close?
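
For reference, the five replacements of Edit #2 can be wrapped in a helper and traced on a sample filter. The class and method names (WildcardSteps, toRegex) and the sample filters are assumptions made for illustration; the replacements themselves are copied unchanged from Edit #2.

```java
public class WildcardSteps {
    // Applies the five replacements from Edit #2 in order.
    public static String toRegex(String filter) {
        String regex = filter;
        // 1. an even run of backslashes becomes a single *
        regex = regex.replaceAll("(?<!\\\\)(\\\\\\\\)+(?!\\\\)", "*");
        // 2. collapse redundant unescaped wildcards
        regex = regex.replaceAll("(?<!\\\\)[?]*[*][*?]+", "*");
        // 3. escape unescaped regex special chars, except \, ? and *
        regex = regex.replaceAll("(?<!\\\\)([|\\[\\]{}(),.^$+-])", "\\\\$1");
        // 4. unescaped ? becomes .
        regex = regex.replaceAll("(?<!\\\\)[?]", ".");
        // 5. unescaped * becomes .*
        regex = regex.replaceAll("(?<!\\\\)[*]", ".*");
        return regex;
    }

    public static void main(String[] args) {
        String regex = toRegex("a.b*c?");
        System.out.println(regex);                    // a\.b.*c.
        System.out.println("a.bXXcY".matches(regex)); // true
        System.out.println("aXbXXcY".matches(regex)); // false, the '.' is literal
        System.out.println(toRegex("a**?b"));         // a.*b
    }
}
```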

3 Answers

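A note on the backslash counts: in Java source, four backslashes in the replacement string are what it takes to write out a single one. The compiler reduces "\\\\" to two characters, and replaceAll then reads those two as one escaped backslash. So "\\\\$1" in step 1 is correct, while the eight backslashes in step 4 write out two.

You can avoid ([^\\\\]|^) and the $1 in the replacement string by using a negative lookbehind:

filter.replaceAll("([$|\\[\\]{}(),.+^-])", "\\\\$1") // 1. escape regex special chars, but ?, * and \
      .replaceAll("(?<!\\\\)[?]", ".")             // 2. replace any ? that isn't preceded by a \ by .
      .replaceAll("(?<!\\\\)[*]", ".*");           // 3. replace any * that isn't preceded by a \ by .*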

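I don't really understand what you need the last step for. Won't it escape the backslashes that escape your metacharacters (which in turn means the metacharacters are no longer escaped)? Say your original input contains th|is. Your first replacement turns that into th\|is. The last replacement then turns it into th\\|is, which matches either th followed by a backslash, or is.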

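You need to distinguish between how your string looks in the source code (uncompiled, with twice as many backslashes) and how it looks after compilation (with only half as many).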
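
The difference can be checked directly. The sketch below (the class name BackslashDemo and the helper escapePipe are assumptions made for illustration) prints the runtime contents of a few literals. Note that String.replaceAll adds one more layer: it treats \ and $ as special in the replacement string too.

```java
public class BackslashDemo {
    // Prefixes every '|' with a single backslash.
    // Source "\\\\$1" is the runtime text \\$1, which replaceAll reads as
    // "one literal backslash, then group 1".
    public static String escapePipe(String s) {
        return s.replaceAll("([|])", "\\\\$1");
    }

    public static void main(String[] args) {
        // The source literal "\\\\" holds two characters at runtime.
        System.out.println("\\\\".length()); // 2

        // As a regex, those two characters match one literal backslash.
        System.out.println("a\\b".matches("a\\\\b")); // true

        System.out.println(escapePipe("th|is")); // th\|is
    }
}
```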

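You might also want to consider limiting the number of * allowed. A regex like .*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*! (where ! does not occur in the input) can take a very long time to run. This problem is known as catastrophic backtracking.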
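
One way to limit the wildcards, sketched under some assumptions: collapse runs of * into one and reject filters that still contain too many. The class name StarLimiter, the method names and the MAX_STARS bound are hypothetical, and for brevity the sketch does not treat \* as an escaped literal.

```java
public class StarLimiter {
    // Hypothetical upper bound on '*' per filter; tune it to your data.
    static final int MAX_STARS = 5;

    // Collapses runs of '*' into one; "a***b" and "a*b" accept the same inputs.
    public static String collapseStars(String filter) {
        return filter.replaceAll("[*]{2,}", "*");
    }

    // Returns the collapsed filter, or throws if it still has too many '*'.
    public static String checkedFilter(String filter) {
        String collapsed = collapseStars(filter);
        long stars = collapsed.chars().filter(c -> c == '*').count();
        if (stars > MAX_STARS) {
            throw new IllegalArgumentException("too many '*' in filter: " + stars);
        }
        return collapsed;
    }

    public static void main(String[] args) {
        System.out.println(collapseStars("a***b"));  // a*b
        System.out.println(checkedFilter("*a**b*")); // *a*b*
    }
}
```

Collapsing alone removes the adjacent .*.* case from the example above; the cap is there for filters such as *a*a*a*a*, where the wildcards are separated by literals.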

Answered 2012-12-13T15:25:00.433

Finally, here is the solution I adopted (using the Apache Commons Lang library):

public static boolean isFiltered(String data, String filter) {
    // no filter: return true
    if (StringUtils.isBlank(filter)) {
        return true;
    }
    // a filter but no data: return false
    else if (StringUtils.isBlank(data)) {
        return false;
    }
    // a filter and a data:
    else {
        // case insensitive
        data = data.toLowerCase();
        filter = filter.toLowerCase();
        // .matches() auto-anchors, so add [*] (i.e. "containing")
        String regex = "*" + filter + "*";
        // replace any pair of backslashes by [*]
        regex = regex.replaceAll("(?<!\\\\)(\\\\\\\\)+(?!\\\\)", "*");
        // minimize unescaped redundant wildcards
        regex = regex.replaceAll("(?<!\\\\)[?]*[*][*?]+", "*");
        // escape unescaped regexps special chars, but [\], [?] and [*]
        regex = regex.replaceAll("(?<!\\\\)([|\\[\\]{}(),.^$+-])", "\\\\$1");
        // replace unescaped [?] by [.]
        regex = regex.replaceAll("(?<!\\\\)[?]", ".");
        // replace unescaped [*] by [.*]
        regex = regex.replaceAll("(?<!\\\\)[*]", ".*");
        // return whether data matches regex or not
        return data.matches(regex);
    }
}

Many thanks to @dan1111 and @m.buettner for their valuable help ;)

Answered 2012-12-14T14:57:07.790

Try this simpler version:

String regex = Pattern.quote(filter).replace("*", "\\E.*\\Q").replace("?", "\\E.\\Q");

This quotes the whole filter with \Q and \E, then stops quoting at * and ?, replacing them with their equivalent patterns (.* and .).

I tested it with:

String simplePattern = "ab*g\\Ei\\.lmn?p";
String data = "abcdefg\\Ei\\.lmnop";
String quotedPattern = Pattern.quote(simplePattern);
System.out.println(quotedPattern);
String regex = quotedPattern.replace("*", "\\E.*\\Q").replace("?", "\\E.\\Q");
System.out.println(regex);
System.out.println(data.matches(regex));

Output:

\Qab*g\E\\E\Qi\.lmn?p\E
\Qab\E.*\Qg\E\\E\Qi\.lmn\E.\Qp\E
true

Note that this relies on Oracle's implementation of Pattern.quote; I don't know whether other valid implementations exist.
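
If depending on the exact shape of the quoted string is a concern, one option is to quote each literal run on its own and join the pieces with the wildcard patterns. This is a sketch of that variation, not the code tested above; the class name QuotedWildcard and the helper flush are assumptions. Like the one-liner, it has no way to escape a literal * or ?.

```java
import java.util.regex.Pattern;

public class QuotedWildcard {
    // Quotes each literal run separately, so the result does not depend on
    // what the string returned by Pattern.quote looks like.
    public static String toRegex(String filter) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : filter.toCharArray()) {
            if (c == '*' || c == '?') {
                flush(literal, regex);
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        flush(literal, regex);
        return regex.toString();
    }

    // Appends the pending literal text, quoted, and resets the buffer.
    private static void flush(StringBuilder literal, StringBuilder regex) {
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
            literal.setLength(0);
        }
    }

    public static void main(String[] args) {
        String regex = toRegex("ab*g\\Ei\\.lmn?p");
        System.out.println("abcdefg\\Ei\\.lmnop".matches(regex)); // true
        System.out.println("abxc".matches(toRegex("ab.c")));      // false
    }
}
```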

Answered 2013-03-01T15:02:33.863