java - Using a regular expression to match XML tags that contain invalid characters

Question

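I have a Java program that takes XML as input and uses a regular expression to check the characters inside each tag against a list of allowed characters. It should return every whole tag that contains a character outside the allowed list, such as a special character.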

XML input

<?xml version="1.0"?>
<PayLoad>
<requestRows>****</requestRows>
<requestRowLength>1272</requestRowLength>
<exceptionTimestamp>2012070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>201$2070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>20120(702022810680700</exceptionTimestamp>
<exceptionDetail>NO DATA AVAILABLE FOR TIME PERIOD SPECIFIED   =</exceptionDetail>
</PayLoad>

List of allowed characters

! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~

I tried the following:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is illustrative; the original post showed only the main method.
public class SpecialCharCheck
{
    public static void main(String[] args)
    {
        List<String> specialCharList = new ArrayList<String>();
        try
        {
            String responseXml = "test"; // placeholder; the real input is the XML above
            // Meant to capture each whole tag whose content holds a character outside the allowed list
            String SPECIAL_CHARACTER ="(<[\\w\\d]*>(?=[^<]*[^<\\w\\!\\#\\$\\%\\&\\\'\\(\\)\\\\ \\*\\\"\\+\\,\\-\\~\\}\\{\\.\\/\\:\\;\\=\\?\\@\\]\\[\\\\\\`\\|]).*</[\\w\\d]*>)";
            if (responseXml != null && !responseXml.isEmpty())
            {
                Pattern patternObject = Pattern.compile(SPECIAL_CHARACTER);
                Matcher patternMatcher = patternObject.matcher(responseXml);
                while (patternMatcher.find())
                {
                    specialCharList.add(patternMatcher.group());
                }
                if (specialCharList.isEmpty())
                {
                    specialCharList.add("No Special Character's Detected");
                }
            }
        }
        catch (Exception e)
        {
            e.printStackTrace(); // the original catch block was empty and hid any error
        }
        System.out.println(specialCharList);
    }
}

But it does not work as expected. How do I write a regular expression for the scenario above? Please help.

score 0 · Accepted Answer

First, let's address a few mistakes in your regular expression:

(<[\w\d]*>(?=[^<]*[^<\w\!\#\$\%\&\'\(\)\ \*\"\+\,\-\~\}\{\.\/\:\;\=\?\@\]\[\\\`\|]).*</[\w\d]*>)

\w is [A-Za-z0-9_]; including both \w and \d in the same character set [...] is redundant, so <[\w\d]*> can be simplified to <[\w]*>.
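As a quick check of that claim, here is a minimal sketch that compares the two forms on a few tag-like strings. The class name WordClassDemo and the sample strings are illustrative assumptions, not part of the original answer.

```java
import java.util.regex.Pattern;

class WordClassDemo {
    // The verbose form from the question and the simplified form.
    static final Pattern VERBOSE = Pattern.compile("<[\\w\\d]*>");
    static final Pattern SIMPLE = Pattern.compile("<\\w*>");

    // True when both patterns agree on whether the whole input is an opening tag.
    static boolean sameResult(String input) {
        return VERBOSE.matcher(input).matches() == SIMPLE.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"<requestRows>", "<row_1>", "<2012>", "<a-b>", "</x>"};
        for (String s : samples) {
            System.out.println(s + " matches=" + SIMPLE.matcher(s).matches()
                    + " same=" + sameResult(s));
        }
    }
}
```

Digits are already part of \w, so the two patterns should never disagree.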

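Also, inside a character set you do not need to escape anything except the backslash "\", the closing bracket "]", and sometimes the hyphen "-" (when it sits between two characters it would otherwise form a range). In Java the opening bracket "[" must be escaped as well, because an unescaped "[" starts a nested character class. For whitespace you should use \s, which covers the space, tab, and line-break characters. So, simplified: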

[^<\w\!\#\$\%\&\'\(\)\ \*\"\+\,\-\~\}\{\.\/\:\;\=\?\@\]\[\\\`\|]

becomes

[^<\w!#$%&'()\s*"+,\-~}{./:;=?@\]\[\\`|]
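To try the simplified character set in Java, a sketch along these lines may help. The class and method names are hypothetical. The hyphen and the opening bracket are kept escaped here on purpose: an unescaped hyphen between "," and "~" would be read as a range, and an unescaped "[" would open a nested class in java.util.regex.

```java
import java.util.regex.Pattern;

class InvalidCharClassDemo {
    // Negated set: matches any single character that is NOT in the listed set.
    static final Pattern INVALID_CHAR =
            Pattern.compile("[^<\\w!#$%&'()\\s*\"+,\\-~}{./:;=?@\\]\\[\\\\`|]");

    // True when the text holds at least one character outside the set.
    static boolean hasInvalidChar(String text) {
        return INVALID_CHAR.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasInvalidChar("201$2070202281068-0700")); // only listed characters
        System.out.println(hasInvalidChar("a-b [c] {d} \\ `|"));      // only listed characters
        System.out.println(hasInvalidChar("caf\u00e9"));              // U+00E9 is not listed
    }
}
```

Note that ">" and "^" appear in the question's list of allowed characters but not in this set, so the pattern as written treats them as invalid.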

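The following regular expression is not very efficient: the negative lookahead is evaluated at every character, which adds considerable computation. It matches any tag that contains one or more "invalid" characters.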

<[\w]*>((?!<[\w]*>).)*[^<\w!#$%&'()\s*"+,\-~}{./:;=?@\]\[\\`|]((?!<[\w]*>).)*</[\w]*>
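A minimal sketch of how this pattern could be used from Java follows. The class name, the helper findInvalidTags, and the sample input are assumptions for illustration; the sample replaces the "$" of one timestamp with "§" (U+00A7) so that the tag really contains a character outside the set. The hyphen and "[" are escaped so the pattern compiles in java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InvalidTagFinder {
    // Any run of characters that does not start a new opening tag.
    private static final String NOT_OPEN_TAG = "((?!<\\w*>).)*";
    // One character outside the set taken from the answer.
    private static final String INVALID_CHAR =
            "[^<\\w!#$%&'()\\s*\"+,\\-~}{./:;=?@\\]\\[\\\\`|]";
    static final Pattern TAG_WITH_INVALID_CHAR = Pattern.compile(
            "<\\w*>" + NOT_OPEN_TAG + INVALID_CHAR + NOT_OPEN_TAG + "</\\w*>");

    // Returns every matched tag, from opening to closing tag, in document order.
    static List<String> findInvalidTags(String xml) {
        List<String> result = new ArrayList<>();
        if (xml == null || xml.isEmpty()) {
            return result;
        }
        Matcher m = TAG_WITH_INVALID_CHAR.matcher(xml);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<PayLoad>\n"
                + "<requestRows>****</requestRows>\n"
                + "<exceptionTimestamp>201\u00a72070202281068-0700</exceptionTimestamp>\n"
                + "</PayLoad>";
        System.out.println(findInvalidTags(xml));
    }
}
```

Because "." does not match line terminators by default, each match stays on a single line, which suits input laid out like the sample in the question.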

It is worth noting that if performance matters, you should really reconsider your approach to parsing for errors.
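The answer does not say what a different approach would look like. One possible direction, offered only as a sketch, is to split the work into two cheap steps: find each leaf element with a simple pattern, then scan its text for a disallowed character. This version assumes that the allowed list is exactly the ASCII range "!" through "~" plus whitespace, and that leaf elements carry no attributes; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepTagCheck {
    // A leaf element <name>text</name> whose text contains no '<'.
    static final Pattern LEAF_ELEMENT = Pattern.compile("<(\\w+)>([^<]*)</\\1>");
    // Any character that is neither whitespace nor in the ASCII range '!'..'~'.
    static final Pattern DISALLOWED = Pattern.compile("[^\\s\\x21-\\x7E]");

    // Returns each leaf element whose text holds a disallowed character.
    static List<String> findInvalidElements(String xml) {
        List<String> invalid = new ArrayList<>();
        Matcher element = LEAF_ELEMENT.matcher(xml);
        while (element.find()) {
            if (DISALLOWED.matcher(element.group(2)).find()) {
                invalid.add(element.group());
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        String xml = "<PayLoad>\n"
                + "<requestRows>****</requestRows>\n"
                + "<exceptionDetail>NO DATA \u00a7 =</exceptionDetail>\n"
                + "</PayLoad>";
        System.out.println(findInvalidElements(xml));
    }
}
```

Neither step needs a lookahead, so each character of the input is examined a small, fixed number of times. For anything beyond flat, attribute-free input, a real XML parser would likely be the safer choice.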

This site offers a good introduction to regular expressions:

http://www.regular-expressions.info/tutorial.html
