java - Java XML parser error: invalid character (Unicode 0x1A) when copying/pasting from Word

Question

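Sorry for the repeat post, but my earlier post was about the Flex side:

Flex TextArea - Copy/paste from Word - Invalid unicode characters in XML parsing

Now I am posting it for the Java side.

The problem is:

We have an email feature (part of our application) in which we build an XML string and put it on a queue. Another application picks it up, parses the XML, and sends the email.

When the email body text (<BODY>....</BODY>) is copied/pasted from Word, we get an XML parser exception: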

Invalid character in attribute value BODY (Unicode: 0x1A)
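
For reference, this kind of failure can be reproduced in isolation. The following is a minimal sketch, assuming the JDK's built-in DOM parser (the parser used by the consuming application is not shown in this post, and its exact message may differ). The class and method names `InvalidCharRepro` and `parses` are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper, for illustration only.
class InvalidCharRepro {
    /** Returns true if the JDK's default DOM parser accepts the given XML text. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            // Raised for input that is not well-formed, e.g. a 0x1A character.
            return false;
        } catch (Exception e) {
            // Configuration or I/O problems are not expected in this sketch.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `parses` with an attribute value that contains `(char) 0x1A` should return false, while the same document with a plain space in its place should return true.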

Since we also use Java, I am trying to remove the invalid characters from the string with the following:

body = body.replaceAll("‘", "");
body = body.replaceAll("’", "");

// Strip invalid characters

public String stripNonValidXMLCharacters(String in) {
        StringBuffer out = new StringBuffer(); // Used to hold the output.
        char current; // Used to reference the current character.

        if (in == null || ("".equals(in))) {
            return ""; // vacancy test.
        }
        for (int i = 0; i < in.length(); i++) {
            //NOTE: No IndexOutOfBoundsException caught here; it should not happen.
            current = in.charAt(i); 
            if ((current == 0x9) 
                    || (current == 0xA) 
                    || (current == 0xD) 
                    || ((current >= 0x20) && (current <= 0xD7FF)) 
                    || ((current >= 0xE000) && (current <= 0xFFFD)) 
                    || ((current >= 0x10000) && (current <= 0x10FFFF)))
                out.append(current);
        }
        return out.toString();
    }

// Strip again

private String stripNonValidXMLCharacter(String in) {      
        if (in == null || ("".equals(in))) { 
            return null;
        }
        StringBuffer out = new StringBuffer(in);
        for (int i = 0; i < out.length(); i++) {
            if (out.charAt(i) == 0x1a) {
                out.setCharAt(i, '-');
            }
        }
        return out.toString();
    }

// Replace special characters, if any

 emailText = emailText.replaceAll("[\\u0000-\\u0008\\u000B\\u000C" 
                        + "\\u000E-\\u001F" 
                        + "\\uD800-\\uDFFF\\uFFFE\\uFFFF\\u00C5\\u00D4\\u00EC"
                        + "\\u00A8\\u00F4\\u00B4\\u00CC\\u2211]", " ");
            emailText = emailText.replaceAll("[\\x00-\\x1F]", "");
            emailText = emailText.replaceAll(
                                    "[\\x00-\\x08\\x0b\\x0c\\x0e-\\x1f]", "");
            emailText = emailText.replaceAll("\\p{C}", "");

But none of these work. The XML string also begins with:

 <?xml version="1.0" encoding="UTF-8"?>  
                    <EMAILS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNameSpaceSchemaLocation=".\\SMTPSchema.xsd\">

I think the problem occurs when there are multiple tab characters in the Word document. For example:

Text......text
<newLine>
<tab><tab><tab> text...text
<newLine>

The resulting XML string is:

<?xml version="1.0" encoding="UTF-8"?> <EMAILS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNameSpaceSchemaLocation=".\SMTPSchema.xsd"> <EMAIL SOURCE="t@t.com" DEST="t@t.com" CC="" BCC="t@t.com" SUBJECT="test 61" BODY="As such there was no mechanism constructed to migrate the enrollment user base to Data Collection or to keep security attributes for common users in sync between the two systems.  The purpose of this document is to outline two strategies for bring the user base between the two applications into sync.?  It still is the same.  ** Please note: This e-mail message was sent from a notification-only address that cannot accept incoming e-mail. Please do not reply to this message."/> </EMAILS>

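Note that the "?" marks the place where the Word document had multiple tabs. I hope my question is clear and that someone can help solve the problem.

Thanks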

score 0 · Accepted Answer

Have you tried using an XML library such as TagSoup / JSoup / JTidy to clean up your XML?

Answered 2012-10-22T15:29:15.203

score 0 · Accepted Answer

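The invalid (hidden) characters come from the UI (the Flex TextArea), so they have to be dealt with in the UI so that they are never passed on to Java. I handled this in the changingHandler of the Flex TextArea, removing the characters there to restrict what can be entered.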
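
If the text also needs to be filtered on the Java side as a safeguard, one option is to test whole code points rather than individual `char` values. The sketch below is only an illustration under stated assumptions, not the fix described above: `XmlCharFilter`, `isValidXml10`, and `strip` are hypothetical names, and the ranges follow the XML 1.0 `Char` production. Iterating by code point matters because a Java `char` can never reach 0x10000, so a `char`-based range check cannot match supplementary characters and would drop them as surrogate halves.

```java
// Hypothetical helper, for illustration only.
class XmlCharFilter {
    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every code point that is invalid in XML 1.0 removed. */
    static String strip(String in) {
        if (in == null || in.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // codePointAt joins a valid surrogate pair into one code point;
            // an unpaired surrogate comes back as itself and is filtered out.
            int cp = in.codePointAt(i);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

In this sketch, `XmlCharFilter.strip(body)` would be applied to the body text before it is placed into the XML string. Attribute values still need the usual XML escaping of `&`, `<`, and `"` afterwards.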
