
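I have a situation where I need to read several XML files and build a model from them. Sadly, the files are generated by a legacy system that I absolutely cannot change.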

One of the XML files giving me trouble looks more or less like this (altered to remove proprietary data):

<resource lang="en" dataId="900">
 numbered content here, 900-919 ...

    <string name="920-name">Document Shredder</string>
    <string name="920-desc">A machine ideal for destroying documents that deserve it. It can cross-shred anything from tissue paper to small netbooks with minimal noise. Remember, hackers can't access the documents if you've shredded the drives.</string>
    <string name="920-cat">office,appliance</string>
    <string name="921-name">Plastic Ladle</string>
    <string name="921-desc">This is a big plastic ladle, ideal for soups and sauces.</string>
    <string name="921-cat">kitchen,utensils</string>

... similar numbered content here, 922-934 ...

    <string name="935-name">Green Laser Pointer</string>
    <string name="935-desc">A High-Powered green laser pointer, ideal for irritating cats.</string>
    <string name="935-cat">office,tool</string>
    <string name="936-name">Black Metal Filing Cabinet</string>
    <string name="936-desc">A large, metal cabinet (black) built to store hanging file folders.</string>
    <string name="936-cat">office,storage</string>

... similar numbered content here, 937-994
</resource>

I parse this into a List<CString>, where CString.java is:

public class CString {
    public String name;
    public String body;

    @Override
    public String toString() {
        return "CString {!name: " + name + " !body: " + body + "}\n";
    }
}

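I tried using DocumentBuilder and, when that didn't work properly, a plain SaxParser. Whatever I do, though, when I look back at my CStrings, some of the bodies contain unparsed tags from a different part of the document. For example, printing the List<CString> mentioned above can produce something like this: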

[ CStrings for 900-919 ...

, CString {!name: 920-name !body: Document Shredder}
, CString {!name: 920-desc !body: irritating cats.</string>
    <string name="935-cat">office,tool</string>
    <string name="936-name">Black Metal Filing Cabinet</e. Remember, hackers can't access the documents if you've shredded the drives.}
, CString {!name: 920-cat !body: office,appliance}
, CString {!name: 921-name !body: Plastic Ladle}
, CString {!name: 921-desc !body: This is a big plastic ladle, ideal for soups and sauces.}
, CString {!name: 921-cat !body: kitchen,utensils}

... CStrings for 922-934 ... 

, CString {!name: 935-name !body: Green Laser Pointer}
, CString {!name: 935-desc !body: A High-Powered green laser pointer, ideal for irritating cats.}
, CString {!name: 935-cat !body: office,tool}
, CString {!name: 936-name !body: Black Metal Filing Cabinet}
, CString {!name: 936-desc !body: A large, metal cabinet (black) built to store hanging file folders.}
, CString {!name: 936-cat !body: office,storage}

... CStrings for 937-994
]

In the SaxParser version of my code, I have the following characters method in my DefaultHandler:

public void characters(char ch[], int start, int length) throws SAXException {
    String value = new String(ch, start, length).trim();
    switch(currentQName.toString()) { // currentQName is a StringBuilder that holds just the current xml element's name
        case "string":
            if (value.contains("</string")) {
                System.err.println("!!! Parse Error !!! " + value);
            }
            break;
    }
}

As you may have guessed, this produces:

!!! Parse Error !!! irritating cats.</string>
        <string name="935-cat">office,tool</string>
        <string name="936-name">Black Metal Filing Cabinet</e. Remember, hackers can't access the documents if you've shredded the drives.

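I don't normally ask questions this obscure, especially when I can't provide concrete data and code, but searching Google hasn't turned up anything I could pin down, and the code certainly isn't throwing (or suppressing) any exceptions.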

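One thing I noticed is that when bad data shows up, as in the CString for 920-desc above, the bad data is 138 characters long and, not coincidentally, the good data resumes exactly at character 139 of the content, where it should. That makes me think this is some kind of buffering problem. But whether I let DocumentBuilder manage the buffers or try to manage them more manually with SaxParser, I still get exactly the same bad text in the same places every time. Finally, I have never noticed bad text in the shorter strings, name and cat, which I think also points to a char buffer problem.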
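
One way to rule out chunked delivery as the cause: SAX parsers are allowed to deliver the text of a single element across several characters() calls, so handlers usually append to a buffer and read it only in endElement. The following is a minimal sketch of that pattern, not the code from this project; the class name StringElementHandler and the parse helper are hypothetical, and it assumes the default (non-namespace-aware) parser so that qName is simply "string".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the complete text of every <string> element.
class StringElementHandler extends DefaultHandler {
    final List<String[]> entries = new ArrayList<>(); // {name, body} pairs
    private final StringBuilder text = new StringBuilder();
    private String currentName; // non-null only while inside a <string> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("string".equals(qName)) {
            currentName = attrs.getValue("name");
            text.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append rather than assign.
        if (currentName != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("string".equals(qName)) {
            entries.add(new String[] { currentName, text.toString().trim() });
            currentName = null;
        }
    }

    // Convenience wrapper; checked parser exceptions are rethrown unchecked.
    static List<String[]> parse(String xml) {
        try {
            StringElementHandler handler = new StringElementHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.entries;
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }
}
```

If the bodies are still corrupted with a handler shaped like this, the parser's chunking is unlikely to be the culprit, and the text handed to the parser is the next thing to inspect.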

Any ideas would help!


2 Answers


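I found a place in the code where special characters were being cleaned out unnecessarily (I assume to work around poorly formatted source files in the past).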

This is the method that previously did all the stripping:

private static InputSource getCleanSource(File file) {
    InputSource source = null;
    try {
        InputStream stream = new FileInputStream(file);
        String fileText = readFile(stream); // Gets file content as text from InputStream

        CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder();
        utf8Decoder.onMalformedInput(CodingErrorAction.IGNORE);
        utf8Decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
        CharBuffer parsed = utf8Decoder.decode(ByteBuffer.wrap(readFile(stream).getBytes()));

        fileText = "<?xml version=\"1.1\" encoding=\"UTF-8\" ?>\n" + // put a good header
                parsed
                .replaceAll("<\\?.*?\\?>", "") // remove bad <?xml> tags
                .replaceAll("--+","--") // can't have <!--- text --->
                .replaceFirst("(?s)^.+?<\\?", "<?") // remove bad stuff before <?xml> tag
                .replaceAll("[^\\x20-\\x7e\\x0A]", "") // remove bad characters
                .replaceAll("[\\x0A]", " ") // remove line breaks
                ;
        Reader reader = new StringReader(fileText);
        source = new InputSource(reader);
    } catch (Throwable t) {
        System.err.println("Unknown trouble parsing: " + file.getName());
        t.printStackTrace();
    }

    return source;
}

After some review and adjustment, the problem goes away if I change this method to:

private static InputSource getCleanSource(File file) {
    InputSource source = null;
    try {
        InputStream stream = new FileInputStream(file);
        String fileText = readFile(stream) // Gets file content as text from InputStream
                .replaceAll("--+","--") // can't have <!--- text --->
                .replaceFirst("(?s)^.+?<\\?", "<?") // remove bad stuff before <?xml> tag
                ;
        Reader reader = new StringReader(fileText);
        source = new InputSource(reader);
    } catch (Throwable t) {
        System.err.println("Unknown trouble parsing: " + file.getName());
        t.printStackTrace();
    }

    return source;
}

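I haven't had time to go back and work out which mystery characters or tags the cleaning step was swallowing. I have to assume the source system originally produced far less valid XML than it does now, which is why such aggressive cleanup was needed, but I don't think I'll ever know for sure.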
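
For anyone who does want to chase down what the old cleanup removed, one starting point is to list every character that the filter [^\x20-\x7e\x0A] would delete from a file's text. This is only a sketch: the class StrippedCharReport and its method are hypothetical names, it counts UTF-16 chars rather than code points, and it covers just that one replaceAll, not the other steps of the original method.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical diagnostic helper, not part of the original code.
class StrippedCharReport {
    // Counts each char that the pattern [^\x20-\x7e\x0A] would delete.
    static Map<String, Integer> strippedCharacters(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // The filter keeps printable ASCII (0x20-0x7e) and line feed (0x0A).
            boolean kept = (c >= 0x20 && c <= 0x7e) || c == 0x0A;
            if (!kept) {
                counts.merge(String.format("U+%04X", (int) c), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Tab, carriage return, and any non-ASCII character are all dropped.
        String sample = "caf\u00e9\tmenu\r\nline two";
        System.out.println(strippedCharacters(sample));
    }
}
```

Note that the filter drops tabs and carriage returns as well as non-ASCII characters, so even a plain Windows-style text file would be altered by it.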

answered 2013-01-28T20:27:57.583

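You almost certainly don't have well-formed XML. (Your comment about absolutely not being allowed to change the source system is an ominous sign, but you are not the only one stuck in this predicament.)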

Take a look at this question: How to parse malformed XML in Java?

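If I were you, I would use raw string manipulation and/or regular expressions to either extract the data directly or repair it into well-formed XML. By the way, JAXB is a much nicer way to work with XML in Java (though it still requires well-formed input).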
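
As a rough illustration of the regex route, the sketch below pulls name/body pairs straight out of the text without an XML parser. It assumes every entry has exactly the shape shown in the question, <string name="...">...</string>, with no nested tags; the class and method names are made up for the example, and it does not decode entities such as &amp;.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extractor for entries shaped like the ones in the question.
class StringEntryExtractor {
    // Group 1 is the name attribute, group 2 the body; DOTALL lets a body span lines.
    private static final Pattern ENTRY = Pattern.compile(
            "<string\\s+name=\"([^\"]*)\"\\s*>(.*?)</string>", Pattern.DOTALL);

    // Returns name -> body in document order; a repeated name keeps the last body.
    static Map<String, String> extract(String text) {
        Map<String, String> entries = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            entries.put(m.group(1), m.group(2).trim());
        }
        return entries;
    }
}
```

Because the pattern ignores everything outside the matched elements, junk before the XML declaration or between entries does not stop the extraction, which is the main attraction when the source cannot be fixed.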

answered 2013-01-15T22:47:24.913