
I've been using JAXB to parse XML that looks roughly like this:

<report>    <-- corresponds to a "wrapper" object that holds 
                some properties and two lists - a list of A's and list of B's
    <some tags with> general <info/>
    ...
    <A>   <-- corresponds to an "A" object with some properties
        <some tags with> info related to the <A> tag <bla/>
        ...
    </A>
    <B>   <-- corresponds to a "B" object with some properties
        <some tags with> info related to the <B> tag <bla/>
        ...
    </B>
</report>

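The party responsible for marshalling the XML is terrible, but that is out of my control.
It often sends invalid XML characters and/or malformed XML.
I talked to the responsible party and got many of the errors fixed, but some seem impossible to fix.
I want my parser to be as forgiving of these errors as possible and, where that isn't possible, to recover as much information as it can from the faulty XML.
So if the XML contains 100 A's and one of them is broken, I still want to be able to keep the other 99.
These are my most common problems: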

1. Some info tag inner value contains invalid chars
    <bla> invalid chars here, either control chars or just &>< </bla>
2. The root entity is missing a closing tag
    <report> ..... stuff here .... NO </report> at the end!
3. An inner entity (A/B) is missing its closing tag, or it's somehow malformed.
    <A> ...stuff here... <somethingMalformed_blabla_A/>
    OR
    <A> ...  Something malformed here...</A>

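I hope I've explained myself well.
I really want to get as much information as possible out of these XMLs, even when they have problems.
I think I need some strategy that combines StAX/SAX with JAXB, but I'm not sure how.
If one A out of 100 has an XML problem, I don't mind throwing away just that A.
That said, it would be even better if I could get an A object holding as much data as could be parsed before the error occurred.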


2 Answers


The philosophy of XML is that creators of XML are responsible for creating well-formed XML; recipients are not responsible for repairing bad XML on arrival. XML parsers are required to reject ill-formed XML. There are other "tidy" tools that may be able to convert bad XML into good XML, but depending on the nature of the flaws in the input, it's unpredictable how well they will work. If you're going to get the benefits of using XML for data interchange, it needs to be well-formed. Otherwise you might just as well use your own proprietary format.
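
To make the "parsers are required to reject ill-formed XML" point concrete, here is a minimal sketch using only the JDK's built-in DOM parser. The class name `WellFormedCheck` and the helper `firstFatalError` are hypothetical names chosen for illustration, and the sample documents are made up to mirror the problems listed in the question (a missing root closing tag and a bare `&`).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
final class WellFormedCheck {

    /** Returns null if the text is well-formed XML, otherwise a short error description. */
    static String firstFatalError(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser from
            // printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return null;
        } catch (SAXParseException ex) {
            // A well-formedness violation is a fatal error: no document is returned.
            return "line " + ex.getLineNumber() + ": " + ex.getMessage();
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // Well-formed document.
        System.out.println(firstFatalError("<report><A>ok</A></report>"));
        // Root element is never closed.
        System.out.println(firstFatalError("<report><A>ok</A>"));
        // Bare ampersand inside text content.
        System.out.println(firstFatalError("<report><A>a & b</A></report>"));
    }
}
```

The exact wording of the error messages depends on the parser implementation, so the sketch only distinguishes "accepted" from "rejected".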

answered 2012-07-16T11:06:52.613

This answer really helped me:

JAXB - unmarshal XML exception

In my case, I'm parsing results from Sysinternals Autoruns tool with the XML switch (-x). Either because the results were being written to a file share or for some buggy reason in the newer version, the XML would be malformed near the end. Since this Autoruns capture is critical for malware investigations, I really wanted the data. Plus I could tell from the file size that the results were nearly complete.

The solution in the linked question works really well when you have a document with many sub-elements, as suggested by the OP. In particular, the Autoruns XML output is really simple and consists of many "items", each consisting of many simple elements with text (i.e. String properties as generated by XJC). So if a few items are missed at the end, no big deal... unless of course it's something related to malware. :)

Here's my code:

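// Imports assume JAXB 2.x (javax.xml.bind); newer stacks use jakarta.xml.bind.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBElement;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class Loader {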

    private List<Exception> exceptions = new ArrayList<>();

    public synchronized List<Exception> getExceptions() {
        return new ArrayList<>(exceptions);
    }

    protected void setExceptions(List<Exception> exceptions) {
        this.exceptions = exceptions;
    }

    public synchronized Autoruns load(File file, boolean attemptRecovery)
      throws LoaderException {
        Unmarshaller unmarshaller;
        try {
            JAXBContext context = JAXBContext.newInstance(Autoruns.class);
            unmarshaller = context.createUnmarshaller();
        } catch (JAXBException ex) {
            throw new LoaderException("Could not create unmarshaller.", ex);
        }
        try {
            return (Autoruns) unmarshaller.unmarshal(file);
        } catch (JAXBException ex) {
            if (!attemptRecovery) {
                throw new LoaderException(ex.getMessage(), ex);
            }
        }
        exceptions.clear();
        Autoruns autoruns = new Autoruns();
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        // try-with-resources closes the file even if parsing fails part-way
        try (InputStream in = new FileInputStream(file)) {
            XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
            while (eventReader.hasNext()) {
                XMLEvent event = eventReader.peek();
                if (event.isStartElement()) {
                    StartElement start = event.asStartElement();
                    if (start.getName().getLocalPart().equals("item")) {
                        // note the try should allow processing of elements
                        // after this item in the event it is malformed
                        try {
                            JAXBElement<Autoruns.Item> jaxb =
                              unmarshaller.unmarshal(eventReader,
                                                     Autoruns.Item.class);
                            autoruns.getItem().add(jaxb.getValue());
                            // unmarshal() already consumed the whole item, so
                            // do not skip the event that follows it
                            continue;
                        } catch (JAXBException ex) {
                            exceptions.add(ex);
                        }
                    }
                }
                // nextEvent() reports parse errors as XMLStreamException
                eventReader.nextEvent();
            }
        } catch (XMLStreamException | IOException ex) {
            exceptions.add(ex);
        }
        return autoruns;
    }

    public static Autoruns load(Path path) throws JAXBException {
        return load(path.toFile());
    }

    public static Autoruns load(File file) throws JAXBException {
        JAXBContext context = JAXBContext.newInstance(Autoruns.class);
        Unmarshaller unmarshaller = context.createUnmarshaller();
        return (Autoruns) unmarshaller.unmarshal(file);
    }

    public static class LoaderException extends Exception {

        public LoaderException(String message) {
            super(message);
        }

        public LoaderException(String message, Throwable cause) {
            super(message, cause);
        }
    }
}
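
For readers who want to try the idea without the XJC-generated `Autoruns` classes or a JAXB dependency, here is a rough, self-contained sketch of the same salvage approach using plain StAX from the JDK. This is a swapped-in simplification, not the code above: `ItemSalvager`, its `read` method, and the map-per-item representation are hypothetical, and it assumes each `item` contains only simple text elements. It keeps every item that was read completely and stops at the first well-formedness error, so items after the error are lost.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only.
final class ItemSalvager {

    /** Items that were read completely before parsing stopped. */
    final List<Map<String, String>> items = new ArrayList<>();
    /** The error that stopped parsing, or null if the input was well-formed. */
    XMLStreamException failure;

    static ItemSalvager read(String xml) {
        ItemSalvager result = new ItemSalvager();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            Map<String, String> current = null; // non-null while inside an <item>
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (reader.getLocalName().equals("item")) {
                        current = new LinkedHashMap<>();
                    }
                    text.setLength(0); // start collecting text for this element
                } else if ((event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) && current != null) {
                    text.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT && current != null) {
                    if (reader.getLocalName().equals("item")) {
                        result.items.add(current); // only complete items are kept
                        current = null;
                    } else {
                        current.put(reader.getLocalName(), text.toString().trim());
                        text.setLength(0);
                    }
                }
            }
        } catch (XMLStreamException ex) {
            // The parser cannot continue past a well-formedness error;
            // whatever was collected so far is still returned.
            result.failure = ex;
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up input that is cut off in the middle of the second item.
        String truncated = "<autoruns><item><name>a</name><path>/x</path></item>"
                + "<item><name>b</na";
        ItemSalvager result = read(truncated);
        System.out.println("recovered items: " + result.items);
        System.out.println("parse stopped by: " + result.failure);
    }
}
```

Recording the failure instead of throwing mirrors the `exceptions` list in the `Loader` class: the caller gets the partial data and can still see that something went wrong.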
answered 2015-03-31T17:34:25.800