
I have the following SOAP XML, and I want to extract the text content of all of its nodes:

<soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope" soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding"
    xmlns:m="http://www.example.org/stock">
    <soap:Body>
        <m:GetStockName>
            <m:StockName>ABC</m:StockName>
        </m:GetStockName>
        <!--some comment-->
        <m:GetStockPrice>
            <m:StockPrice>10 \n </m:StockPrice>
            <m:StockPrice>\t20</m:StockPrice>
        </m:GetStockPrice>
    </soap:Body>
</soap:Envelope>

The expected output would be:

'ABC10 \n \t20'

I did the following with DOM:

public static String parseXmlDom() throws ParserConfigurationException,
        SAXException, IOException {

    // Configure the factory before creating the builder; settings applied
    // afterwards do not affect a builder that already exists
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    factory.setIgnoringComments(true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    // Read XML file (IOUtils is from Apache Commons IO)
    String xml;
    try (FileInputStream in = new FileInputStream(new File("./files/request2.xml"))) {
        xml = IOUtils.toString(in, "UTF-8");
    }
    InputSource is = new InputSource(new StringReader(xml));
    // Parse XML string to DOM
    Document doc = builder.parse(is);
    // Extract the text of the first element in document order (the root)
    NodeList nodeList = doc.getElementsByTagNameNS("*", "*");
    Node node = nodeList.item(0);
    return node.getTextContent();
}

And with SAX:

public static String parseXmlSax() throws SAXException, IOException, ParserConfigurationException {

    final StringBuilder sb = new StringBuilder();
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();
    // Declare handler: append every chunk of character data as it arrives
    DefaultHandler handler = new DefaultHandler() {
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            sb.append(ch, start, length);
        }
    };
    // Parse XML
    saxParser.parse("./files/request2.xml", handler);
    return sb.toString();
}

With both approaches I get:

'


            ABC



            10 \n 
            \t20


'

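I know I could simply return sb.toString().replaceAll("\n", "").replaceAll("\t", ""); to get the expected result, but if my XML file is badly formatted, for example with extra spaces between elements, the result will contain those extra spaces as well.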

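I have also tried this approach of reading the XML as a single line before parsing it with SAX or DOM, but it does not work for my SOAP XML example: where the soap:Envelope start tag has a line break between attributes (before xmlns:m), the whitespace separating the attributes is stripped and the result is no longer well-formed: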

<soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope" soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding"xmlns:m="http://www.example.org/stock"><soap:Body><m:GetStockName><m:StockName>ABC</m:StockName></m:GetStockName><m:GetStockPrice><m:StockPrice>10 \n  </m:StockPrice><m:StockPrice>\t20</m:StockPrice></m:GetStockPrice></soap:Body></soap:Envelope>
[Fatal Error] :1:129: Element type "soap:Envelope" must be followed by either attribute specifications, ">" or "/>".

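How can I read only the text content of all nodes in a SOAP XML document, ignoring comments, regardless of whether the XML is on a single line or spread over several well-formatted or badly formatted lines?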
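
One direction that might be worth trying, offered as a minimal sketch rather than a confirmed answer: instead of cleaning the raw file or the final string, walk the DOM and skip text nodes that consist only of whitespace (the indentation between elements), while keeping every other text node verbatim. The names LeafTextExtractor, extractText and collect are hypothetical and chosen only for this illustration. It assumes no mixed content (an element holding both real text and child elements), since such text would be kept together with its surrounding whitespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original code
class LeafTextExtractor {

    /** Concatenates, in document order, every text node that is not whitespace-only. */
    static String extractText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Configure the factory before the builder is created
        factory.setNamespaceAware(true);
        factory.setIgnoringComments(true);
        factory.setCoalescing(true); // turn CDATA sections into plain text nodes
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        collect(doc.getDocumentElement(), out);
        return out.toString();
    }

    private static void collect(Node node, StringBuilder out) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue();
                // Indentation between elements is whitespace-only: drop it.
                // Any other text is appended unchanged, inner spaces included.
                if (!text.trim().isEmpty()) {
                    out.append(text);
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect(child, out);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Reduced stand-in for the SOAP sample; "\\n" and "\\t" are the literal
        // two-character sequences that appear in the sample's text
        String xml = "<a>\n  <b>ABC</b>\n  <!--some comment-->\n"
                + "  <c>10 \\n </c>\n  <c>\\t20</c>\n</a>";
        System.out.println("'" + extractText(xml) + "'");
    }
}
```

Because the decision is made per text node after parsing, the line breaks inside the soap:Envelope start tag are never touched, so the input can be on one line or many. The same idea should carry over to SAX, with the caveat that characters() may deliver one text run in several chunks, so the whitespace-only check would have to be applied to the text buffered between element events rather than to each call.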
