I have the following SOAP XML, and I want to extract the text content of all its nodes:
<soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope" soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding"
xmlns:m="http://www.example.org/stock">
<soap:Body>
<m:GetStockName>
<m:StockName>ABC</m:StockName>
</m:GetStockName>
<!--some comment-->
<m:GetStockPrice>
<m:StockPrice>10 \n </m:StockPrice>
<m:StockPrice>\t20</m:StockPrice>
</m:GetStockPrice>
</soap:Body>
</soap:Envelope>
The expected output would be:
'ABC10 \n \t20'
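Here is what I did with DOM:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.apache.commons.io.IOUtils; // Apache Commons IO
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public static String parseXmlDom() throws ParserConfigurationException, SAXException, IOException {
    // Configure the factory before creating the builder; settings applied afterwards have no effect
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    factory.setIgnoringComments(true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    // Read XML file
    String xml = IOUtils.toString(new FileInputStream(new File("./files/request2.xml")), "UTF-8");
    InputSource is = new InputSource(new StringReader(xml));
    // Parse XML string to DOM
    Document doc = builder.parse(is);
    // The first matching element is the root, so its text content covers all descendants
    NodeList nodeList = doc.getElementsByTagNameNS("*", "*");
    Node node = nodeList.item(0);
    return node.getTextContent();
}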
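And with SAX:
import java.io.File;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public static String parseXmlSax() throws SAXException, IOException, ParserConfigurationException {
    final StringBuilder sb = new StringBuilder();
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();
    // Declare handler: append every chunk of character data as it is reported
    DefaultHandler handler = new DefaultHandler() {
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            sb.append(ch, start, length);
        }
    };
    // Parse XML
    saxParser.parse(new File("./files/request2.xml"), handler);
    return sb.toString();
}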
Both approaches give me:
'
ABC
10 \n
\t20
'
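I know I can simply return sb.toString().replaceAll("\n", "").replaceAll("\t", "") to get the expected result, but if my XML file is badly formatted, for example with extra spaces, the result will contain those extra spaces too.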
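To illustrate the limitation just described, here is a minimal sketch. The input string is a hypothetical stand-in for what the parser might return if the XML were indented with two spaces; it is not output from the actual file, and the class name StripDemo is made up for the example.

```java
public class StripDemo {
    // Same cleanup as in the question: remove newline and tab characters only.
    static String strip(String text) {
        return text.replaceAll("\n", "").replaceAll("\t", "");
    }

    public static void main(String[] args) {
        // Hypothetical parser output for an XML file indented with two spaces.
        // "\\n" and "\\t" are the literal two-character sequences from the element text.
        String indented = "\n  ABC\n  10 \\n \n  \\t20\n";
        // The indentation spaces survive the cleanup.
        System.out.println("'" + strip(indented) + "'");
    }
}
```

The line breaks are gone, but the spaces used for indentation remain in the result, so the output differs from the expected 'ABC10 \n \t20'.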
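In addition, I have tried this approach of reading the XML file as a single line before parsing it with SAX or DOM, but it doesn't work for my SOAP XML example because it removes the whitespace between the attributes of soap:Envelope wherever there is a line break (here, before xmlns:m):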
<soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope" soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding"xmlns:m="http://www.example.org/stock"><soap:Body><m:GetStockName><m:StockName>ABC</m:StockName></m:GetStockName><m:GetStockPrice><m:StockPrice>10 \n </m:StockPrice><m:StockPrice>\t20</m:StockPrice></m:GetStockPrice></soap:Body></soap:Envelope>
[Fatal Error] :1:129: Element type "soap:Envelope" must be followed by either attribute specifications, ">" or "/>".
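The failure can be reproduced on a much smaller document. The sketch below is an illustration only: the element names, namespace URIs, and the class name JoinLinesDemo are made up. It joins the same lines once with an empty separator and once with a single space, then checks whether each result parses.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class JoinLinesDemo {
    // Hypothetical document whose start tag is split across two lines.
    static final String[] LINES = {
        "<a:root xmlns:a=\"urn:a\"",
        "xmlns:b=\"urn:b\">",
        "<b:item>x</b:item>",
        "</a:root>"
    };

    static String join(String separator) {
        return String.join(separator, LINES);
    }

    // Returns true if the string parses as well-formed XML, false on a parse error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Attributes end up fused together: ..."urn:a"xmlns:b=...
        System.out.println(isWellFormed(join("")));
        // A space keeps the attributes separated.
        System.out.println(isWellFormed(join(" ")));
    }
}
```

Joining with a space avoids the fatal error, but it also leaves a whitespace-only text node between elements, so on its own it does not solve the extra-whitespace problem.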
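How can I read only the text content of all nodes in a SOAP XML, regardless of whether the file is on a single line or spread over multiple well-formatted or badly formatted lines (and ignoring comments)?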
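One possible direction, offered as a sketch rather than a definitive answer: buffer the character data between element boundaries and keep a run only if it contains something other than whitespace. This assumes that whitespace-only text between tags is never wanted, which holds for the sample above but may not hold for mixed content. The class and method names (TextExtractor, extractText) are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TextExtractor {
    // Concatenates all text runs that are not whitespace-only.
    public static String extractText(String xml) {
        final StringBuilder result = new StringBuilder();
        final StringBuilder pending = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // A single text run may arrive in several chunks, so only buffer here.
                pending.append(ch, start, length);
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                flush();
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                flush();
            }

            // Keep the buffered run unchanged if it has non-whitespace content, else drop it.
            private void flush() {
                if (!pending.toString().trim().isEmpty()) {
                    result.append(pending);
                }
                pending.setLength(0);
            }
        };

        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String xml = "<r>\n  <a>ABC</a>\n  <!--some comment-->\n  <b>10 \\n </b>\n  <b>\\t20</b>\n</r>";
        System.out.println("'" + extractText(xml) + "'");
    }
}
```

Comments are not reported to the handler at all, so they need no special treatment. Text inside an element is kept exactly as written, including its leading and trailing spaces; only the runs made up entirely of whitespace are dropped.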