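I use a SAX parser in my application to parse XML that arrives as a string. When my code passes an HTML body as the string, the SAX parser hangs for a very long time (more than 5 hours).

The feed URL I want to parse is "http://www.cityam.com/taxonomy/term/1/all/feed". This URL returns an HTML page rather than XML. How can I handle this kind of problem, or how can I make my SAXParser exit with an appropriate exception? My code is here: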

public List<RssEntry> parseDocument(String body) {
    // The body is expected to be XML, but parsing hangs when it is an HTML page.
    SAXParserFactory factory = SAXParserFactory.newInstance();
    try {
        SAXParser parser = factory.newSAXParser();
        XMLReader reader = parser.getXMLReader();
        parser.parse(new ByteArrayInputStream(body.getBytes("UTF-8")), this);
    }
    // ... catch blocks omitted in the original post

Please help me. Thanks.

2 Answers

When my code sends an HTML body as the string, the SAX parser hangs for a very long time (more than 5 hours). The HTML page starts with a DOCTYPE declaration that references an external DTD, and the SAX parser was busy trying to load that external DTD. This behaviour is controlled by the feature "http://apache.org/xml/features/nonvalidating/load-external-dtd".

So I set this feature to false. Now the SAX parser throws an error if the input is not well-formed XML.

XMLReader reader = parser.getXMLReader();
reader.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
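
For anyone who wants to try this in isolation, here is a minimal sketch of the same idea. The class name `NoExternalDtdParser` and the method `parsesAsXml` are made up for illustration, and it assumes a Xerces-based parser (such as the one bundled with the JDK) that recognises this feature URI. It uses a plain `DefaultHandler` instead of the RSS handler from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original post.
public class NoExternalDtdParser {
    // Xerces-specific feature; other parser implementations may not recognise it.
    private static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    /** Returns true if body is well-formed XML, false if the parser rejects it. */
    public static boolean parsesAsXml(String body) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            XMLReader reader = parser.getXMLReader();
            // Do not fetch the DTD named in a DOCTYPE declaration.
            reader.setFeature(LOAD_EXTERNAL_DTD, false);
            parser.parse(
                    new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler());
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            // The parser in use does not support the feature at all.
            throw new IllegalStateException(e);
        } catch (SAXException e) {
            // Not well-formed, e.g. an HTML page with unclosed tags.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With the feature switched off, an HTML page with an unclosed tag such as `<br>` should be rejected quickly with a `SAXParseException` instead of waiting on the DTD download.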

Thanks to everybody for helping me.

answered 2013-04-04T07:54:31.627

// The body is expected to be XML, but parsing hangs when it is an HTML page.
SAXParserFactory factory = SAXParserFactory.newInstance();
if (!body.startsWith("<?xml")) {
    throw new NotXmlInputException(message); // your own exception type
}
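
A plain `startsWith("<?xml")` can be tripped up by a leading byte-order mark or stray whitespace, so here is a rough sketch of a slightly more tolerant pre-check. The class `XmlSniffer` and its methods are hypothetical names, and the HTML detection is only a heuristic that looks at the first 256 characters. Treat it as a cheap filter before parsing, not a replacement for the parser's own error reporting.

```java
import java.util.Locale;

// Hypothetical helper, not from the original answer.
public final class XmlSniffer {
    private XmlSniffer() {}

    /** Drops a leading byte-order mark (U+FEFF) and surrounding whitespace. */
    public static String stripBomAndWhitespace(String body) {
        String s = body.startsWith("\uFEFF") ? body.substring(1) : body;
        return s.trim();
    }

    /** True if the text begins with an XML declaration. */
    public static boolean hasXmlDeclaration(String body) {
        return body != null && stripBomAndWhitespace(body).startsWith("<?xml");
    }

    /** Heuristic: true if the first 256 characters contain an HTML doctype or root tag. */
    public static boolean looksLikeHtml(String body) {
        if (body == null) {
            return false;
        }
        String head = stripBomAndWhitespace(body);
        head = head.substring(0, Math.min(head.length(), 256)).toLowerCase(Locale.ROOT);
        return head.contains("<!doctype html") || head.contains("<html");
    }
}
```

A caller could reject the input when `looksLikeHtml(body)` is true and only then hand it to the SAX parser. Note that some feeds omit the XML declaration, so `hasXmlDeclaration` alone may be too strict.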

Or create a schema file for your XML and validate against it:

SchemaFactory constraintFactory =
        SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Source constraints = new StreamSource(/* your schema */);
Schema schema = constraintFactory.newSchema(constraints);
Validator validator = schema.newValidator();

try {
    validator.validate(/* your string converted to a Source */);
} catch (org.xml.sax.SAXException e) {
    log("Validation error: " + e.getMessage());
}
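
To make the schema approach concrete, here is a self-contained sketch that fills in the two placeholders above. The tiny inline schema, the class `FeedSchemaCheck`, and the method `matchesSchema` are all assumptions made for illustration; a real feed would need a proper schema. As far as I know, validation by itself does not stop the parser from fetching an external DTD, so this complements rather than replaces the load-external-dtd setting from the other answer.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper, not from the original answer.
public class FeedSchemaCheck {
    // Deliberately minimal schema: an <rss> root with one <channel> child.
    static final String RSS_XSD =
            "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
          + "  <xs:element name=\"rss\">"
          + "    <xs:complexType>"
          + "      <xs:sequence>"
          + "        <xs:element name=\"channel\" type=\"xs:anyType\"/>"
          + "      </xs:sequence>"
          + "      <xs:attribute name=\"version\" type=\"xs:string\"/>"
          + "    </xs:complexType>"
          + "  </xs:element>"
          + "</xs:schema>";

    /** Returns true if body validates against the inline schema. */
    public static boolean matchesSchema(String body) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(RSS_XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is invalid", e);
        }
        try {
            // The String is wrapped in a StreamSource via a StringReader.
            schema.newValidator().validate(new StreamSource(new StringReader(body)));
            return true;
        } catch (SAXException e) {
            // Either not well-formed or not matching the schema (e.g. an <html> root).
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An HTML page fails here either because it is not well-formed or because its root element is not declared in the schema.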

Or it may help to enable validation on the factory:

SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setValidating(true);
answered 2013-03-08T11:49:08.310