java - TagSoup fails to parse an HTML document from a StringReader (Java)

Question

I have this function:

private Node getDOM(String str) throws SearchEngineException {

    DOMResult result = new DOMResult();

    try {
        XMLReader reader = new Parser();
        reader.setFeature(Parser.namespacesFeature, false);
        reader.setFeature(Parser.namespacePrefixesFeature, false);
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new SAXSource(reader, new InputSource(new StringReader(str))), result);
    } catch (Exception ex) {
        throw new SearchEngineException("NukatSearchEngine.getDom: " + ex.getMessage());
    }

    return result.getNode();
}

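It takes a string containing the HTML document that the HTTP server sends back after a POST request, but it fails to parse it correctly: I only get four nodes from the whole document. The string itself looks fine; if I print it out and copy it into a text document, I see the page I expect.

When I use an overloaded version of the method above: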

private Node getDOM(URL url) throws SearchEngineException {

    DOMResult result = new DOMResult();

    try {
        XMLReader reader = new Parser();
        reader.setFeature(Parser.namespacesFeature, false);
        reader.setFeature(Parser.namespacePrefixesFeature, false);
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new SAXSource(reader, new InputSource(url.openStream())), result);
    } catch (Exception ex) {
        throw new SearchEngineException("NukatSearchEngine.getDom: " + ex.getMessage());
    }

    return result.getNode();
}

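then everything works fine and I get a proper DOM tree, but I need to somehow retrieve the POST response from the server.

Storing the string in a file and reading it back doesn't work either; I still get the same result.

What could be the problem?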

score 1 · Accepted Answer

Could there be a problem with the XML encoding?

Answered 2010-03-03T22:23:55.810
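
To make the encoding suspicion concrete, here is a minimal sketch of how text can be damaged when response bytes are decoded with the wrong charset before they ever reach the parser. The class name CharsetMismatchDemo, the sample text, and the choice of UTF-8 versus ISO-8859-1 are all assumptions for illustration; nothing in the question confirms which charsets are involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original code.
class CharsetMismatchDemo {

    // Decodes raw bytes into a String using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Sample markup with non-ASCII letters, written with Unicode escapes
        // so the source file's own encoding does not matter.
        String original = "<p>Za\u017C\u00F3\u0142\u0107</p>";

        // Assumption: the server sends the page as UTF-8 bytes.
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        String right = decode(wire, StandardCharsets.UTF_8);
        String wrong = decode(wire, StandardCharsets.ISO_8859_1);

        System.out.println("decoded as UTF-8 matches: " + right.equals(original));
        System.out.println("decoded as ISO-8859-1 matches: " + wrong.equals(original));
    }
}
```

If the String passed to getDOM(String) was built this way with a mismatched charset, the parser has no chance to recover the original bytes, which is one reason to prefer handing it the byte stream.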

score 1 · Accepted Answer

This looks like an encoding problem. In your example that doesn't work, you wrap the HTML in a StringReader, so the parser only ever sees characters that were already decoded when the String was built, and TagSoup has trouble parsing the HTML. In the example that works, you pass the byte stream to the InputSource constructor. The difference is that when you pass in a byte stream, the SAX implementation can work out the encoding from the stream itself.

If you want to test this you could try these steps:

1. Stream the HTML you're parsing through a java.io.InputStreamReader and call getEncoding on it to see which encoding it reports.
2. In your first example code, call setEncoding on the InputSource, passing in the encoding that the InputStreamReader reported. Note that setEncoding only takes effect when the InputSource wraps a byte stream, not a character stream such as a StringReader.
3. See whether the first example, changed to set the encoding explicitly, parses the HTML correctly.
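
As a rough illustration of the steps above, the following sketch builds an InputSource from raw bytes and sets the encoding explicitly. According to the InputSource Javadoc, setEncoding has no effect when the source wraps a character stream, so the sketch uses a byte stream. The names EncodingAwareSource and fromBytes are hypothetical, and the charset is an assumption: use whatever the server's Content-Type header actually declares.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of TagSoup or the original code.
class EncodingAwareSource {

    // Wraps raw response bytes in an InputSource and states their encoding explicitly.
    static InputSource fromBytes(byte[] html, Charset charset) {
        InputSource source = new InputSource(new ByteArrayInputStream(html));
        source.setEncoding(charset.name());
        return source;
    }

    public static void main(String[] args) {
        // Assumption for the demo: the server sent ISO-8859-1 bytes.
        byte[] html = "<html><body><p>caf\u00E9</p></body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);

        InputSource source = fromBytes(html, StandardCharsets.ISO_8859_1);
        System.out.println("Declared encoding: " + source.getEncoding());

        // The source could then be handed to the reader from the question:
        // transformer.transform(new SAXSource(reader, source), result);
    }
}
```

Keeping the response as bytes until this point means the charset decision is made once, in one visible place, instead of implicitly when a String is created.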

There's a discussion of this toward the end of an article on using the SAX InputSource.

score 0 · Accepted Answer

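To get the POST response, you first need to perform a POST request. new InputSource(url.openStream()) probably opens a connection and reads the response to a GET request. Take a look at sending a POST request using a URL.

It may also be worth looking at other ways to perform a POST request and read the response.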
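
One possible shape for this, using only java.net.HttpURLConnection from the JDK, is sketched below. It sends a form-encoded POST and wraps the response byte stream in an InputSource, so the parser receives bytes rather than a pre-decoded String. The class name PostResponseSource, its methods, and the form parameters are hypothetical, and it assumes the server accepts application/x-www-form-urlencoded bodies.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original code.
class PostResponseSource {

    // Builds an application/x-www-form-urlencoded body from key/value pairs.
    static String formEncode(Map<String, String> params) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // Sends the POST and returns the response as a byte-stream InputSource.
    // The caller should close source.getByteStream() once parsing is done.
    static InputSource post(URL url, Map<String, String> params) throws IOException {
        byte[] body = formEncode(params).getBytes(StandardCharsets.UTF_8);

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type",
                "application/x-www-form-urlencoded; charset=UTF-8");

        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        return new InputSource(conn.getInputStream());
    }

    public static void main(String[] args) {
        // Only the body encoding is shown here; no network call is made.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "tag soup");
        System.out.println(formEncode(params));
    }
}
```

The InputSource returned by post could be passed to the same transformer.transform(new SAXSource(reader, source), result) call that the URL-based overload in the question uses.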
