android - What is the fastest way to scrape an HTML webpage in Android?

Question

I need to extract information from an unstructured web page in Android. The information I want is embedded in a table that has no id.

<table> 
<tr><td>Description</td><td></td><td>I want this field next to the description cell</td></tr> 
</table>

Should I use

pattern matching?
a BufferedReader to extract the information?

Or is there a faster way to get that information?

score 47 · Accepted Answer

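I think in this case it makes no sense to look for a fast way to extract the information, because there is virtually no performance difference between the methods already suggested in the answers once you compare it with the time it takes to download the HTML.

So assuming that by fastest you mean the most convenient, readable and maintainable code, I suggest you use a DocumentBuilder to parse the relevant HTML and extract the data with XPathExpressions: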

Document doc = DocumentBuilderFactory.newInstance()
  .newDocumentBuilder().parse(new InputSource(new StringReader(html)));

XPathExpression xpath = XPathFactory.newInstance()
  .newXPath().compile("//td[text()=\"Description\"]/following-sibling::td[2]");

String result = (String) xpath.evaluate(doc, XPathConstants.STRING);

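If you happen to retrieve invalid HTML, I recommend isolating the relevant portion (e.g. using substring(indexOf("<table")..) and, if necessary, correcting the remaining HTML errors with String operations before parsing. If this gets too complex, however (i.e. very bad HTML), just go with the hacky pattern matching approach suggested in the other answers.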
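To illustrate the isolate-then-parse idea, here is a minimal sketch. The class name TableFragmentExample, the helpers isolateTable and descriptionValue, and the sample page are made up for illustration, and it assumes the table itself is well-formed XML even though the rest of the page is not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class TableFragmentExample {

    // Hypothetical helper: cuts the first <table>...</table> out of a larger page.
    static String isolateTable(String html) {
        int start = html.indexOf("<table");
        int end = html.indexOf("</table>", start);
        if (start < 0 || end < 0) {
            throw new IllegalArgumentException("no complete <table> element found");
        }
        return html.substring(start, end + "</table>".length());
    }

    // Parses the fragment as XML and returns the second cell after "Description".
    static String descriptionValue(String tableXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(tableXml)));
            return XPathFactory.newInstance().newXPath()
                .evaluate("//td[text()='Description']/following-sibling::td[2]", doc);
        } catch (Exception e) {
            // Assumption: callers treat unparseable markup as a programming error.
            throw new IllegalStateException("could not parse the table markup", e);
        }
    }

    public static void main(String[] args) {
        // The unclosed <p> and <br> would trip an XML parser; the table alone is clean.
        String page = "<html><body><p>intro<br>"
            + "<table><tr><td>Description</td><td></td><td>wanted value</td></tr></table>"
            + "</body></html>";
        System.out.println(descriptionValue(isolateTable(page)));
    }
}
```

If the table fragment itself is still not well-formed, the parse step throws, which is the point where further String clean-up or the pattern matching fallback would come in.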

XPath is available since API level 8 (Android 2.2). If you develop for lower API levels, you can use DOM methods and conditionals to navigate to the node you want to extract.
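For the DOM-only route, a rough sketch might look like the following. It uses no XPath at all, only basic DOM navigation. It assumes the label cell and the value cell sit in the same row and that each cell holds plain text; DomWalkExample and its helper methods are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomWalkExample {

    // Returns the text of the second <td> after the "Description" cell, or null.
    static String descriptionValue(String tableXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(tableXml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the table markup", e);
        }
        NodeList cells = doc.getElementsByTagName("td");
        for (int i = 0; i < cells.getLength(); i++) {
            Node cell = cells.item(i);
            if ("Description".equals(textOf(cell))) {
                return textOf(nthNextCell(cell, 2));
            }
        }
        return null;
    }

    // Walks right through the siblings, counting only <td> elements.
    static Node nthNextCell(Node cell, int n) {
        Node current = cell;
        while (current != null && n > 0) {
            current = current.getNextSibling();
            if (current != null
                    && current.getNodeType() == Node.ELEMENT_NODE
                    && "td".equals(current.getNodeName())) {
                n--;
            }
        }
        return current; // null when the row has too few cells
    }

    // Text of a cell whose first child is a text node; "" for an empty cell.
    static String textOf(Node cell) {
        if (cell == null) {
            return null;
        }
        Node first = cell.getFirstChild();
        return first == null ? "" : first.getNodeValue();
    }

    public static void main(String[] args) {
        String table = "<table><tr><td>Description</td><td></td>"
            + "<td>wanted value</td></tr></table>";
        System.out.println(descriptionValue(table));
    }
}
```

Walking the siblings instead of indexing into the flat list of all cells keeps the lookup inside one row, so a short row yields null rather than a value from the next row.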

score 19 · Accepted Answer

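The fastest way is to parse the specific information yourself. You seem to know the HTML structure precisely beforehand. The BufferedReader, String and StringBuilder methods should suffice. Here is a kickoff example that displays the first paragraph of your own question: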

public static void main(String... args) throws Exception {
    URL url = new URL("http://stackoverflow.com/questions/2971155");
    BufferedReader reader = null;
    StringBuilder builder = new StringBuilder();
    try {
        reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
        for (String line; (line = reader.readLine()) != null;) {
            builder.append(line.trim());
        }
    } finally {
        if (reader != null) try { reader.close(); } catch (IOException logOrIgnore) {}
    }

    String start = "<div class=\"post-text\"><p>";
    String end = "</p>";
    String part = builder.substring(builder.indexOf(start) + start.length());
    String question = part.substring(0, part.indexOf(end));
    System.out.println(question);
}

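Parsing is faster than pattern matching in practically all cases. Pattern matching is easier, but there is a certain risk that it yields unexpected results, especially when using complex regex patterns.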
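For comparison, the pattern matching route for the table in the question could be sketched roughly as below. The class name RegexCellExample and the pattern itself are illustrative assumptions, written for cells that are plain <td> tags without attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexCellExample {

    // The "Description" cell, then any one cell, then capture the next cell's content.
    private static final Pattern DESCRIPTION_ROW = Pattern.compile(
        "<td>\\s*Description\\s*</td>\\s*<td>.*?</td>\\s*<td>(.*?)</td>",
        Pattern.DOTALL);

    // Returns the captured cell text, or null when the pattern does not match.
    static String descriptionValue(String html) {
        Matcher matcher = DESCRIPTION_ROW.matcher(html);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<table> \n"
            + "<tr><td>Description</td><td></td>"
            + "<td>I want this field next to the description cell</td></tr> \n"
            + "</table>";
        System.out.println(descriptionValue(html));
    }
}
```

The fragility shows up quickly: a cell written as <td class="x"> no longer matches and the method returns null, and because .*? may grow across tags, a row with an unexpected layout can match in a place you did not intend.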

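You can also consider using a more flexible third-party HTML parser instead of writing one yourself. It will not be as fast as parsing by hand with information known beforehand, but it will be more concise and flexible. With a decent HTML parser the difference in speed is negligible. I strongly recommend Jsoup for this. It supports jQuery-like CSS selectors. Extracting the first paragraph of your question is then as easy as: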

public static void main(String... args) throws Exception {
    Document document = Jsoup.connect("http://stackoverflow.com/questions/2971155").get();
    String question = document.select("#question .post-text p").first().text();
    System.out.println(question);
}

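It is unclear which web page you are talking about, so I can't give a more detailed example of how to select the specific information from that page with Jsoup. If you still can't figure it out on your own using Jsoup and CSS selectors, feel free to post the URL in a comment and I'll suggest how to do it.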

score 2 · Accepted Answer

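When you scrape an HTML web page, there are two things you can do. The first is to use regular expressions. The other is to use an HTML parser.

Not everyone prefers regular expressions, because they can cause logic errors at run time.

Using an HTML parser is more complicated. You cannot be sure that the correct output will come out. In my experience it has also caused some runtime exceptions.

So it is better to convert the URL response into an XML file. XML parsing is then very simple and effective.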

score 1 · Accepted Answer

Why don't you just write

int start=data.indexOf("Description");

and then take the required substring from there?

answered 2010-06-15T23:42:34.497
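Spelled out for the table in the question, that idea might look like the sketch below. The class IndexOfCellExample and the method descriptionValue are hypothetical, and the code assumes the wanted value is the second cell after the one containing "Description".

```java
class IndexOfCellExample {

    private static final String CLOSE = "</td>";

    // Returns the text of the second cell after "Description", or null if not found.
    static String descriptionValue(String data) {
        int label = data.indexOf("Description");
        if (label < 0) return null;

        // End of the label cell, then end of the cell in between.
        int labelEnd = data.indexOf(CLOSE, label);
        if (labelEnd < 0) return null;
        int middleEnd = data.indexOf(CLOSE, labelEnd + CLOSE.length());
        if (middleEnd < 0) return null;

        // Opening tag of the wanted cell; searching for '>' tolerates attributes.
        int open = data.indexOf("<td", middleEnd);
        if (open < 0) return null;
        int tagEnd = data.indexOf('>', open);
        if (tagEnd < 0) return null;

        int end = data.indexOf(CLOSE, tagEnd);
        return end < 0 ? null : data.substring(tagEnd + 1, end).trim();
    }

    public static void main(String[] args) {
        String data = "<table> <tr><td>Description</td><td></td>"
            + "<td>I want this field next to the description cell</td></tr> </table>";
        System.out.println(descriptionValue(data));
    }
}
```

Every lookup is checked for -1 before it is used, since an unchecked indexOf result is the usual way this kind of code ends in a StringIndexOutOfBoundsException.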

score 0 · Accepted Answer

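Why not create a script that does the scraping with cURL and Simple HTML DOM Parser, and then grab the values you need from that page? These tools work with PHP, but other tools exist for any language you need.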

score 0 · Accepted Answer

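One way to do this is to put the HTML into a String and then manually search and parse through that String. If you know the tags will come in a specific order, you should be able to crawl through it and find the data. This is kind of sloppy, however, so the question is: do you want it to work now, or to work well?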

int position = html.indexOf("<table>");  // html is the String holding the HTML code
int cellStart = html.indexOf("<td>", position) + 4;  // first character after the first <td>
String field = html.substring(cellStart, html.indexOf("</td>", cellStart));  // text of the first cell

Like I said... really sloppy. But if you only do this once and just need it to work, this might do the trick.
