java - Extending a basic web crawler to filter status codes and HTML

Question

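I followed a tutorial on writing a basic web crawler in Java and now have something with basic functionality.

At the moment it just retrieves the HTML from a site and prints it to the console. I would like to extend it so that it can pick out specifics such as the HTML page title and the HTTP status code.

I found this library: http://htmlparser.sourceforge.net/ ... which I think could do the job for me, but can I do it without using an external library?

Here is what I have so far: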

public static void main(String[] args) {

    // String representing the URL
    String input = "";

    // Check if argument added at command line
    if (args.length >= 1) {
        input = args[0];
    }

    // If no argument at command line use default
    else {
        input = "http://www.my_site.com/";
        System.out.println("\nNo argument entered so default of " + input
                + " used: \n");
    }
    // input test URL and read from file input stream
    try {

        // Declared locally so the snippet compiles on its own
        URL testURL = new URL(input);
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                testURL.openStream()));

        // String variable to hold the returned content
        String line = "";

        // print content to console until no new lines of content
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    } catch (Exception e) {

        e.printStackTrace();
        System.out.println("Exception thrown");
    }
}

score 1 · Accepted Answer

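There are definitely tools out there for HTTP communication. However, if you prefer to implement this yourself, look into java.net.HttpURLConnection. It will give you more fine-grained control over the HTTP communication. Here's a small sample for you: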

public static void main(String[] args) throws IOException
{
  URL url = new URL("http://www.google.com");
  HttpURLConnection connection = (HttpURLConnection) url.openConnection();

  connection.setRequestMethod("GET");

  String resp = getResponseBody(connection);

  System.out.println("RESPONSE CODE: " + connection.getResponseCode());
  System.out.println(resp);
}

private static String getResponseBody(HttpURLConnection connection)
    throws IOException
{
  // try-with-resources closes the reader even if reading fails part-way
  try (BufferedReader reader = new BufferedReader(new InputStreamReader(
      connection.getInputStream())))
  {
    StringBuilder responseBody = new StringBuilder();
    String line;

    while ((line = reader.readLine()) != null)
    {
      responseBody.append(line).append('\n');
    }

    return responseBody.toString();
  }
  catch (IOException e)
  {
    // getInputStream() throws for error statuses such as 404;
    // getResponseCode() can still be called afterwards
    e.printStackTrace();
    return "";
  }
}
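
The sample above covers the status code. For the page title that the question also asks about, one option that avoids an external library is to run a regular expression over the response body. The following is only a rough sketch under that assumption: the class name `TitleExtractor` and the method `extractTitle` are made up for illustration, and a regex is a heuristic rather than a real HTML parser, so it may miss or misread unusual markup.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
public class TitleExtractor {

    // Matches the first <title>...</title> pair, case-insensitively,
    // with '.' allowed to span line breaks.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        if (m.find()) {
            // Trim the ends and collapse runs of whitespace inside the title
            return Optional.of(m.group(1).trim().replaceAll("\\s+", " "));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by getResponseBody(connection)
        String html = "<html><head><TITLE>\n  My Site \n</TITLE></head><body></body></html>";
        System.out.println(extractTitle(html).orElse("(no title found)"));
    }
}
```

In the crawler you would pass the string returned by getResponseBody(connection) to extractTitle instead of the hard-coded HTML used here.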
