java - HTTP response 403 when a program attempts to initiate a connection to Google?

Question

I wrote a test web crawler class that attempts to search Google, as shown:

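import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.ProtocolException;
import java.net.URL;
import java.net.URLEncoder;

public class WebCrawler {
    String query;

    public WebCrawler(String search)
    {
        query = search;
    }

    public void connect()
    {
        HttpURLConnection connection = null;
        try
        {
            // Encode the query so spaces and special characters are valid in the URL.
            String url = "http://www.google.com/search?q=" + URLEncoder.encode(query, "UTF-8");
            URL search = new URL(url);

            connection = (HttpURLConnection)search.openConnection();
            connection.setRequestMethod("GET");
            connection.setDoOutput(true);
            connection.setDoInput(true);
            connection.setUseCaches(false);
            connection.setAllowUserInteraction(false);
            connection.connect();

            // getInputStream() throws an IOException when the server answers 403.
            BufferedReader read = new BufferedReader(new InputStreamReader(connection.getInputStream()));
            String line = null;
            while((line = read.readLine())!=null)
            {
                System.out.println(line);
            }

            read.close();
        }
        catch(MalformedURLException e)
        {
            e.printStackTrace();
        }
        catch(ProtocolException e)
        {
            e.printStackTrace();
        }
        catch(IOException e)
        {
            e.printStackTrace();
        }
        finally
        {
            // connection is still null if openConnection() failed.
            if (connection != null)
            {
                connection.disconnect();
            }
        }
    }
}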

However, when I try to run it with the test query "test", I get an HTTP response 403 error. What am I missing? This is my first time doing any networking with Java.

score 1 · Accepted Answer

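403 == Forbidden, which makes sense because you're a robot trying to access a part of Google that they don't want robots accessing. Google's robots.txt pretty clearly specifies that you shouldn't be scraping /search.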
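To make the robots.txt point concrete, here is a minimal sketch of how a crawler might check a path before requesting it. It is a deliberate simplification, not a full parser: it only reads `Disallow` lines that directly follow a `User-agent: *` line, uses plain prefix matching, and ignores `Allow` rules and wildcards. The class name `RobotsCheck`, its methods, and the sample text in `main` are hypothetical and only modeled on the rule described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Simplified robots.txt check: "User-agent: *" groups and Disallow prefixes only. */
final class RobotsCheck {

    /** Collects the Disallow prefixes listed under "User-agent: *". */
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            // Drop trailing comments and surrounding whitespace.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "field: value" line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Simplification: the most recent User-agent line decides the group.
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                prefixes.add(value);
            }
        }
        return prefixes;
    }

    /** Returns true if the path starts with any Disallow prefix that applies to all agents. */
    static boolean isDisallowed(String robotsTxt, String path) {
        for (String prefix : disallowedPrefixes(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical sample text, not a copy of any real robots.txt file.
        String sample = "User-agent: *\nDisallow: /search\n";
        System.out.println(isDisallowed(sample, "/search?q=test")); // prints "true"
        System.out.println(isDisallowed(sample, "/maps"));          // prints "false"
    }
}
```

In a real crawler the text would come from fetching `/robots.txt` on the target host first, and a complete parser would also need to honor `Allow` rules, which this sketch skips.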

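Google provides a search API which allows 100 queries per day. They provide libraries and examples for how to interface with it in most languages, including Java. Beyond that quota, you have to pay.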
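As a rough illustration of what calling such an API looks like, the sketch below only builds the request URL. It assumes the API in question is Google's Custom Search JSON API, with the endpoint `https://www.googleapis.com/customsearch/v1` and the `key`, `cx`, and `q` query parameters; check these against the current documentation before relying on them. The class name `SearchApiUrl` and the placeholder credentials are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Builds a request URL for a search API (assumed: Custom Search JSON API). */
final class SearchApiUrl {

    // Assumed endpoint; verify against the current API documentation.
    static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    /** URL-encodes a single query parameter value as UTF-8. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java runtime.
            throw new IllegalStateException(e);
        }
    }

    /** Assembles the URL from an API key, a search engine ID, and the query text. */
    static String build(String apiKey, String engineId, String query) {
        return ENDPOINT
                + "?key=" + encode(apiKey)
                + "&cx=" + encode(engineId)
                + "&q=" + encode(query);
    }

    public static void main(String[] args) {
        // Placeholder credentials; real values come from your own API account.
        System.out.println(build("YOUR_API_KEY", "YOUR_ENGINE_ID", "web crawler"));
        // prints "https://www.googleapis.com/customsearch/v1?key=YOUR_API_KEY&cx=YOUR_ENGINE_ID&q=web+crawler"
    }
}
```

The resulting URL can be opened with the same `HttpURLConnection` code as in the question; the response is JSON rather than HTML, so it needs a JSON parser instead of line-by-line printing.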
