java - How do I retrieve the HTML of a search engine's query results?

Question

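I am trying to use Java to retrieve the HTML of a Google search results page. That is, if I search Google.com for a particular phrase, I want to retrieve the HTML of the resulting page (the page containing links to possible matches along with their descriptions, URLs, and so on).

I tried to do this with the following code, which I found in a related post: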

import java.io.*;
import java.net.*;
import java.util.*;

public class Main {

    public static void main (String args[]) {

        URL url;
        InputStream is = null;
        DataInputStream dis;
        String line;

        try {
            url = new URL("https://www.google.com/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951");
            is = url.openStream();  // throws an IOException
            dis = new DataInputStream(new BufferedInputStream(is));

            while ((line = dis.readLine()) != null) {
                System.out.println(line);
            }
        } catch (MalformedURLException mue) {
             mue.printStackTrace();
        } catch (IOException ioe) {
             ioe.printStackTrace();
        } finally {
            try {
                if (is != null) {  // openStream() may have failed, leaving is null
                    is.close();
                }
            } catch (IOException ioe ) {
                // nothing to see here
            }
        }
    }
}

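From: How do you programmatically download a webpage in Java

The URL used in this code was obtained by running a Google search query from the Google homepage. For some reason I don't understand, if I instead type the phrase I want to search for into my web browser's URL bar and then use the URL of the resulting search results page in the code, I get a 403 error.

However, this code did not return the HTML of the search results page. Instead, it returned the source code of the Google homepage.

After further research, I noticed that if you view the source of a Google search results page (by right-clicking the background of the results page and selecting "View page source") and compare it with the source of the Google homepage, the two are identical.

If, instead of viewing the source of the search results page, I save the page's HTML (by pressing Ctrl+S), I get the HTML I am looking for.

Is there a way to retrieve the HTML of the search results page using Java?

Thanks!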

score 2 · Accepted Answer

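Rather than parsing the HTML page generated by a standard Google search, consider the official Custom Search API, which returns results from Google in a more usable format. The API is definitely the way to go; otherwise, your code could simply break whenever Google changes something in the google.com front-end HTML. The API is designed for developers, and your code will be far less fragile.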
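
As a rough illustration of the API route, the sketch below only builds a request URL for the Custom Search JSON API; it does not send anything. The endpoint and the `key`, `cx`, and `q` parameters are what I understand the API to expect, while the class name, the method name, and the placeholder credentials are made up for this example. You would need your own API key and search engine ID, and `URLEncoder.encode(String, Charset)` assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds (but does not send) a Custom Search JSON API request URL.
class CustomSearchUrl {
    static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    // apiKey and engineId are credentials you obtain from Google; query is the search phrase.
    static String buildUrl(String apiKey, String engineId, String query) {
        return ENDPOINT
                + "?key=" + encode(apiKey)
                + "&cx=" + encode(engineId)
                + "&q=" + encode(query);
    }

    // Form-encodes a single parameter value (spaces become '+').
    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Placeholder credentials; replace them with real values before sending the request.
        System.out.println(buildUrl("YOUR_API_KEY", "YOUR_ENGINE_ID", "UCF football"));
    }
}
```

The response to such a request is JSON rather than HTML, so it can be handed to a JSON parser instead of being scraped.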

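To answer your question, though: we can't really help you with only the information you've provided. Your code appears to retrieve the HTML of Stack Overflow; it is an exact copy and paste of the code from the question you linked to. Have you tried changing the code? Which URL are you actually using to retrieve the Google search results?

I tried running your code with url = new URL("http://www.google.com/search?q=test"); and personally got an HTTP 403 Forbidden error. A quick search on the problem suggests this happens when the web request carries no User-Agent header, although that doesn't really help you if you are actually getting HTML back. You will have to provide more information if you want specific help, though switching to the Custom Search API will probably solve your problem.

Edit: new information has been added to the original question; now I can answer it directly!

After capturing the packets of the web request Java sends and doing some basic debugging, I found your problem... let's take a look!

Here is the web request Java sends for the example URL you provided: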

GET / HTTP/1.1
User-Agent: Java/1.6.0_30
Host: www.google.com
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive

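Notice that the request seems to ignore most of the URL, leaving only "GET /". That's strange. I had to look this one up.

According to the documentation for the Java URL class (and this is standard for all web pages): A URL may have appended to it a "fragment", also known as a "ref" or a "reference". The fragment is indicated by the sharp sign character "#" followed by more characters ... This fragment is not technically part of the URL.

Let's look at your example URL...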

https://www.google.com/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951

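Notice that "#" is the first character of the file path? Java never sends anything after the "#" to the server, because the fragment is used only by the client / web browser. That leaves you with the URL https://www.google.com/. Hey, at least it's working as intended!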
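
To see the split for yourself without any network traffic, here is a minimal sketch that parses a shortened version of the example URL with java.net.URI. The class and method names are invented for this illustration, and the URL is trimmed to a few parameters for readability.

```java
import java.net.URI;

// Hypothetical demo: shows which parts of a URL belong to the request and which to the fragment.
class FragmentDemo {
    // Rebuilds the URL from scheme, authority, path and query only, dropping the fragment.
    static String withoutFragment(String url) {
        URI uri = URI.create(url);
        String query = uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery();
        return uri.getScheme() + "://" + uri.getRawAuthority() + uri.getRawPath() + query;
    }

    public static void main(String[] args) {
        // Shortened form of the URL from the question
        String url = "https://www.google.com/#hl=en&output=search&q=UCF";
        URI uri = URI.create(url);
        System.out.println("path:      " + uri.getPath());
        System.out.println("query:     " + uri.getQuery());
        System.out.println("fragment:  " + uri.getFragment());
        System.out.println("requested: " + withoutFragment(url));
    }
}
```

Everything that looks like a query string in that URL ends up in the fragment, so the path is just "/" and the query is null, which matches the "GET /" seen in the capture.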

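I can't tell you exactly what Google is doing, but a URL with a sharp sign almost certainly means Google is returning the query results through some client-side (AJAX / JavaScript) script. I'd be willing to bet that any query you send directly to the server (that is, without the "#" sign) returns a 403 Forbidden error unless it carries the right headers. It looks like they are encouraging you to use the API :)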
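
If you only have the browser-style URL, one possible workaround is to pull the search term out of the fragment and rebuild a URL that carries it in a real query string. The sketch below is just that, a sketch: the helper name is made up, it assumes the term lives in a `q` parameter and that the server accepts a /search?q= path (as in the test URL mentioned earlier in this answer), and it relies on the Charset overloads of URLDecoder and URLEncoder from Java 10 or later. The resulting request would presumably still need suitable headers.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: moves the "q" parameter from the fragment into a real query string.
class FragmentToSearchUrl {
    static String toSearchUrl(String browserUrl) {
        URI uri = URI.create(browserUrl);
        String fragment = uri.getRawFragment();
        if (fragment == null) {
            throw new IllegalArgumentException("URL has no fragment: " + browserUrl);
        }
        for (String pair : fragment.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("q")) {
                // Decode first so the term is not double-encoded when it is re-encoded below.
                String term = URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8);
                return "https://" + uri.getHost() + "/search?q="
                        + URLEncoder.encode(term, StandardCharsets.UTF_8);
            }
        }
        throw new IllegalArgumentException("No q parameter in fragment: " + fragment);
    }

    public static void main(String[] args) {
        System.out.println(toSearchUrl("https://www.google.com/#hl=en&output=search&q=UCF&oq=UCF"));
    }
}
```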

Edit 2: Based on Tengji Zhang's answer to this question, here is working code that returns the results of a Google query for "test":

    // Requires: import java.io.*; import java.net.*;
    try {
        URL url = new URL("https://www.google.com/search?q=test");
        URLConnection c = url.openConnection();
        // Without a browser-like User-Agent header the server answers 403 Forbidden
        c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
        c.connect();
        // try-with-resources closes the stream even if reading fails;
        // UTF-8 is an assumption about the response encoding
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(c.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    } catch (MalformedURLException mue) {
        mue.printStackTrace();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }

score 1 · Accepted Answer

I suggest you try http://seleniumhq.org/

There is a good tutorial that walks through a Google search:

http://code.google.com/p/selenium/wiki/GettingStarted

score -1 · Accepted Answer

You did not set a User-Agent in your code.

URLConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");

Alternatively, you can read "http://www.google.com/robots.txt". That file tells you which URLs the Google server allows.

The following code succeeds.

package org.test.stackoverflow;

import java.io.*;
import java.net.*;
import java.util.*;

public class SearcherRetriver {
    public static void main (String args[]) {

        URL url;
        InputStream is = null;
        DataInputStream dis;
        String line;
        URLConnection c;

        try {
            url = new URL("https://www.google.com.hk/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951");
            c = url.openConnection();
            c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
            c.connect();
            is = c.getInputStream();
            dis = new DataInputStream(new BufferedInputStream(is));
            while ((line = dis.readLine()) != null) {
                System.out.println(line);
            }
        } catch (MalformedURLException mue) {
             mue.printStackTrace();
        } catch (IOException ioe) {
             ioe.printStackTrace();
        } finally {
            try {
                if (is != null) {  // getInputStream() may have failed, leaving is null
                    is.close();
                }
            } catch (IOException ioe ) {
                // nothing to see here
            }
        }
    }
}
