c# - Scraping a website using Html Agility Pack. Response from GET not as expected

Question

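Using System.Net.HttpWebRequest, I want my code to mimic a user's search on the following search engine.

http://www.scirus.com

An example search URL is:

http://www.scirus.com/srsapp/search?q=core+facilities&t=all&sort=0&g=s

I have the following code to perform the HTTP GET. Note that I am using HtmlAgilityPack.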

// Requires: using System; using System.IO; using System.Net; using HtmlAgilityPack;
protected override HtmlDocument MakeRequestHtml(string requestUrl)
{
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(requestUrl);
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";

        // Dispose the response and its stream once the document has been parsed.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        {
            HtmlDocument htmlDoc = new HtmlDocument();
            htmlDoc.Load(stream);
            return htmlDoc;
        }
    }
    catch (Exception e)
    {
        // Report the failure and wait for a key press before returning.
        Console.WriteLine(e.Message);
        Console.Read();
        return null;
    }
}

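Here "requestUrl" is the example search URL shown above.

The content of htmlDoc.DocumentNode.InnerHtml does not contain the search results, and it looks nothing like the results page I get when I paste the example search URL above into a browser.

My guess is that you need a session before the request will work. Can anyone suggest a workable way to replicate the behavior of a user agent? Or is there perhaps a better way, one I am not aware of, to achieve the goal of "scraping" the search results? Please advise.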
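
One way to test the session hypothesis is to share a single CookieContainer across requests, so that cookies set by a first response are sent back with the search request. The following is a minimal sketch under that assumption; the class name `SessionRequests` and its `Create` helper are hypothetical, and whether this site actually relies on cookies is not confirmed by anything in this thread.

```csharp
using System;
using System.Net;

public static class SessionRequests
{
    // One container shared by every request built here, so cookies stored
    // from an earlier response are sent back on later requests.
    private static readonly CookieContainer Cookies = new CookieContainer();

    // Hypothetical helper: builds a GET request that takes part in the shared session.
    public static HttpWebRequest Create(string url, string userAgent)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = userAgent;
        request.CookieContainer = Cookies;
        return request;
    }
}
```

Calling `SessionRequests.Create` for the home page, reading and disposing its response, and then calling it again for the search URL would send back any cookies the first response set.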

Robots.txt contents:

# / robots.txt file for http://www.scirus.com

User-agent: NetMechanic
Disallow: /srsapp/sciruslink

User-agent: *
Disallow: /srsapp/sciruslink
Disallow: /srsapp/search
Disallow: /srsapp/search_simple
Disallow: /search_simple
# for dev and accept server uncomment below line at Build time to disallow robots completely
##Disallow: /

Contents of htmlDoc.DocumentNode.InnerHtml

score 1 · Accepted Answer

You may need to set a user agent, for example:

request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";

You should also check the site's Robots.txt file to make sure you are welcome.
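
The robots.txt check can be done in code before any request is sent. Below is a simplified sketch, not a full Robots Exclusion Protocol parser: the class `RobotsCheck` and method `IsDisallowed` are hypothetical names, and the logic only does prefix matching on Disallow lines. It ignores Allow lines and wildcards, and it treats the `*` group as applying to every agent, which errs on the cautious side.

```csharp
using System;

public static class RobotsCheck
{
    // Returns true when `path` starts with a Disallow prefix in a group that
    // applies to `userAgent` (a case-insensitive name match, or "*").
    public static bool IsDisallowed(string robotsTxt, string userAgent, string path)
    {
        bool groupApplies = false;
        bool lastWasAgent = false;

        foreach (string raw in robotsTxt.Split('\n'))
        {
            // Strip comments and surrounding whitespace (including '\r').
            string line = raw;
            int hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash);
            line = line.Trim();
            if (line.Length == 0) continue;

            int colon = line.IndexOf(':');
            if (colon < 0) continue;
            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();

            if (field == "user-agent")
            {
                bool match = value == "*" ||
                    string.Equals(value, userAgent, StringComparison.OrdinalIgnoreCase);
                // Consecutive User-agent lines belong to the same group.
                groupApplies = lastWasAgent ? (groupApplies || match) : match;
                lastWasAgent = true;
            }
            else
            {
                lastWasAgent = false;
                if (field == "disallow" && groupApplies && value.Length > 0 &&
                    path.StartsWith(value, StringComparison.Ordinal))
                {
                    return true;
                }
            }
        }
        return false;
    }
}
```

For the robots.txt quoted in the question, passing the path `/srsapp/search` returns true for any agent name, because the `User-agent: *` group disallows it. The path of a full URL can be obtained with `new Uri(url).AbsolutePath`.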

score 1 · Accepted Answer

OK, I actually tested this with WebClient:

// Requires: using System.IO; using System.Net;
static void Main(string[] args)
{
    using (WebClient client = new WebClient())
    {
        client.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0");
        string str = client.DownloadString("http://www.scirus.com/srsapp/search?q=core+facilities&t=all&sort=0&g=s");

        // Write the downloaded page to test.txt as ASCII bytes.
        byte[] bit = new System.Text.ASCIIEncoding().GetBytes(str);
        using (FileStream fil = File.OpenWrite("test.txt"))
        {
            fil.Write(bit, 0, bit.Length);
        }
    }
}

Here is the downloaded file: http://pastebin.com/qswtgC4n

score -1 · Accepted Answer

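Make sure you are not hitting the server too often, especially if the code that loads the document used to work. You may have run into a server rule that sends you to robots.txt or a similar page.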
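
Spacing requests out can be sketched with a small throttle that enforces a minimum interval between calls. The class `RequestThrottle` is a hypothetical name, the interval is an assumption to be tuned for the site, and the sketch is not thread-safe.

```csharp
using System;
using System.Threading;

public sealed class RequestThrottle
{
    private readonly TimeSpan _minInterval;
    private DateTime _lastRequestUtc = DateTime.MinValue;

    public RequestThrottle(TimeSpan minInterval)
    {
        _minInterval = minInterval;
    }

    // How long the caller should still wait at `nowUtc` before the next request.
    public TimeSpan DelayNeeded(DateTime nowUtc)
    {
        TimeSpan elapsed = nowUtc - _lastRequestUtc;
        return elapsed >= _minInterval ? TimeSpan.Zero : _minInterval - elapsed;
    }

    // Blocks until the minimum interval has passed, then records the request time.
    public void Wait()
    {
        TimeSpan delay = DelayNeeded(DateTime.UtcNow);
        if (delay > TimeSpan.Zero) Thread.Sleep(delay);
        _lastRequestUtc = DateTime.UtcNow;
    }
}
```

Calling `Wait()` immediately before each `GetResponse()` keeps consecutive requests at least the chosen interval apart. To spot a server rule that sends the request elsewhere, comparing `response.ResponseUri` with the requested URL shows whether a redirect took place.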
