c# - Scraping the correct web page content after jQuery runs, in C#

Question

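I have been using HtmlAgilityPack for a while, but the web resource I work with now has what looks like a jQuery step that the browser has to pass through. I expect the product page to load, but what actually loads (verified with both the WebBrowser control and WebClient.DownloadString) is a redirect asking the visitor to select a consultant and sign up with them.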

In other words, with Chrome's Inspect >> Elements tool I get:

<div data-v-1a7a6550="" class="product-extra-images">
  <img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/10174_1MainImage-White-9-14_1.jpg.100x100_q85_crop_upscale.jpg" width="50">
  <img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/10174_2Image2-White-9-14_1.jpg.100x100_q85_crop_upscale.jpg" width="50">

But WebBrowser and HtmlAgilityPack only get:

<div class="container content">
  <div class="alert alert-danger " role="alert">
    <button type="button" class="close" data-dismiss="alert">
      <span aria-hidden="true">&times;</span>
    </button>
    <h2 style="text-align: center; background: none; padding-bottom: 0;">It looks like you haven't selected a Consultant yet!</h2>
    <p style="text-align: center;"><span>...were you just wanting to browse or were you looking to shop and pick a Consultant to shop under?</span></p>
      <div class="text-center">
        <form action="/just-browsing/" method="POST" class="form-inline">
   ...

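After digging into the class definitions in the head, I found that the page does use jQuery to handle the proper load, and to handle events while the visitor browses the page (scrolling, resizing, hovering over images, selecting other images, and so on). Here is the header from the jQuery file: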

/*!
* jQuery JavaScript Library v2.1.4
* http://jquery.com/
*
* Includes Sizzle.js
* http://sizzlejs.com/
*
* Copyright 2005, 2014 jQuery Foundation, Inc. and other contributors
* Released under the MIT license
* http://jquery.org/license
*
* Date: 2015-04-28T16:01Z
*/

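I tried ScrapySharp, as described here: C# .NET: Scraping dynamic (JS) websites

But that ended up consuming all available memory and never produced anything.

I also tried this: htmlagilitypack and dynamic content issue. It loaded the incorrect redirect, as described above.

If needed, I can provide more of the source I am trying to extract from, including the full jQuery.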

score 1 · Accepted Answer

Use CaptureRedirect = false; to bypass the redirect page. This worked for the page you mentioned:

var web = new HtmlWeb();
web.CaptureRedirect = false;
web.BrowserTimeout = TimeSpan.FromSeconds(15);

Now keep polling until the text "Product Description" shows up on the page.

var doc = web.LoadFromBrowser(url, html =>
{
    return html.Contains("Product Description");
});

Recent versions of HtmlAgilityPack can run a browser in the background, so we don't really need a library like ScrapySharp to scrape dynamic content.
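Putting the two snippets together, a minimal sketch of the whole flow might look like the following. The url value is a placeholder, and the XPath is an assumption based on the product-extra-images class shown in the question. LoadFromBrowser relies on the WebBrowser control, so as far as I know this only applies to the .NET Framework build of HtmlAgilityPack on Windows; the [STAThread] attribute is a precaution for that control, not something the answer above requires.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    [STAThread] // precaution: the WebBrowser control generally expects an STA thread
    static void Main()
    {
        // Placeholder: substitute the real product page address.
        string url = "https://example.com/some-product/";

        var web = new HtmlWeb
        {
            CaptureRedirect = false,                  // do not follow the consultant redirect
            BrowserTimeout = TimeSpan.FromSeconds(15) // give up after 15 seconds
        };

        // Keep checking the browser's HTML until the scripts have rendered the product.
        HtmlDocument doc = web.LoadFromBrowser(url,
            html => html.Contains("Product Description"));

        // Assumed XPath, based on the markup seen in Chrome's Elements view.
        HtmlNodeCollection images = doc.DocumentNode.SelectNodes(
            "//div[contains(@class, 'product-extra-images')]//img");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (images == null)
        {
            Console.WriteLine("No product images found; the page may not have finished loading.");
            return;
        }

        foreach (HtmlNode img in images)
        {
            Console.WriteLine(img.GetAttributeValue("src", string.Empty));
        }
    }
}
```

If the null branch is hit consistently, raising BrowserTimeout or waiting on a different marker string would be the first things to try.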
