java - HtmlUnitDriver causing problems while getting a URL

Question

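I have a page crawler developed in Java using the Selenium libraries. The crawler goes through a website that launches 3 applications via JavaScript, each displayed as HTML in a popup window.

The crawler has no issues launching 2 of the applications, but on the 3rd one it freezes forever.

The code I am using is similar to: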

public void applicationSelect() {
  ...
  // obtain the url by parsing the tag's href attribute
  ...

  this.driver = new HtmlUnitDriver(BrowserVersion.INTERNET_EXPLORER_8);
  this.driver.setJavascriptEnabled(true);
  this.driver.get(url); //the code does not execute after this point for the 3rd app
  ...
}

I also tried clicking the web element with the following code:

public void applicationSelect() {
  ...
  WebElement element = this.driver.findElementByLinkText("linkText");
  element.click(); //the code does not execute after this point for the 3rd app
  ...
}

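Clicking it produces exactly the same result. For the code above, I made sure I am getting the right element.

Can anyone tell me what the problem might be?

On the application side, I cannot disclose any information about the HTML code. I know this makes the problem harder to solve, and I apologize for that in advance.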

=== Update 2013-04-10 ===

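So, I attached the sources to my crawler and saw where it gets stuck inside this.driver.get(url).

Basically, the driver gets lost in an infinite refresh loop. In the WebClient object instantiated by HtmlUnitDriver, an HtmlPage is loaded that keeps refreshing itself, seemingly without end.

Here is the code of WaitingRefreshHandler, from com.gargoylesoftware.htmlunit: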

public void handleRefresh(final Page page, final URL url, final int requestedWait) throws IOException {
  int seconds = requestedWait;
  if (seconds > maxwait_ && maxwait_ > 0) {
    seconds = maxwait_;
  }
  try {
    Thread.sleep(seconds * 1000);
  }
  catch (final InterruptedException e) {
    /* This can happen when the refresh is happening from a navigation that started
     * from a setTimeout or setInterval. The navigation will cause all threads to get
     * interrupted, including the current thread in this case. It should be safe to
     * ignore it since this is the thread now doing the navigation. Eventually we should
     * refactor to force all navigation to happen back on the main thread.
     */
    if (LOG.isDebugEnabled()) {
      LOG.debug("Waiting thread was interrupted. Ignoring interruption to continue navigation.");
    }
  }
  final WebWindow window = page.getEnclosingWindow();
  if (window == null) {
    return;
  }
  final WebClient client = window.getWebClient();
  client.getPage(window, new WebRequest(url));
}

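The instruction "client.getPage(window, new WebRequest(url))" calls the WebClient again to reload the page, which only ends up calling this same refresh method again. This seems to go on indefinitely, and it does not fill up memory quickly only because of "Thread.sleep(seconds * 1000)", which forces a 3m wait before each retry.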
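To make the loop described above easier to see, here is a minimal, self-contained sketch. It does not use HtmlUnit at all: SketchClient and SketchRefreshHandler are hypothetical stand-ins invented for illustration, the sleep is left out, and a load cap is added only so the demo terminates. It assumes a page that asks to be refreshed on every load, which is what the 3rd application appears to do.

```java
// Hypothetical stand-ins for illustration only; these are NOT HtmlUnit classes.
interface SketchRefreshHandler {
    void handleRefresh(SketchClient client, String url);
}

class SketchClient {
    private final SketchRefreshHandler handler;
    private final int maxLoads; // safety cap so the demo terminates
    int loads = 0;

    SketchClient(SketchRefreshHandler handler, int maxLoads) {
        this.handler = handler;
        this.maxLoads = maxLoads;
    }

    // Assumption: every page in this model requests a refresh of itself.
    void getPage(String url) {
        loads++;
        if (loads >= maxLoads) {
            return; // the real loop has no such cap
        }
        handler.handleRefresh(this, url);
    }
}

class RefreshLoopSketch {
    // Same shape as the handler quoted above: a refresh reloads the page.
    static final SketchRefreshHandler RELOADING = (client, url) -> client.getPage(url);

    // A handler that ignores refresh requests, so the page is loaded once.
    static final SketchRefreshHandler IGNORING = (client, url) -> { };

    static int countLoads(SketchRefreshHandler handler, int maxLoads) {
        SketchClient client = new SketchClient(handler, maxLoads);
        client.getPage("http://example.invalid/app3"); // placeholder URL
        return client.loads;
    }

    public static void main(String[] args) {
        System.out.println(countLoads(RELOADING, 50)); // prints 50 (stopped only by the cap)
        System.out.println(countLoads(IGNORING, 50));  // prints 1
    }
}
```

With the reloading handler the number of loads is limited only by the artificial cap, while a handler that ignores the refresh stops after the first load.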

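Does anyone have a suggestion on how I can work around this? I am thinking of creating 2 new classes that extend the original HtmlUnitDriver and WebClient, and then overriding the relevant methods to avoid this problem.

Thanks again.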

score 4 · Accepted Answer

I solved my eternal refresh problem by creating a RefreshHandler class that does nothing:

public class RefreshHandler implements com.gargoylesoftware.htmlunit.RefreshHandler {
  public RefreshHandler() { }
  // Intentionally empty: refresh requests are ignored, so the page is never reloaded.
  public void handleRefresh(final Page page, final URL url, final int seconds) { }
}

In addition, I extended the HtmlUnitDriver class and, by overriding the method modifyWebClient, set the new RefreshHandler:

public class HtmlUnitDriverExt extends HtmlUnitDriver { 
  public HtmlUnitDriverExt(BrowserVersion version) {
    super(version);
  }
  @Override
  protected WebClient modifyWebClient(WebClient client) {
    client.setRefreshHandler(new RefreshHandler());
    return client;
  }
}

The method modifyWebClient is a no-op method in HtmlUnitDriver, created specifically for this purpose.
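One possible refinement, in case ignoring every refresh turns out to be too blunt (a page might rely on a single legitimate refresh): allow a small number of refreshes per URL and drop the rest. The sketch below is only the counting logic. BoundedRefreshPolicy and shouldFollow are hypothetical names, not part of HtmlUnit or Selenium, and how the policy is wired into a handleRefresh implementation is an assumption left to the reader.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of HtmlUnit or Selenium.
class BoundedRefreshPolicy {
    private final int maxRefreshesPerUrl;
    private final Map<String, Integer> refreshCounts = new HashMap<>();

    BoundedRefreshPolicy(int maxRefreshesPerUrl) {
        if (maxRefreshesPerUrl < 0) {
            throw new IllegalArgumentException("maxRefreshesPerUrl must be >= 0");
        }
        this.maxRefreshesPerUrl = maxRefreshesPerUrl;
    }

    // Records one refresh request for the URL and returns true only while
    // the number of requests seen for that URL is within the budget.
    synchronized boolean shouldFollow(String url) {
        int count = refreshCounts.merge(url, 1, Integer::sum);
        return count <= maxRefreshesPerUrl;
    }
}
```

A handleRefresh implementation could then call something like policy.shouldFollow(url.toExternalForm()) and reload the page only when it returns true, for example with the client.getPage call shown in the WaitingRefreshHandler code from the question. The URL is keyed as a String here to avoid the host lookups that java.net.URL equality can trigger.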

Cheers.
