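I'm trying to do some scraping on this site to look up polling information programmatically. I initially tried Python, which worked well for loading the site and navigating the aspx forms, but I couldn't extract the embedded map data (since no package, so far, handles JavaScript). So I decided to dust off my Java skills and break out HtmlUnit. However, I hit a roadblock almost immediately.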

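The site seems to contain dead links to JavaScript files that don't exist. When HtmlUnit tries to load them, it gets a 404 and aborts with an exception.

The specific error: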

Jul 21, 2013 9:51:22 PM com.gargoylesoftware.htmlunit.html.HtmlPage loadExternalJavaScriptFile
SEVERE: Error loading JavaScript from [http://www.eci-polldaymonitoring.nic.in/psl/GoogleMapForASPNet.ascx/jsdebug].
com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException: 404 Not Found for http://www.eci-polldaymonitoring.nic.in/psl/GoogleMapForASPNet.ascx/jsdebug
    at com.gargoylesoftware.htmlunit.WebClient.throwFailingHttpStatusCodeExceptionIfNecessary(WebClient.java:544)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadJavaScriptFromUrl(HtmlPage.java:1119)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:1059)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:399)
    at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:260)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:276)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:676)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:635)
    at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1170)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1072)
    at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206)
    at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:330)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3074)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2041)
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:918)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:892)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:241)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:187)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:268)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:156)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:434)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:309)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:374)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:359)
    at ScrapeTest$.main(ScrapeTest.scala:12)
    at ScrapeTest.main(ScrapeTest.scala)

Is there a way to tell it to (a) ignore 404 errors entirely, or (b) ignore specific JavaScript URLs?

My code so far (Scala):

import com.gargoylesoftware.htmlunit.WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.html.HtmlPage

object ScrapeTest {

  def main(args: Array[String]): Unit = {
    val pageurl = "http://www.eci-polldaymonitoring.nic.in/psl/"
    val client = new WebClient(BrowserVersion.INTERNET_EXPLORER_8)
    
    var response: HtmlPage = client.getPage(pageurl)
    
    println(response.asText())
  }
}

2 Answers


A brief look at the HtmlUnit JavaDoc suggests you should be able to use WebClientOptions#setThrowExceptionOnFailingStatusCode(boolean).

For example:

import com.gargoylesoftware.htmlunit.WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.html.HtmlPage

object ScrapeTest {

  def main(args: Array[String]): Unit = {
    val pageurl = "http://www.eci-polldaymonitoring.nic.in/psl/"
    val client = new WebClient(BrowserVersion.INTERNET_EXPLORER_8)
    // Don't throw an exception when a request returns a failing status code (e.g. 404)
    client.getOptions.setThrowExceptionOnFailingStatusCode(false)

    var response: HtmlPage = client.getPage(pageurl)

    println(response.asText())
  }
}

If that doesn't work, you can also try:

answered 2013-07-22T02:52:06.647

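I had the same problem. I didn't want HtmlUnit to request external links. I also didn't want it to print CSS/JS warnings and all the other noise.

I configured HtmlUnit like this (using a Spring WebApplicationContext):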

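// Imports assumed: the static imports are guessed to come from Lombok and Apache Commons Lang 3.
import static java.lang.Boolean.FALSE;
import static lombok.AccessLevel.PRIVATE;
import static org.apache.commons.lang3.StringUtils.startsWith;
import static org.apache.commons.lang3.math.NumberUtils.INTEGER_ONE;

import java.io.IOException;
import java.time.Duration;

import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.StringWebResponse;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.javascript.SilentJavaScriptErrorListener;
import com.gargoylesoftware.htmlunit.util.WebConnectionWrapper;
import lombok.NoArgsConstructor;
import org.springframework.test.web.servlet.htmlunit.MockMvcWebClientBuilder;
import org.springframework.web.context.WebApplicationContext;

@NoArgsConstructor(access = PRIVATE)
public final class _MockWebClientCreator {

  public static WebClient createWebClient(WebApplicationContext wac) {
    WebClient webClient = MockMvcWebClientBuilder.webAppContextSetup(wac).build();
    webClient.getOptions().setThrowExceptionOnScriptError(FALSE);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(FALSE); // no exception on 404 etc.
    webClient.getOptions().setPrintContentOnFailingStatusCode(FALSE);
    webClient.setCssErrorHandler(new SilentCssErrorHandler());
    webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener());
    webClient.setWebConnection(new WebConnectionWrapper(webClient) { // Use only internal urls
        @Override
        public WebResponse getResponse(WebRequest request) throws IOException {
            return (startsWith(request.getUrl().toString(), "http://localhost"))
                ? super.getResponse(request)
                : new StringWebResponse("", request.getUrl()); // empty response for external URLs
        }
    });
    webClient.setJavaScriptTimeout(Duration.ofSeconds(INTEGER_ONE).toMillis());
    return webClient;
  }
}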
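
For option (b) in the question, the same WebConnectionWrapper idea can be narrowed to specific URLs. The following is only a minimal Scala sketch, assuming the same HtmlUnit 2.x release as the question (the `com.gargoylesoftware` packages, `BrowserVersion.INTERNET_EXPLORER_8`, `asText()`). The object name `IgnoreDeadScript` and the `deadScripts` set are hypothetical names introduced for illustration. Requests for the listed URLs get an empty response instead of reaching the server, so the 404 should never occur for them.

```scala
import com.gargoylesoftware.htmlunit.{BrowserVersion, StringWebResponse, WebClient, WebRequest, WebResponse}
import com.gargoylesoftware.htmlunit.html.HtmlPage
import com.gargoylesoftware.htmlunit.util.WebConnectionWrapper

object IgnoreDeadScript {

  // Hypothetical block list: the dead script URL taken from the stack trace in the question.
  val deadScripts: Set[String] = Set(
    "http://www.eci-polldaymonitoring.nic.in/psl/GoogleMapForASPNet.ascx/jsdebug"
  )

  def main(args: Array[String]): Unit = {
    val pageurl = "http://www.eci-polldaymonitoring.nic.in/psl/"
    val client = new WebClient(BrowserVersion.INTERNET_EXPLORER_8)

    // Wrap the client's connection so every request passes through getResponse first.
    client.setWebConnection(new WebConnectionWrapper(client) {
      override def getResponse(request: WebRequest): WebResponse =
        if (deadScripts.contains(request.getUrl.toString))
          new StringWebResponse("", request.getUrl) // answer with an empty body, no network call
        else
          super.getResponse(request)                // everything else goes to the real connection
    })

    val response: HtmlPage = client.getPage(pageurl)
    println(response.asText())
  }
}
```

Unlike turning off exceptions for every failing status code, this keeps 404s on other resources visible, at the cost of maintaining the URL list by hand.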
answered 2020-07-08T14:23:16.283