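I am trying to fetch and parse the page "http://www.ean-search.org/sitemap.html", but I always get a 404 error and a blank page. Every text content area in the result is empty.
I have tried many HtmlUnit WebClient option settings, such as .setThrowExceptionOnFailingStatusCode(false), setThrowExceptionOnScriptError(true), setRedirectEnabled(false), setJavaScriptEnabled(true), and setThrowExceptionOnScriptError(false). None of them worked.
Does anyone have a suggestion? Thanks.
PS: my web client code: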
myWebClient = new WebClient(BrowserVersion.FIREFOX_3_6);
myWebClient.setIncorrectnessListener(new CustomizedInconnectnessListener());
myWebClient.setTimeout(180000); //3 min, used twice, first for connection, second for retrieval
try {
    myWebClient.setUseInsecureSSL(true);
} catch (GeneralSecurityException ex) {
    logger.log(Level.SEVERE, "cannot set UseInsecureSSL for BNP webclient", ex);
    // ignore it, continue
}
myWebClient.setRedirectEnabled(true);
myWebClient.setCssEnabled(false);
myWebClient.setJavaScriptTimeout(30000); // timeout for executing JavaScript
myWebClient.setThrowExceptionOnScriptError(false);
HtmlPage htmlpage = (HtmlPage) myWebClient.getHtmlPage("http://www.ean-search.org/sitemap.html");
myWebClient.waitForBackgroundJavaScriptStartingBefore(3000);
Thread.sleep(3000);
System.out.println(htmlpage.asXml());
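
To narrow this down, it may help to check what the server returns for a plain HTTP request made outside HtmlUnit. If the status changes with the User-Agent header, the 404 is probably the server's reaction to the client rather than a parsing problem. The following is only a minimal sketch using JDK classes. The class name StatusProbe, the helper fetch, and both User-Agent strings are made up for illustration, and the body is assumed to be UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class StatusProbe {

    /** Status code and raw body of one HTTP response. */
    static final class Result {
        final int status;
        final String body;

        Result(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    /** Sends one GET with the given User-Agent and returns the status and body. */
    static Result fetch(String address, String userAgent) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("User-Agent", userAgent);
        conn.setInstanceFollowRedirects(false); // report redirects instead of following them
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        try {
            int status = conn.getResponseCode();
            // For 4xx/5xx responses the body is on the error stream, not the input stream.
            InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
            return new Result(status, in == null ? "" : readAll(in));
        } finally {
            conn.disconnect();
        }
    }

    /** Reads the whole stream and decodes it as UTF-8 (an assumption about the page). */
    private static String readAll(InputStream in) throws IOException {
        try (InputStream stream = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = stream.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String url = "http://www.ean-search.org/sitemap.html";
        // Both User-Agent values are illustrative, not taken from HtmlUnit.
        String[] agents = {
            "Java-probe/1.0",
            "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
        };
        for (String agent : agents) {
            Result r = fetch(url, agent);
            System.out.println(agent + " -> HTTP " + r.status
                    + ", body length " + r.body.length());
        }
    }
}
```

If both requests come back with 404 and an empty body, the problem is unlikely to be in the WebClient options above. If only one of them does, comparing the request headers HtmlUnit sends would be the next thing to look at.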