I'm navigating through a site with HtmlUnit. It has a table with a list of documents for download. I want to click every link and collect all the documents (don't worry, the information is public and scraping is not forbidden).

The site is written with JSF, so the links to the documents are actually <a href="#"> elements with an onclick handler that sets a hidden field to the appropriate value and then submits the form.

My code is (in Scala, but that doesn't matter):

val link = row.getFirstByXPath[HtmlElement](descriptor.documentLinkPath.get)
if (link.getAttribute("href").endsWith("#")) link.setAttribute("href", "javascript:void(0)")
val documentPage: Page = link.click()
val bytes = IOUtils.toByteArray(documentPage.getWebResponse().getContentAsStream())

There's a problem, however. The first document downloads properly, but for the second one onwards the HTML page is returned rather than the PDF document. (Commenting out the # -> javascript:void(0) replacement has no effect; I put it there because the click used to blow up with an exception.)
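To tell the two cases apart in code, one option is to look at the first bytes of the response body: a PDF file starts with the signature `%PDF-`, while the unwanted result is an HTML page. The following is only a minimal sketch in Java; the class name `PdfCheck` and the method `looksLikePdf` are hypothetical helpers, not part of HtmlUnit.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of HtmlUnit): checks whether downloaded bytes look like a PDF.
public final class PdfCheck {
    // Signature that a PDF file starts with.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfCheck() {}

    /** Returns true if the given bytes start with the PDF signature "%PDF-". */
    public static boolean looksLikePdf(byte[] bytes) {
        if (bytes == null || bytes.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (bytes[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Calling `PdfCheck.looksLikePdf(bytes)` on the downloaded byte array would then flag the rows for which the HTML page came back instead of the document.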

JavaScript is enabled, and the fact that the first document downloads means that things are generally working. It just doesn't work for the subsequent documents. Any ideas how to resolve this?

2 Answers

I couldn't get this to work without reloading the page either. I think the trick is to execute only the JavaScript from the onclick attribute.

This one:

return oamSubmitForm('broi_form','broi_form:dataTable1:4:_idJsp110',null,[['id_','3545']]);

Maybe this helps you.

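import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.ScriptResult;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException
{
    final WebClient webClient = new WebClient();

    HtmlPage page = webClient.getPage("http://dv.parliament.bg/DVWeb/broeveList.faces");

    // Select every anchor in the table that wraps an image (the download links).
    for (HtmlAnchor link : (List<HtmlAnchor>) page.getByXPath("//table[@id='broi_form:dataTable1']//a/img/.."))
    {
        // Drop the "return " keyword so the onclick body can be run as a plain statement.
        String commandString = link.getOnClickAttribute().replaceAll("return ", "");
        System.out.println(commandString);

        // Run the form-submitting script directly instead of clicking the link.
        ScriptResult executeJavaScript = page.executeJavaScript(commandString);

        // save(...) is a helper (not shown) that writes the stream to disk.
        Page newPage = executeJavaScript.getNewPage();
        save(newPage.getWebResponse().getContentAsStream());

        // Reload the list page before handling the next link.
        page = webClient.getPage("http://dv.parliament.bg/DVWeb/broeveList.faces");
    }
}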

But this is not the proper way to do it...
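One detail of the code above: `replaceAll("return ", "")` removes every occurrence of `return ` in the handler, including any that happen to sit inside a string argument. A slightly more conservative variant strips only a single leading `return`, which is presumably all that is needed, since a top-level `return` is not valid when the text is evaluated as a script. This is just a sketch; `OnclickScript` and `toStatement` are hypothetical names, not HtmlUnit API.

```java
// Hypothetical helper (not part of HtmlUnit): prepares an onclick attribute for direct execution.
public final class OnclickScript {
    private static final String RETURN_PREFIX = "return ";

    private OnclickScript() {}

    /** Trims the handler text and strips one leading "return " keyword, if present. */
    public static String toStatement(String onclick) {
        String script = onclick.trim();
        if (script.startsWith(RETURN_PREFIX)) {
            script = script.substring(RETURN_PREFIX.length()).trim();
        }
        return script;
    }
}
```

In the loop above, the result of `OnclickScript.toStatement(link.getOnClickAttribute())` could be passed to `page.executeJavaScript(...)` in place of `commandString`.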

Answered 2013-08-11T16:37:40.377

This worked for me after each download:

page = (HtmlPage) page.refresh();
Answered 2015-03-11T10:17:41.327