java - Web scraping with Java (Ajax/JavaScript-enabled pages)

Question

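I am new to web crawling. I am using crawler4j to crawl websites and collect the information I need from them. My problem is that I cannot crawl the content of the following page: http://www.sciencedirect.com/science/article/pii/S1568494612005741. I want to extract the following information from it (see the attached screenshot).

[Screenshot: the article page with three author names highlighted in red boxes]

In the attached screenshot there are three names (highlighted in red boxes). Clicking one of these links opens a popup that contains all the information about that author. I want to crawl the information in that popup.

I am using the following code to fetch the content.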

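import org.apache.http.HttpStatus;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.fetcher.PageFetchResult;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.parser.ParseData;
import edu.uci.ics.crawler4j.parser.Parser;
import edu.uci.ics.crawler4j.url.WebURL;

// Downloads a single page with crawler4j and returns its HTML as served (no JavaScript is executed).
public class WebContentDownloader {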

private Parser parser;
private PageFetcher pageFetcher;

public WebContentDownloader() {
    CrawlConfig config = new CrawlConfig();
    parser = new Parser(config);
    pageFetcher = new PageFetcher(config);
}

private Page download(String url) {
    WebURL curURL = new WebURL();
    curURL.setURL(url);
    PageFetchResult fetchResult = null;
    try {
        fetchResult = pageFetcher.fetchHeader(curURL);
        if (fetchResult.getStatusCode() == HttpStatus.SC_OK) {
            try {
                Page page = new Page(curURL);
                fetchResult.fetchContent(page);
                if (parser.parse(page, curURL.getURL())) {
                    return page;
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    } finally {
        if (fetchResult != null) {
            fetchResult.discardContentIfNotConsumed();
        }
    }
    return null;
}

private String processUrl(String url) {
    System.out.println("Processing: " + url);
    Page page = download(url);
    if (page != null) {
        ParseData parseData = page.getParseData();
        if (parseData != null) {
            if (parseData instanceof HtmlParseData) {
                HtmlParseData htmlParseData = (HtmlParseData) parseData;
                return htmlParseData.getHtml();
            }
        } else {
            System.out.println("Couldn't parse the content of the page.");
        }
    } else {
        System.out.println("Couldn't fetch the content of the page.");
    }
    return null;
}

public String getHtmlContent(String argUrl) {
    return this.processUrl(argUrl);
}
}

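I am able to fetch content from the above link, but it does not include the information I marked in the red boxes. I think those are dynamic links.

My questions: how can I crawl the content behind the above link? And, more generally, how can I crawl content from Ajax/JavaScript-based websites?

Can anyone please help me?

Thanks and regards, Amar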

score 6 · Accepted Answer

Hi, I found a workaround with another library. I used the Selenium WebDriver library (org.openqa.selenium.WebDriver) to extract the dynamic content. Here is sample code.

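import java.util.List;
import java.util.concurrent.TimeUnit;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class CollectUrls {

    private WebDriver driver;

    public CollectUrls() {
        // Launches a real Firefox instance, so the page's JavaScript is executed.
        this.driver = new FirefoxDriver();
        // Wait up to 30 seconds when looking up elements that are not present yet.
        this.driver.manage().timeouts().implicitlyWait(30, TimeUnit.SECONDS);
    }

    protected void next(String url, List<String> argUrlsList) {
        this.driver.get(url);
        // HTML as reported by the browser once the page has loaded.
        String htmlContent = this.driver.getPageSource();
    }
}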

Here, "htmlContent" is the part you need. Let me know if you run into any problems.

Thanks, Amar

score 5 · Accepted Answer

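Simply put, crawler4j is a static crawler: it does not execute the JavaScript on a page. Crawling the page you mention therefore cannot give you the content you want. There are, of course, some workarounds.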
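To illustrate why a static fetch misses such content, here is a minimal sketch that uses only the JDK. The HTML string, the "authors" id and the class name StaticFetchDemo are made up for the illustration and are not taken from the ScienceDirect page. The markup a static crawler downloads contains an empty container plus a script; the text the script would insert only appears once a browser actually runs it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StaticFetchDemo {

    // Hypothetical page source, exactly as a static crawler would receive it.
    static final String RAW_HTML =
            "<html><body>"
          + "<div id=\"authors\"></div>"
          + "<script>document.getElementById('authors').innerHTML = 'A. Author';</script>"
          + "</body></html>";

    // Returns the text inside <div id="..."> as found in the raw markup.
    // No script is executed; returns null if no such div exists.
    static String staticDivContent(String html, String id) {
        Pattern p = Pattern.compile(
                "<div id=\"" + Pattern.quote(id) + "\">(.*?)</div>", Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String content = staticDivContent(RAW_HTML, "authors");
        // The div is empty in the raw markup, although the script mentions the name.
        System.out.println("Static content of #authors: \"" + content + "\"");
    }
}
```

The author name is present in the download only as part of the script text, which is why parsing the fetched HTML alone does not surface it.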

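If you only want to crawl this one page, you could use a connection debugger. Check out this question for some tools. Find out which URL the AJAX request calls, and crawl that URL instead.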
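Once the AJAX URL is known, it can be requested directly. The following is a rough sketch with the JDK's java.net.http client (Java 11 or later), not a tested recipe for this particular site. The endpoint URL is a placeholder that has to be replaced with the one seen in the debugger, the class name AjaxEndpointFetcher is made up, and the X-Requested-With header is only an assumption about what a server might expect from an XHR call.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class AjaxEndpointFetcher {

    // Builds a GET request that resembles the browser's XHR call.
    static HttpRequest buildRequest(String endpointUrl) {
        return HttpRequest.newBuilder(URI.create(endpointUrl))
                .timeout(Duration.ofSeconds(30))
                // Assumption: some servers check this header on AJAX endpoints.
                .header("X-Requested-With", "XMLHttpRequest")
                .header("Accept", "application/json, text/html")
                .GET()
                .build();
    }

    // Sends the request and returns the response body (JSON or an HTML fragment).
    static String fetch(String endpointUrl) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(endpointUrl), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL: pass the real endpoint as the first argument.
        String endpoint = args.length > 0
                ? args[0]
                : "http://example.com/ajax/author-details?id=1";
        System.out.println(fetch(endpoint));
    }
}
```

The same URL could equally be handed to crawler4j; the point is only that the popup data usually comes from its own request, which can be fetched without a browser if the server accepts it.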

If you have several websites with dynamic content (JavaScript/Ajax), you should consider a crawler that supports dynamic content, such as Crawljax (also written in Java).

score 1 · Accepted Answer

I found a solution for crawling dynamic web pages using Aperture and Selenium WebDriver.
Aperture is a crawling tool, and Selenium is a testing tool that can render the page as it appears in Inspect Element.

1. Extract the Aperture core JAR file with a decompiler tool and create a simple web-crawling Java program. (https://svn.code.sf.net/p/aperture/code/aperture/trunk/)
2. Download the Selenium WebDriver JAR files and add them to your program.
3. Go to the createDataObject() method in org.semanticdesktop.aperture.accessor.http.HttpAccessor (in the decompiled Aperture source) and add the code below.

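    // Load the page in a real browser so that its JavaScript runs.
    WebDriver driver = new FirefoxDriver();
    String baseurl = uri.toString();
    driver.get(baseurl);
    String str = driver.getPageSource();
    driver.close();
    // Wrap the page source in a stream (getBytes() uses the platform default charset).
    stream = new ByteArrayInputStream(str.getBytes());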
