
I am trying to use Crawler4j and Selenium together for some website testing. After a page has been crawled, Selenium should immediately start a test using parameters taken from the crawler, for example the URL it should open or the ID of a search field. If I use Crawler4j on its own, it works fine and I can extract the information I need. If I run the Selenium test on its own with predefined parameters (URL and ID), it also works fine. But as soon as I put the same Selenium code inside the crawler code, I always get the exception below. My guess is that it might be a threading issue? It would be great if someone could give me a hint or help me out:

Exception in thread "Crawler 1" java.lang.NoClassDefFoundError: com/google/common/base/Function
    at myPackage.com.Selenium.init(Selenium.java:21)
    at myPackage.com.MyCrawler.visit(MyCrawler.java:57)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:351)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:220)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.ClassNotFoundException: com.google.common.base.Function
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    ... 5 more
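
The class com.google.common.base.Function comes from Google Guava, which the Selenium Java bindings depend on, so the NoClassDefFoundError indicates that Guava (or a compatible version of it) is not visible on the classpath the crawler threads run with. As a quick check, a small diagnostic like the one below can print which JAR each of the relevant classes is actually loaded from; this is a hypothetical helper written for illustration, not part of the original project, and the class names listed are only the usual suspects for this setup.

import java.net.URL;
import java.security.CodeSource;

// Hypothetical diagnostic: prints where each class needed by Selenium/HttpClient
// is loaded from, or reports it as missing from the classpath.
public class ClasspathCheck {

    public static void main(String[] args) {
        String[] names = {
            "com.google.common.base.Function",   // Guava (required by the Selenium Java bindings)
            "org.apache.http.client.HttpClient", // httpclient
            "org.apache.http.HttpEntity"         // httpcore
        };
        for (String name : names) {
            try {
                Class<?> c = Class.forName(name);
                CodeSource source = c.getProtectionDomain().getCodeSource();
                URL location = (source != null) ? source.getLocation() : null;
                System.out.println(name + " -> " + location);
            } catch (ClassNotFoundException e) {
                System.out.println(name + " -> NOT on the classpath");
            }
        }
    }
}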

This is my crawler code:

import java.util.List;
import java.util.regex.Pattern;

import org.apache.http.Header;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g" + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf" + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    /**
     * You should implement this function to specify whether the given url
     * should be crawled or not (based on your crawling logic).
     */
    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches() && href.startsWith("http://");
    }


    /**
     * This function is called when a page is fetched and ready to be processed
     * by your program.
     */
    @Override
    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        String domain = page.getWebURL().getDomain();
        String path = page.getWebURL().getPath();
        String subDomain = page.getWebURL().getSubDomain();
        String parentUrl = page.getWebURL().getParentUrl();
        String anchor = page.getWebURL().getAnchor();

        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Domain: '" + domain + "'");
        System.out.println("Sub-domain: '" + subDomain + "'");
        System.out.println("Path: '" + path + "'");
        System.out.println("Parent page: " + parentUrl);
        System.out.println("Anchor text: " + anchor);

        // here a web browser should open, but it doesn't work
        WebDriver driver = new FirefoxDriver();
        driver.get(url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            List<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }

        Header[] responseHeaders = page.getFetchResponseHeaders();
        if (responseHeaders != null) {
            System.out.println("Response headers:");
            for (Header header : responseHeaders) {
                System.out.println("\t" + header.getName() + ": " + header.getValue());
            }
        }

        System.out.println("=============");
    }
}
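
For context, this is roughly how MyCrawler gets started in a typical Crawler4j setup; it is only a minimal sketch, and the storage folder, seed URL, and number of crawlers are illustrative assumptions rather than values from the original project:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j");   // illustrative path

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.example.com/");    // illustrative seed URL

        // Each crawler runs in its own thread ("Crawler 1" in the stack trace above),
        // so Selenium and all of its dependencies must be on that same classpath.
        controller.start(MyCrawler.class, 1);
    }
}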

1 Answer


I was using httpcore-4.2.2 and httpclient-4.2.3. Using those versions together with Selenium must have been the problem. After I updated them to the latest versions, everything worked fine. I lost almost a week tracking this down.

Answered on 2014-06-14T12:51:31.173