java - How do I crawl with crawler4j?

Question

I've been at this for 4 hours and I simply can't see what I'm doing wrong. I have two files:

MyCrawler.java
Controller.java

MyCrawler.java

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.List;
import java.util.regex.Pattern;
import org.apache.http.Header;

public class MyCrawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g" + "|png|tiff?|mid|mp2|mp3|mp4"
                    + "|wav|avi|mov|mpeg|ram|m4v|pdf" + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    /**
     * You should implement this function to specify whether the given url
     * should be crawled or not (based on your crawling logic).
     */
    @Override
    public boolean shouldVisit(WebURL url) {
            String href = url.getURL().toLowerCase();
            return !FILTERS.matcher(href).matches() && href.startsWith("http://www.ics.uci.edu/");
    }

    /**
     * This function is called when a page is fetched and ready to be processed
     * by your program.
     */
    @Override
    public void visit(Page page) {
            int docid = page.getWebURL().getDocid();
            String url = page.getWebURL().getURL();
            String domain = page.getWebURL().getDomain();
            String path = page.getWebURL().getPath();
            String subDomain = page.getWebURL().getSubDomain();
            String parentUrl = page.getWebURL().getParentUrl();
            String anchor = page.getWebURL().getAnchor();

            System.out.println("Docid: " + docid);
            System.out.println("URL: " + url);
            System.out.println("Domain: '" + domain + "'");
            System.out.println("Sub-domain: '" + subDomain + "'");
            System.out.println("Path: '" + path + "'");
            System.out.println("Parent page: " + parentUrl);
            System.out.println("Anchor text: " + anchor);

            if (page.getParseData() instanceof HtmlParseData) {
                    HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
                    String text = htmlParseData.getText();
                    String html = htmlParseData.getHtml();
                    List<WebURL> links = htmlParseData.getOutgoingUrls();

                    System.out.println("Text length: " + text.length());
                    System.out.println("Html length: " + html.length());
                    System.out.println("Number of outgoing links: " + links.size());
            }

            Header[] responseHeaders = page.getFetchResponseHeaders();
            if (responseHeaders != null) {
                    System.out.println("Response headers:");
                    for (Header header : responseHeaders) {
                            System.out.println("\t" + header.getName() + ": " + header.getValue());
                    }
            }

            System.out.println("=============");
    }
}

Controller.java

package edu.crawler;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.http.Header;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller 
{

    public static void main(String[] args) throws Exception 
    {
            String crawlStorageFolder = "../data/";
            int numberOfCrawlers = 7;

            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder(crawlStorageFolder);

            /*
             * Instantiate the controller for this crawl.
             */
            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

            /*
             * For each crawl, you need to add some seed urls. These are the first
             * URLs that are fetched and then the crawler starts following links
             * which are found in these pages
             */
            controller.addSeed("http://www.ics.uci.edu/~welling/");
            controller.addSeed("http://www.ics.uci.edu/~lopes/");
            controller.addSeed("http://www.ics.uci.edu/");

            /*
             * Start the crawl. This is a blocking operation, meaning that your code
             * will reach the line after this only when crawling is finished.
             */
            controller.start(MyCrawler, numberOfCrawlers);
    }
}

The structure is as follows:

java/MyCrawler.java
java/Controller.java
jars/... --> all the crawler4j jars

I tried compiling it on a Windows machine with:

javac -cp "C:\xampp\htdocs\crawlcrowd\www\java\jars\*;C:\xampp\htdocs\crawlcrowd\www\java\*" MyCrawler.java

This works perfectly, and I end up with:

java/MyCrawler.class

However, when I type:

javac -cp "C:\xampp\htdocs\crawlcrowd\www\java\jars\*;C:\xampp\htdocs\crawlcrowd\www\java\*" Controller.java

it blows up with:

Controller.java:50: error: cannot find symbol
            controller.start(MyCrawler, numberOfCrawlers);
                             ^
  symbol:   variable MyCrawler
  location: class Controller
1 error

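So I think that somehow I'm not doing something I need to do: something that makes this new executable class "aware" of MyCrawler.class. I've tried fiddling with the classpath in the javac command line. I've also tried setting it in my environment variables... no luck.

Any idea how I can get this to work?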

Update

I got most of this code from the Google Code page itself, but I just can't figure out what has to go there. Even if I try this:

MyCrawler mc = new MyCrawler();

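No luck. Somehow, Controller.class doesn't know about MyCrawler.class.

Update 2

I don't think it matters, since the problem is obviously that the class can't be found, but either way, here is the signature of the start method on "CrawlController controller". Taken from here.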

   /**
     * Start the crawling session and wait for it to finish.
     * 
     * @param _c
     *            the class that implements the logic for crawler threads
     * @param numberOfCrawlers
     *            the number of concurrent threads that will be contributing in
     *            this crawling session.
     */
    public <T extends WebCrawler> void start(final Class<T> _c, final int numberOfCrawlers) {
            this.start(_c, numberOfCrawlers, true);
    }

In effect, when I pass "MyCrawler" I am passing a "crawler". The problem is that the application doesn't know what MyCrawler is.

score 1 · Accepted Answer

A couple of things come to mind:

Does your MyCrawler extend edu.uci.ics.crawler4j.crawler.WebCrawler?
```
public class MyCrawler extends WebCrawler
```
Are you passing MyCrawler.class (i.e., as a class) to controller.start?
```
controller.start(MyCrawler.class, numberOfCrawlers);
```

Both of these conditions must hold for the controller to compile and run. Also, crawler4j has some good examples here:

https://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/basic/BasicCrawler.java

https://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/basic/BasicCrawlController.java

These 2 classes will compile and run out of the box (i.e. BasicCrawlController), so they are a good starting point if you run into any problems.
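
To make the second point concrete, here is a minimal, self-contained sketch of the difference between a bare class name and a class literal. It does not use crawler4j: `BaseCrawler`, `DemoCrawler`, `DemoController` and `ClassLiteralDemo` are hypothetical stand-ins that only mirror the shape of the `start(Class<T>, int)` signature quoted in the question, and creating the instances by reflection is an assumption of this sketch rather than a statement about what the library does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for crawler4j's WebCrawler; not the real class.
abstract class BaseCrawler {
    abstract String name();
}

// Hypothetical stand-in for the question's MyCrawler.
class DemoCrawler extends BaseCrawler {
    @Override
    String name() {
        return "DemoCrawler";
    }
}

// Hypothetical stand-in for CrawlController; it mirrors only the shape of start().
class DemoController {
    // Takes a Class object and builds one crawler instance per requested thread.
    static <T extends BaseCrawler> List<T> start(Class<T> crawlerClass, int numberOfCrawlers) {
        List<T> crawlers = new ArrayList<>();
        try {
            for (int i = 0; i < numberOfCrawlers; i++) {
                crawlers.add(crawlerClass.getDeclaredConstructor().newInstance());
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not instantiate " + crawlerClass.getName(), e);
        }
        return crawlers;
    }
}

class ClassLiteralDemo {
    public static void main(String[] args) {
        // DemoController.start(DemoCrawler, 3);
        // The line above does not compile: a bare name in an expression is
        // looked up as a variable, hence "cannot find symbol ... variable".

        // A class literal is an expression of type Class<DemoCrawler>.
        List<DemoCrawler> crawlers = DemoController.start(DemoCrawler.class, 3);
        System.out.println(crawlers.size() + " instances of " + crawlers.get(0).name());
    }
}
```

One more thing may be worth checking, assuming the listings in the question are exactly what is being compiled: Controller.java declares `package edu.crawler;` while MyCrawler.java declares no package. Java code in a named package cannot refer to a class in the unnamed package, so both files would need the same package declaration (or none) before the class literal can resolve.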

score 0 · Accepted Answer

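The arguments to start() should be the crawler class and the number of crawlers. It raises an error because you are passing in a crawler object rather than the crawler class. Use the start method as shown below and it should work:

controller.start(MyCrawler.class, numberOfCrawlers);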
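
The distinction between a crawler object and a crawler class can be illustrated with plain Java, independent of crawler4j. In this sketch `SampleCrawler`, `ObjectVersusClassDemo` and `describe` are hypothetical names made up for the illustration. A method that expects a `Class<?>` accepts a class literal or the result of `getClass()`, but not an instance.

```java
// Hypothetical crawler type, used only for this illustration.
class SampleCrawler {
}

class ObjectVersusClassDemo {
    // Accepts a Class object (a "class token"), not an instance of the class.
    static String describe(Class<?> type) {
        return "class token for " + type.getSimpleName();
    }

    public static void main(String[] args) {
        SampleCrawler instance = new SampleCrawler();

        // Two ways to obtain the Class object for SampleCrawler.
        Class<SampleCrawler> fromLiteral = SampleCrawler.class;
        Class<? extends SampleCrawler> fromInstance = instance.getClass();

        // describe(instance);  // does not compile: an instance is not a Class<?>
        System.out.println(describe(fromLiteral));

        // Both expressions yield the same Class object.
        System.out.println(fromLiteral == fromInstance);
    }
}
```

The class literal is the more direct choice when the type is known at compile time, as it is for MyCrawler in the question.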

score -1 · Accepted Answer

Here you are passing the class name MyCrawler as an argument.

controller.start(MyCrawler, numberOfCrawlers);

I don't think the class name should be an argument.

I'm working on crawling too!
