
I am trying to read a list of seed URLs from a csv file and load them into the crawl controller with the following code:

public class BasicCrawlController {

    public static void main(String[] args) throws Exception {

        // CrawlConfig/CrawlController setup and numberOfCrawlers are
        // initialized here; omitted for brevity (see the note below).

        ArrayList<String> sl = Globals.INSTANCE.getSeeds();
        System.out.println("Seeds to add: " + sl.size());
        for (String url : sl) {
            System.out.println("Adding to seed: " + url);
            controller.addSeed(url);
        }
        controller.start(BasicCrawler.class, numberOfCrawlers);
    }
}
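
For reference, Globals.INSTANCE.getSeeds() is just a thin csv reader. It is roughly equivalent to the sketch below (the file name and the single-column parsing are illustrative, not my exact code):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

public class Globals {

    public static final Globals INSTANCE = new Globals();

    // Reads one seed URL per line from the first column of the csv
    // (file name is a placeholder).
    public ArrayList<String> getSeeds() throws IOException {
        ArrayList<String> seeds = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader("seeds.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                seeds.add(line.split(",")[0]);
            }
        }
        return seeds;
    }
}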

The output I get on the console is:

Seeds to add: 3
Adding to seed: http://xxxxx.com
Adding to seed: http://yyyyy.com
Adding to seed: http://zzzzz.com
 INFO [main] Crawler 1 started.
 INFO [main] Crawler 2 started.
 INFO [main] Crawler 3 started.
 INFO [main] Crawler 4 started.
 INFO [main] Crawler 5 started.
 INFO [main] Crawler 6 started.
 INFO [main] Crawler 7 started.
 INFO [main] Crawler 8 started.
 INFO [main] Crawler 9 started.
 INFO [main] Crawler 10 started.
ERROR [Crawler 1] String index out of range: -8, while processing: http://yyyyy.com/
ERROR [Crawler 1] String index out of range: -8, while processing: http://zzzzz.com/
 INFO [Thread-2] It looks like no thread is working, waiting for 10 seconds to make sure...
 INFO [Thread-2] No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...
 INFO [Thread-2] All of the crawlers are stopped. Finishing the process...
 INFO [Thread-2] Waiting for 10 seconds before final clean up...

Am I missing something that would allow seeds to be added dynamically before controller.start is called?
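
For what it's worth, the first seed seems to be processed fine and only the later ones fail, so one thing I plan to try is checking whether the strings coming out of the csv carry stray characters (a trailing \r, spaces, or a BOM), since a negative substring index usually points at a malformed input string. A defensive variant of the seed loop (the trim and the brackets in the log line are my additions, not in the code above):

for (String raw : sl) {
    String url = raw.trim(); // strip stray \r or spaces left over from the csv line
    System.out.println("Adding to seed: [" + url + "]"); // brackets make hidden whitespace visible
    controller.addSeed(url);
}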

The rest of the crawl-controller setup (the number of crawlers and the other crawler4j boilerplate) has been omitted from the code above to keep it short and readable.
