I am trying to read a list of seed URLs from a csv file and load them into the crawl controller with the following code:
public class BasicCrawlController {
    public static void main(String[] args) throws Exception {
        // CrawlConfig/CrawlController/numberOfCrawlers setup omitted here for
        // brevity (a sketch of it is at the end of the question).
        ArrayList<String> sl = Globals.INSTANCE.getSeeds();
        System.out.println("Seeds to add: " + sl.size());
        for (String url : sl) {
            System.out.println("Adding to seed: " + url);
            controller.addSeed(url);
        }
        controller.start(BasicCrawler.class, numberOfCrawlers);
    }
}
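For context, Globals is just a singleton holding the seed list read from the csv. A minimal sketch of such a class (the file name seeds.csv and the one-URL-per-line format are assumptions for illustration, not my exact code):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

// Hypothetical sketch: an enum singleton that reads one seed URL per line
// from seeds.csv (file name and format are placeholders).
public enum Globals {
    INSTANCE;

    private final ArrayList<String> seeds = new ArrayList<String>();

    Globals() {
        try (BufferedReader reader = new BufferedReader(new FileReader("seeds.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {
                    seeds.add(line);  // collect each non-empty line as a seed URL
                }
            }
        } catch (IOException e) {
            throw new RuntimeException("Could not read seed csv", e);
        }
    }

    public ArrayList<String> getSeeds() {
        return seeds;
    }
}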
The output I get on the console is as follows:
Seeds to add: 3
Adding to seed: http://xxxxx.com
Adding to seed: http://yyyyy.com
Adding to seed: http://zzzzz.com
INFO [main] Crawler 1 started.
INFO [main] Crawler 2 started.
INFO [main] Crawler 3 started.
INFO [main] Crawler 4 started.
INFO [main] Crawler 5 started.
INFO [main] Crawler 6 started.
INFO [main] Crawler 7 started.
INFO [main] Crawler 8 started.
INFO [main] Crawler 9 started.
INFO [main] Crawler 10 started.
ERROR [Crawler 1] String index out of range: -8, while processing: http://yyyyy.com/
ERROR [Crawler 1] String index out of range: -8, while processing: http://zzzzz.com/
INFO [Thread-2] It looks like no thread is working, waiting for 10 seconds to make sure...
INFO [Thread-2] No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...
INFO [Thread-2] All of the crawlers are stopped. Finishing the process...
INFO [Thread-2] Waiting for 10 seconds before final clean up...
Am I missing something that has to happen before controller.start in order for dynamically added seeds to work?
The number of crawlers and the rest of the required crawler4j setup in the crawl controller are omitted from the code above to keep it short and readable; a rough sketch of that omitted boilerplate follows.
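The omitted part follows the standard crawler4j example setup, roughly like this (the storage folder path is a placeholder, and numberOfCrawlers is 10 to match the "Crawler N started" lines in the log above):

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

// Standard crawler4j boilerplate, placed at the top of main() before the seed loop
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawldata");  // placeholder storage folder
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
int numberOfCrawlers = 10;  // ten crawler threads, as seen in the log output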