java - How can I reduce/change the delay after crawling?

Question

Does anyone have experience with Crawler4j?

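I implemented my own crawler by following the example on the project page. The crawler works fine and crawls very quickly. The only problem is that there is always a 20-30 second delay after it finishes. Is there a way to avoid this waiting time?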

score 2 · Accepted Answer

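I just checked the crawler4j source code. The CrawlController.start method contains several fixed 10-second "pauses" to make sure the threads have finished and are ready to be cleaned up.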

// Make sure again that none of the threads are alive.
logger.info("It looks like no thread is working, waiting for 10 seconds to make sure...");
sleep(10);

// ... more code ...

logger.info("No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...");
sleep(10);

// ... more code ...

logger.info("Waiting for 10 seconds before final clean up...");
sleep(10);

In addition, the main loop checks every 10 seconds whether the crawler threads have finished:

while (true) {
    sleep(10);
    // code to check if some thread is still working
}

protected void sleep(int seconds) {
   try {
       Thread.sleep(seconds * 1000);
   } catch (Exception ignored) {
   }
}

So it may be worth fine-tuning these calls and reducing the sleep times.

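If you can spare some time, a better solution is to rewrite this method. I would replace the List<Thread> threads with an ExecutorService; its awaitTermination method would be especially handy. Unlike sleep, awaitTermination(10, TimeUnit.SECONDS) returns immediately once all tasks have completed.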
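To illustrate the difference, here is a minimal standalone sketch. It is not crawler4j code: the class name AwaitTerminationSketch, the method runAndWait, and the dummy workers that just sleep are all hypothetical stand-ins for the crawler threads. It only shows that awaitTermination stops waiting as soon as the submitted tasks are done, rather than always blocking for the full timeout.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of crawler4j.
class AwaitTerminationSketch {

    /**
     * Starts `workers` dummy tasks that each take about `workMillis` ms,
     * then returns how many milliseconds awaitTermination actually blocked.
     */
    public static long runAndWait(int workers, long workMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(workMillis); // stand-in for real crawling work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Accept no new tasks; tasks already submitted keep running.
        pool.shutdown();

        long start = System.nanoTime();
        try {
            // Waits at most 10 seconds, but returns as soon as all tasks finish.
            boolean finished = pool.awaitTermination(10, TimeUnit.SECONDS);
            if (!finished) {
                pool.shutdownNow(); // timeout hit: interrupt the remaining tasks
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        long waited = runAndWait(4, 200);
        System.out.println("Blocked for about " + waited + " ms, not a fixed 10 s");
    }
}
```

With a fixed sleep(10) the caller would always pay the full 10 seconds; here the wait should end shortly after the slowest worker finishes. How this maps onto CrawlController depends on how its worker threads are created, so treat it as a starting point only.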
