ruby-on-rails - Web crawler in Rails: how to crawl all pages of a website

Question

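I need to collect every URL from every page of a given domain.
I think it makes sense to use background jobs and spread the work across multiple queues.
I tried cobweb, but it seems to be a very confusing gem.
I also tried anemone, but anemone runs for a very long time when the site has many pages: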

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.links
  end
end

What do you think would suit me best?

score 2 · Accepted Answer

You can use the Nutch crawler. Apache Nutch is a highly extensible and scalable open source web crawler software project.
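If staying in Ruby is preferred, the queue-based idea from the question can be sketched roughly as below. This is only a minimal illustration, not how Anemone or Nutch work internally. The names `extract_links`, `crawl`, and `HTTP_FETCHER` are hypothetical, the link extraction is a naive regex rather than a real HTML parser, and plain threads stand in for a background-job backend.

```ruby
require 'net/http'
require 'set'
require 'uri'

# Hypothetical helper: pull same-host http(s) links out of an HTML string.
# A naive regex is used here; a real crawler would use an HTML parser.
def extract_links(html, base_url)
  base = URI.parse(base_url)
  html.scan(/href\s*=\s*["']([^"'#]+)/i).flatten.filter_map do |href|
    begin
      uri = URI.join(base, href.strip)
    rescue URI::Error
      next # skip hrefs that are not valid URIs
    end
    next unless uri.is_a?(URI::HTTP) && uri.host == base.host
    uri.to_s
  end.uniq
end

# Hypothetical default fetcher: returns the body on success, nil otherwise.
HTTP_FETCHER = lambda do |url|
  begin
    response = Net::HTTP.get_response(URI.parse(url))
    response.is_a?(Net::HTTPSuccess) ? response.body : nil
  rescue StandardError
    nil
  end
end

# Crawl wave by wave: each wave's URLs go onto a queue that a small pool of
# worker threads drains; newly found links become the next wave.
def crawl(start_url, fetcher: HTTP_FETCHER, workers: 4, max_pages: 100)
  visited = Set.new([start_url])
  frontier = [start_url]

  until frontier.empty?
    queue = Queue.new
    frontier.each { |url| queue << url }
    queue.close # pop returns nil once the queue is drained

    found = Queue.new
    Array.new(workers) do
      Thread.new do
        while (url = queue.pop)
          html = fetcher.call(url)
          extract_links(html, url).each { |link| found << link } if html
        end
      end
    end.each(&:join)

    # Deduplicate on the main thread, so the visited set needs no lock.
    frontier = []
    until found.empty?
      break if visited.size >= max_pages
      link = found.pop
      frontier << link if visited.add?(link)
    end
  end

  visited.to_a
end
```

Calling `crawl("http://www.example.com/")` would return the URLs discovered on that host, capped at `max_pages`. In a Rails app, each wave could presumably be handed to a job backend instead of threads, with the visited set kept in a shared store.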
