
The site I want to index is fairly large, about 10,000 pages. All I really want is a JSON file containing all of the URLs so I can run some operations on them (sorting, grouping, etc.).

The basic Anemone loop works fine:

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

But (maybe because of the site's size?) the terminal froze after a while. So I installed MongoDB and used the following:

require 'rubygems'
require 'anemone'
require 'mongo'
require 'json'


$stdout = File.new('sitemap.json','w')


Anemone.crawl("http://www.mybigexamplesite.com/") do |anemone|
  anemone.storage = Anemone::Storage.MongoDB
  anemone.on_every_page do |page|
    puts page.url
  end
end

It's running now, but if there's any output in the JSON file when I come back in the morning, I'll be very surprised. I've never used MongoDB before, and the part of the Anemone docs about using storage isn't clear (to me, at least). Can anyone who has done this before give me some tips?
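For what it's worth, here is roughly what I think I'm aiming for: a minimal sketch, assuming the :storage option works the way I've read about it and that 10,000 URLs fit comfortably in a plain array (untested):

require 'anemone'
require 'mongo'
require 'json'

urls = []

Anemone.crawl("http://www.mybigexamplesite.com/",
              :storage => Anemone::Storage.MongoDB) do |anemone|
  anemone.on_every_page do |page|
    # collect URLs instead of printing them to a redirected $stdout
    urls << page.url.to_s
  end
end

# write a real JSON array once the crawl finishes
File.write('sitemap.json', JSON.pretty_generate(urls))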


2 Answers


For anyone who needs <= 100,000 URLs, the Ruby gem Spidr is a good option.
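A minimal sketch of how that might look (the Spidr.site / every_url calls and dumping a JSON array at the end are my assumptions, not tested against the asker's site):

require 'spidr'
require 'json'

urls = []

# visit every page on the site and record each URL the spider sees
Spidr.site('http://www.example.com/') do |agent|
  agent.every_url { |url| urls << url.to_s }
end

# write the collected URLs out as a JSON array
File.write('sitemap.json', JSON.pretty_generate(urls))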

Answered 2013-08-27T19:53:35.927

This is probably not the answer you wanted to see, but I strongly advise against using Anemone, and perhaps Ruby altogether, for crawling that many pages.

Anemone is not a maintained library and fails on many edge cases.

Ruby is not the fastest language, and MRI uses a global interpreter lock, which means you can't get true parallel threading. I think your crawl will probably be too slow. For more information about threading, I suggest you check out the following links (and the quick sketch after them).

http://ablogaboutcode.com/2012/02/06/the-ruby-global-interpreter-lock/

Does ruby have real multithreading?
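As a rough illustration of the GIL point (MRI only; exact timings will vary), CPU-bound work does not get faster just because you spread it across threads:

require 'benchmark'

# some CPU-bound busywork
def burn
  200_000.times { |i| Math.sqrt(i) }
end

# on MRI these two take roughly the same time, because the GIL lets
# only one thread execute Ruby code at any given moment
puts Benchmark.measure { 4.times { burn } }
puts Benchmark.measure { Array.new(4) { Thread.new { burn } }.each(&:join) }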

You can try running Anemone on Rubinius or JRuby, which are much faster, but I'm not sure how far the compatibility goes.

I had some mild success moving from Anemone to Nutch, but your mileage may vary.

Answered 2013-08-21T23:50:31.513