
The site I want to index is fairly large, about 10,000 pages. All I really want is a JSON file containing all of the URLs so I can run some operations on them (sorting, grouping, etc.).

The basic Anemone loop works fine:

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

But (maybe because of the site's size?) the terminal froze after a while. So I installed MongoDB and used the following:

require 'rubygems'
require 'anemone'
require 'mongo'
require 'json'


$stdout = File.new('sitemap.json','w')


Anemone.crawl("http://www.mybigexamplesite.com/") do |anemone|
  anemone.storage = Anemone::Storage.MongoDB
  anemone.on_every_page do |page|
    puts page.url
  end
end

It's running now, but if there's any output in the JSON file when I come back in the morning, I'll be very surprised. I've never used MongoDB before, and the part of the Anemone docs about using storage isn't clear (to me, at least). Can anyone who has done this before give me some tips?
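For what it's worth, here is roughly what I think I'm aiming for: a minimal sketch, assuming the :storage option works the way I've read about it and that 10,000 URLs fit comfortably in a plain array (untested):

require 'anemone'
require 'mongo'
require 'json'

urls = []

Anemone.crawl("http://www.mybigexamplesite.com/",
              :storage => Anemone::Storage.MongoDB) do |anemone|
  anemone.on_every_page do |page|
    # collect URLs instead of printing them to a redirected $stdout
    urls << page.url.to_s
  end
end

# write a real JSON array once the crawl finishes
File.write('sitemap.json', JSON.pretty_generate(urls))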


2 Answers


For anyone who needs <= 100,000 URLs, the Ruby gem Spidr is a good option.
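A minimal sketch of how that might look (the Spidr.site / every_url calls and dumping a JSON array at the end are my assumptions, not tested against the asker's site):

require 'spidr'
require 'json'

urls = []

# visit every page on the site and record each URL the spider sees
Spidr.site('http://www.example.com/') do |agent|
  agent.every_url { |url| urls << url.to_s }
end

# write the collected URLs out as a JSON array
File.write('sitemap.json', JSON.pretty_generate(urls))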

Answered 2013-08-27T19:53:35.927

This is probably not the answer you wanted to see, but I strongly advise against using Anemone, and perhaps Ruby altogether, for crawling that many pages.

Anemone is not a maintained library and fails on many edge cases.

Ruby is not the fastest language, and MRI uses a global interpreter lock, which means you can't get true parallel threading. I think your crawl will probably be too slow. For more information about threading, I suggest you check out the following links (and the quick sketch after them).

http://ablogaboutcode.com/2012/02/06/the-ruby-global-interpreter-lock/

Does ruby have real multithreading?
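As a rough illustration of the GIL point (MRI only; exact timings will vary), CPU-bound work does not get faster just because you spread it across threads:

require 'benchmark'

# some CPU-bound busywork
def burn
  200_000.times { |i| Math.sqrt(i) }
end

# on MRI these two take roughly the same time, because the GIL lets
# only one thread execute Ruby code at any given moment
puts Benchmark.measure { 4.times { burn } }
puts Benchmark.measure { Array.new(4) { Thread.new { burn } }.each(&:join) }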

You can try running Anemone on Rubinius or JRuby, which are much faster, but I'm not sure how far the compatibility goes.

I had some mild success moving from Anemone to Nutch, but your mileage may vary.

Answered 2013-08-21T23:50:31.513