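I'm using a Ruby seed file that scrapes data from APOD (Astronomy Picture of the Day). Since there are thousands of entries, is there a way to limit the scrape so it only pulls the last 365 images?
Here is the seed code I'm using: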
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'curl'
require 'fileutils'
BASE = 'http://antwrp.gsfc.nasa.gov/apod/'
FileUtils.mkdir('small') unless File.exist?('small')
FileUtils.mkdir('big') unless File.exist?('big')
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
f = URI.open 'http://antwrp.gsfc.nasa.gov/apod/archivepix.html'
html_doc = Nokogiri::HTML(f.read)
html_doc.xpath('//b//a').each do |element|
  # Each archive link points at one day's page.
  imgurl = BASE + element.attributes['href'].value
  doc = Nokogiri::HTML(URI.open(imgurl).read)
  doc.xpath('//p//a//img').each do |elem|
    # The thumbnail is the <img> src; the full-size image is the enclosing link's href.
    small_img = BASE + elem.attributes['src'].value
    big_img = BASE + elem.parent.attributes['href'].value
    s_img_f = open("small/#{File.basename(small_img)}", 'wb')
    b_img_f = open("big/#{File.basename(big_img)}", 'wb')
    rs_img = Curl::Easy.new(small_img)
    rb_img = Curl::Easy.new(big_img)
    rs_img.perform
    s_img_f.write(rs_img.body_str)
    rb_img.perform
    b_img_f.write(rb_img.body_str)
    s_img_f.close
    puts "Download #{File.basename(small_img)} finished."
    b_img_f.close
    puts "Download #{File.basename(big_img)} finished."
    rs_img.close
    rb_img.close
  end
end
puts "All done."
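To make the goal concrete, here is a minimal sketch of the kind of cap I have in mind. It assumes the archive page lists entries newest first, so that taking the first 365 links is the same as taking the last 365 days; I have not verified that ordering. The helper name `recent_hrefs` and the stand-in data are hypothetical and only there for illustration.

```ruby
# Hypothetical helper: keep only the first `limit` entries of the archive listing.
# Assumption: the listing is ordered newest first.
def recent_hrefs(hrefs, limit = 365)
  hrefs.first(limit)
end

# Stand-in data. In the seed script the list would presumably come from
# html_doc.xpath('//b//a').map { |a| a['href'] }
sample = (1..1000).map { |i| format('ap%06d.html', i) }

recent = recent_hrefs(sample)
puts recent.size # prints 365
```

If that assumption holds, the same idea might be applied directly in the script by changing the outer loop to `html_doc.xpath('//b//a').first(365).each do |element|`, since a Nokogiri node set supports `first(n)`.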