I can scrape http://www.example.com/view-books/0/new-releases with Nokogiri, but how do I scrape all of its pages? There are five of them, but since I don't know which page is the last one, how should I proceed?
Here is the program I wrote:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'

urls = ['http://www.example.com/view-books/0/new-releases?layout=grid&_pop=flyout',
        'http://www.example.com/view-books/1/bestsellers',
        'http://www.example.com/books/pre-order?query=book&cid=1&layout=list&ref=4b116001-01a6-4f53-8da7-945b74fdb253']

@titles       = []
@prices       = []
@descriptions = []
@page         = []

urls.each do |url|
  doc = Nokogiri::HTML(open(url))
  puts doc.at_css("title").text
  doc.css('.fk-inf-scroll-item').each do |item|
    @prices       << item.at_css(".final-price").text
    @titles       << item.at_css(".fk-srch-title-text").text
    @descriptions << item.at_css(".fk-item-specs-section").text
    @page         << item.at_css(".fk-inf-pageno").text rescue nil
  end
  (0..@prices.length - 1).each do |index|
    puts "title: #{@titles[index]}"
    puts "price: #{@prices[index]}"
    puts "description: #{@descriptions[index]}"
    # puts "pageno. : #{@page[index]}"
    puts ""
  end
end

CSV.open("result.csv", "wb") do |row|
  row << ["title", "price", "description", "pageno"]
  (0..@prices.length - 1).each do |index|
    row << [@titles[index], @prices[index], @descriptions[index], @page[index]]
  end
end
As you can see, I have hard-coded the URLs. How would you suggest I scrape the entire books category? I have been trying Anemone, but I can't get it to work.
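
For the unknown-page-count problem, the direction I'm considering is to increment a page parameter and stop as soon as a page comes back with no .fk-inf-scroll-item elements. A rough sketch of what I mean (the &page=N parameter is just my guess at how the site paginates; I haven't confirmed it):

require 'nokogiri'
require 'open-uri'

base = 'http://www.example.com/view-books/0/new-releases?layout=grid&_pop=flyout'
page = 1
loop do
  # NOTE: "&page=#{page}" is an assumed query parameter, not confirmed against the site
  doc   = Nokogiri::HTML(open("#{base}&page=#{page}"))
  items = doc.css('.fk-inf-scroll-item')
  break if items.empty?   # stop once a page returns no listing items
  items.each { |item| puts item.at_css('.fk-srch-title-text').text }
  page += 1
end

Is checking for an empty result set a sensible way to detect the last page, or is there a better signal (such as a "next" link) I should look for instead?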
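
And with Anemone, is something along these lines the right direction? I'm not sure I have the API right: that page.doc is the parsed Nokogiri document is my reading of the docs, and the /view-books/ URL pattern is just an example, not something I've verified covers the whole category.

require 'anemone'

Anemone.crawl('http://www.example.com/') do |anemone|
  # only handle book listing pages; the pattern below is an assumption
  anemone.on_pages_like(%r{/view-books/}) do |page|
    doc = page.doc   # should be the Nokogiri document for the fetched page
    doc.css('.fk-inf-scroll-item').each do |item|
      puts item.at_css('.fk-srch-title-text').text
    end
  end
end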