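If your goal is simply to get to the next page and scrape some information from it, then all you really care about is:
- The page content (to scrape your data from)
- The URL of the next page you need to visit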
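You can get at the page content with Mechanize OR with something else, such as OpenURI (which is part of the Ruby standard library). As a side note, Mechanize uses Nokogiri behind the scenes; when you start digging into the elements on a page, you will see them come back as Nokogiri objects.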
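Anyway, if this were my project, I would probably go the route of using OpenURI to get the page content and then Nokogiri to search it. I like the idea of leaning on the Ruby standard library rather than requiring an extra dependency (Nokogiri is a gem either way, since Mechanize depends on it too).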
Here is an example using OpenURI:
require 'nokogiri'
require 'open-uri'
# Use URI.open: as of Ruby 3.0, plain Kernel#open no longer opens URLs
printing_page = Nokogiri::HTML(URI.open("http://www.print-index.ru/default.aspx?p=81&gr=198"))
# ...
# Your code to scrape whatever you want from the Printing Page goes here
# ...
# Find the next page to visit. Example: You want to visit the "About the project" page next
about_project_link_in_navbar_menu = printing_page.css('a.graymenu')[4] # This is an overly simple finder. Nokogiri can do XPath searches too.
about_project_link_in_navbar_menu_url = "http://www.print-index.ru#{about_project_link_in_navbar_menu["href"]}" # Build the absolute URL (assumes the href starts with "/")
about_project_page = Nokogiri::HTML(URI.open(about_project_link_in_navbar_menu_url)) # Get the About page's content
# ....
# Do something...
# ....
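The URL-building line above only works when the href starts with a slash. As a rough sketch of a more forgiving alternative, URI.join from the standard library can resolve an href against the URL of the page it came from. The helper name absolute_url and the sample hrefs are made up for illustration; I have not checked what the real links on that site look like.

```ruby
require 'uri'

# Hypothetical helper (not part of OpenURI, Nokogiri, or Mechanize):
# resolves an href found on a page against the URL of that page.
def absolute_url(page_url, href)
  URI.join(page_url, href).to_s
end

page_url = "http://www.print-index.ru/default.aspx?p=81&gr=198"

# The hrefs below are invented examples, not links taken from the site.
puts absolute_url(page_url, "/default.aspx?p=5")    # root-relative href
puts absolute_url(page_url, "about.aspx")           # href relative to the current page
puts absolute_url(page_url, "http://example.com/x") # already absolute, returned unchanged
```

Note that URI.join raises URI::InvalidURIError for hrefs that are not valid URIs (for example, ones containing raw spaces), so you may want to rescue that if the site's markup is messy.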
Here is an example that uses Mechanize to get the page content (the two are very similar):
require 'mechanize'
agent = Mechanize.new
printing_page = agent.get("http://www.print-index.ru/default.aspx?p=81&gr=198")
# ...
# Your code to scrape whatever you want from the Printing Page goes here
# ...
# Find the next page to visit. Example: You want to visit the "About the project" page next
about_project_link_in_navbar_menu = printing_page.search('a.graymenu')[4] # This is an overly simple finder. Nokogiri can do XPath searches too.
about_project_link_in_navbar_menu_url = "http://www.print-index.ru#{about_project_link_in_navbar_menu["href"]}" # Build the absolute URL (assumes the href starts with "/")
about_project_page = agent.get(about_project_link_in_navbar_menu_url)
# ....
# Do something...
# ....
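Both examples boil down to the same two steps: get the page content, then find the URL of the next page. If you need to follow more than one hop, those steps can be wrapped in a loop. This is only a sketch under my own assumptions: crawl, fetch, next_url, and max_pages are names I made up, and the fetching and link-finding are passed in as callables so the loop works with either library.

```ruby
# Hypothetical helper: repeats "get page content, then find the next URL"
# until there is no next URL, a URL repeats, or max_pages is reached.
#   fetch    - callable: URL string -> page (e.g. a Nokogiri document or a Mechanize page)
#   next_url - callable: page -> absolute URL string of the next page, or nil to stop
def crawl(start_url, fetch:, next_url:, max_pages: 10)
  visited = []
  url = start_url
  while url && !visited.include?(url) && visited.length < max_pages
    page = fetch.call(url)
    visited << url
    yield url, page if block_given? # scrape whatever you want from the page here
    url = next_url.call(page)
  end
  visited # the URLs that were fetched, in order
end
```

With OpenURI, fetch could be something like ->(url) { Nokogiri::HTML(URI.open(url)) }; with Mechanize, ->(url) { agent.get(url) }. The visited list and the max_pages cap are there so a navigation menu that links back to itself does not loop forever.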
PS: I used Google to translate the Russian into English. If the variable names are incorrect, I'm sorry! :X