ruby-on-rails - Nokogiri and Mechanize help (clicking links found by Nokogiri through Mechanize)

Question

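I'm searching for links with a CSS selector on page = agent.get('http://www.print-index.ru/default.aspx?p=81&gr=198'). After that I have a lot of links in the page variable, but I don't know how to use them or how to click on them with Mechanize. I found this method on Stack Overflow: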

page = agent.get "http://google.com"
node = page.search ".//p[@class='posted']"
Mechanize::Page::Link.new(node, agent, page).click

But it only works for a single link, so how can I use this method for many links?

If I should post more information, please say so.

score 2 · Accepted Answer

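If your goal is simply to get to the next page and then scrape some information off of it, then all you really care about is:

- The page content (for scraping your data)
- The URL of the next page you need to visit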

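You can get at the page content with Mechanize or with something else, such as OpenURI (which is part of the Ruby standard library). As a side note, Mechanize uses Nokogiri behind the scenes; when you start digging into the elements on a page, you will see them come back as Nokogiri objects.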

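Anyway, if this were my project, I'd probably go the route of using OpenURI to fetch the page content and then Nokogiri to search it. I like the idea of using the Ruby standard library rather than requiring an additional dependency.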

Here is an example using OpenURI:

require 'nokogiri'
require 'open-uri'

# URI.open is required on Ruby 3.0+, where Kernel#open no longer fetches URLs
printing_page = Nokogiri::HTML(URI.open("http://www.print-index.ru/default.aspx?p=81&gr=198"))

# ...
# Your code to scrape whatever you want from the Printing Page goes here
# ...

# Find the next page to visit. Example: you want to visit the "About the project" page next
about_project_link_in_navbar_menu = printing_page.css('a.graymenu')[4] # This is an overly simple finder. Nokogiri can do XPath searches too.
about_project_link_in_navbar_menu_url = "http://www.print-index.ru#{about_project_link_in_navbar_menu['href']}" # Build the page's absolute URL

about_project_page = Nokogiri::HTML(URI.open(about_project_link_in_navbar_menu_url)) # Get the About page's content

# ....
# Do something...
# ....
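
The string interpolation above only works when every href is root-relative. As a rough sketch, assuming you would rather let the standard library resolve each link against the page it came from, something like this could be used. The helper name absolute_url and the sample href values are made up for illustration; only the base URL comes from the question.

```ruby
require 'uri'

# Resolve an href scraped from a page against that page's own URL.
# Returns nil for hrefs that are not worth following (empty, fragment-only,
# javascript: or mailto: links) or that cannot be parsed as a URI.
# NOTE: absolute_url is a hypothetical helper, not part of Nokogiri or Mechanize.
def absolute_url(base_url, href)
  href = href.to_s.strip
  return nil if href.empty?
  return nil if href.start_with?('#', 'javascript:', 'mailto:')

  URI.join(base_url, href).to_s
rescue URI::InvalidURIError
  nil
end

base_url = 'http://www.print-index.ru/default.aspx?p=81&gr=198'

# Illustrative href values; the real ones depend on the page's markup.
['/default.aspx?p=12', 'default.aspx?p=81&gr=199', 'http://example.com/about', '#top'].each do |href|
  puts "#{href.inspect} => #{absolute_url(base_url, href).inspect}"
end
```

With Nokogiri you would pass node['href'] for each node returned by css or search, and skip the ones that come back as nil.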

Here is an example that uses Mechanize to get the page content (the two are very similar):

require 'mechanize'

agent = Mechanize.new
printing_page = agent.get("http://www.print-index.ru/default.aspx?p=81&gr=198")

# ...
# Your code to scrape whatever you want from the Printing Page goes here
# ...

# Find the next page to visit. Example: you want to visit the "About the project" page next
about_project_link_in_navbar_menu = printing_page.search('a.graymenu')[4] # This is an overly simple finder. Nokogiri can do XPath searches too.
about_project_link_in_navbar_menu_url = "http://www.print-index.ru#{about_project_link_in_navbar_menu['href']}" # Build the page's absolute URL

about_project_page = agent.get(about_project_link_in_navbar_menu_url)

# ....
# Do something...
# ....
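
To come back to the "many links" part of the question: one way to handle it, sketched here under some assumptions, is to loop over the links and click each one in turn. Mechanize pages offer links_with to filter links and each link has a click method; the helper name each_linked_page and the href pattern in the usage comment are hypothetical, so adjust them to the links you actually need.

```ruby
# Click every link on a Mechanize page that matches the given criteria and
# yield the link together with the page it leads to.
# NOTE: each_linked_page is a hypothetical helper. It relies only on
# Mechanize::Page#links_with and Mechanize::Page::Link#click.
def each_linked_page(page, criteria = {})
  page.links_with(criteria).each do |link|
    next_page = link.click # performs the HTTP request for this link
    yield link, next_page
  end
end

# Usage (needs the mechanize gem and network access):
#
#   require 'mechanize'
#
#   agent = Mechanize.new
#   page  = agent.get('http://www.print-index.ru/default.aspx?p=81&gr=198')
#
#   # The href pattern is an assumption; pick one that matches your links.
#   each_linked_page(page, href: /default\.aspx/) do |link, next_page|
#     puts "#{link.text.strip} -> #{next_page.title}"
#   end
#
# If you already hold Nokogiri nodes from page.search, the snippet from the
# question can be wrapped in a loop instead:
#
#   page.search('a.graymenu').each do |node|
#     next_page = Mechanize::Page::Link.new(node, agent, page).click
#   end
```

Clicking through the agent keeps cookies and the referer consistent across requests, which fetching each URL separately with OpenURI does not do.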

PS: I used Google to translate the Russian into English, so I'm sorry if the variable names are off! :X
