ruby-on-rails - How do I convert a Nokogiri statement into Mechanize for screen scraping?

Question

I'm trying to use Mechanize to scrape some tags from a page. I've used Nokogiri successfully to scrape them before, but now I'm trying to combine them into a wider Mechanize class. Here is the Nokogiri statement:

page = Nokogiri::HTML(open(@model.url, "User-Agent" => request.env['HTTP_USER_AGENT']))
@model.icons = page.css("link[rel='apple-touch-icon']").to_s

And here is what I thought would be the Mechanize equivalent but it's not working:

agent = Mechanize.new
page = agent.get(@model.url, "User-Agent" => request.env['HTTP_USER_AGENT'])
@model.icons = page.search("link[rel='apple-touch-icon']").to_s

The first one returns a link tag as expected <link rel="apple-touch-icon" etc etc..></link>. The second statement returns a blank string. If I take the to_s off the end I get a super long output. I assume it's an error or the actual Mechanize object or something.

Link to long output when not converting to string: https://gist.github.com/eadam/5583541

score 1 · Accepted Answer

Without an example of the HTML it's hard to reproduce the problem, so here is some general information that might help you.

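That "long output" is the inspect output of the Nokogiri::NodeSet you got back from the search method. If search returns multiple nodes, or the nodes have a lot of children, the inspect output can go on for a while, but that is what it is supposed to do.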

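css and search are very similar in that both return a NodeSet. css assumes the string passed in is a CSS selector, while search is more generic and tries to figure out whether it was given a CSS or an XPath expression. If it guesses wrong, the pattern is unlikely to find a match. You can use at or search to stay generic and let Nokogiri figure it out, or use at_css and at_xpath, or css and xpath, respectively, to be explicit. The at variants all return the first matching node, similar to using search('some_path').first.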

to_s converts the NodeSet back into a representation of the source that was passed in. I prefer to be more explicit and use to_xml, to_xhtml or to_html.

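Why aren't you getting output from search like you do with css? I don't know, because I can't test against the HTML you're parsing. Answering questions, like data processing, is a GIGO (garbage in, garbage out) situation.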
