
I am using Ruby to scrape webpages that sometimes return redirects which I want to follow. There are many Ruby gems that do this, but there is a problem:

Ruby's URI.parse explodes on some URIs that are technically invalid but work fine in browsers, such as "http://www.google.com/?q=<>":

require 'uri'
URI.parse("http://www.google.com/?q=<>")                #=> raises URI::InvalidURIError

require 'addressable/uri'
Addressable::URI.parse("http://www.google.com/?q=<>")  #=> works

All the HTTP client libraries I have tried (HTTParty, Faraday, RestClient) break when they encounter such a URI in a redirect (this is on Ruby 1.9.3).

rest-client:

require 'rest-client'
RestClient.get("http://bitly.com/ReeuYv") #=> explodes

faraday:

require 'faraday'
require 'faraday_middleware'
conn = Faraday.new do |f|
  f.use FaradayMiddleware::FollowRedirects
  f.adapter Faraday.default_adapter
end
conn.get("http://bitly.com/ReeuYv")       #=> explodes

httparty:

require 'httparty'
HTTParty.get("http://bitly.com/ReeuYv")   # => explodes

open-uri:

require 'open-uri'
open("http://bitly.com/ReeuYv")           # => explodes

What can I do to make this work?


4 Answers


Mechanize is my favourite tool for web scraping.

The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, and can follow links and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites you have visited as a history.

require 'mechanize'
agent = Mechanize.new
page = agent.get('http://bitly.com/ReeuYv')
puts page.uri.to_s
=> http://www.google.com/?q=%3C%3E

It uses Nokogiri to parse the HTML, so every Mechanize::Page object can be treated like a Nokogiri object, which means you can get at parts of the HTML, e.g.

puts page.form('f').q
=> <>

That last bit might look like black magic, but you really need to try pp page for yourself. It makes HTML very easy to scrape.
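
For example, links and arbitrary nodes can be pulled out with the standard Mechanize/Nokogiri calls. A quick sketch, just for illustration:

require 'mechanize'
page = Mechanize.new.get('http://bitly.com/ReeuYv')

# Mechanize's own link objects:
page.links.first(5).each { |link| puts "#{link.text} -> #{link.href}" }

# or drop down to Nokogiri via page.search with a CSS selector:
page.search('a').first(5).each { |a| puts a['href'] }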

See the getting started guide and the documentation.

Answered 2012-11-06T21:32:17.143

Typhoeus works:

require 'typhoeus'
Typhoeus::VERSION #=> "0.5.0.rc" 
Typhoeus.get("http://bitly.com/ReeuYv", followlocation: true).body
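
To check that the redirect really was followed to the awkward URI, the final URL can be read off the response; effective_url comes straight from libcurl, though treat the exact accessor as an assumption and verify it against your Typhoeus version:

require 'typhoeus'

response = Typhoeus.get("http://bitly.com/ReeuYv", followlocation: true)
puts response.effective_url   # expected: http://www.google.com/?q=%3C%3E
puts response.code            # expected: 200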
Answered 2012-11-06T20:13:34.240

Curb seems to work:

require 'curb'
Curl.get("http://bitly.com/ReeuYv") { |c| 
  c.follow_location = true 
}.body_str  #=>  works
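
Similarly, the Curl::Easy handle that Curl.get returns can report where it ended up; last_effective_url is the accessor I would expect here, but treat it as an assumption if your Curb version differs:

require 'curb'

c = Curl.get("http://bitly.com/ReeuYv") { |http| http.follow_location = true }
puts c.last_effective_url   # expected: http://www.google.com/?q=%3C%3E
puts c.body_str.length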
Answered 2012-11-06T19:51:01.473

This will work:

uri = URI.escape "http://www.google.com/?q=<>"
#=> "http://www.google.com/?q=%3C%3E"

URI.parse(uri)  #=> no error
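
To apply the same escaping idea to the original redirect problem, the Location header can be escaped before parsing while following redirects by hand with Net::HTTP. A minimal sketch; parse_uri and follow_redirects are made-up helper names, the escape fallback only kicks in when URI.parse rejects the raw string, and only http URLs are handled here:

require 'net/http'
require 'uri'

# Hypothetical helper: fall back to URI.escape only when URI.parse
# rejects the raw string, so already-valid URLs are left untouched.
def parse_uri(url)
  URI.parse(url)
rescue URI::InvalidURIError
  URI.parse(URI.escape(url))
end

# Hypothetical helper: follow Location headers by hand, up to `limit` hops.
def follow_redirects(url, limit = 5)
  raise 'too many redirects' if limit.zero?
  response = Net::HTTP.get_response(parse_uri(url))
  if response.is_a?(Net::HTTPRedirection)
    follow_redirects(response['location'], limit - 1)
  else
    response
  end
end

follow_redirects("http://bitly.com/ReeuYv").body  # final (non-redirect) response body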
Answered 2012-11-06T20:02:06.240