
I am using Ruby to scrape webpages that sometimes return redirects which I want to follow. There are many Ruby gems that do this, but there is a problem:

Ruby's URI.parse explodes on some URIs that are technically invalid but work fine in browsers, such as "http://www.google.com/?q=<>":

require 'uri'
URI.parse("http://www.google.com/?q=<>")                #=> raises URI::InvalidURIError

require 'addressable/uri'
Addressable::URI.parse("http://www.google.com/?q=<>")  #=> works

All the HTTP client libraries I have tried (HTTParty, Faraday, RestClient) break when they encounter such a URI in a redirect (this is on Ruby 1.9.3).

rest-client:

require 'rest-client'
RestClient.get("http://bitly.com/ReeuYv") #=> explodes

faraday:

require 'faraday'
require 'faraday_middleware'
conn = Faraday.new do |f|
  f.use FaradayMiddleware::FollowRedirects
  f.adapter Faraday.default_adapter
end
conn.get("http://bitly.com/ReeuYv")       #=> explodes

httparty:

require 'httparty'
HTTParty.get("http://bitly.com/ReeuYv")   # => explodes

open-uri:

require 'open-uri'
open("http://bitly.com/ReeuYv")           # => explodes

What can I do to make this work?


4 Answers


Mechanize is my favourite tool for web scraping.

The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, and can follow links and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites you have visited as a history.

require 'mechanize'
agent = Mechanize.new
page = agent.get('http://bitly.com/ReeuYv')
puts page.uri.to_s
=> http://www.google.com/?q=%3C%3E

It uses Nokogiri to parse the HTML, so every Mechanize::Page object can be treated like a Nokogiri object, which means you can get at parts of the HTML, e.g.

puts page.form('f').q
=> <>

That last bit might look like black magic, but you really need to try pp page for yourself. It makes HTML very easy to scrape.
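
For example, links and arbitrary nodes can be pulled out with the standard Mechanize/Nokogiri calls. A quick sketch, just for illustration:

require 'mechanize'
page = Mechanize.new.get('http://bitly.com/ReeuYv')

# Mechanize's own link objects:
page.links.first(5).each { |link| puts "#{link.text} -> #{link.href}" }

# or drop down to Nokogiri via page.search with a CSS selector:
page.search('a').first(5).each { |a| puts a['href'] }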

See the getting started guide and the documentation.

Answered 2012-11-06T21:32:17.143

Typhoeus works:

require 'typhoeus'
Typhoeus::VERSION #=> "0.5.0.rc" 
Typhoeus.get("http://bitly.com/ReeuYv", followlocation: true).body
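
To check that the redirect really was followed to the awkward URI, the final URL can be read off the response; effective_url comes straight from libcurl, though treat the exact accessor as an assumption and verify it against your Typhoeus version:

require 'typhoeus'

response = Typhoeus.get("http://bitly.com/ReeuYv", followlocation: true)
puts response.effective_url   # expected: http://www.google.com/?q=%3C%3E
puts response.code            # expected: 200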
Answered 2012-11-06T20:13:34.240

Curb seems to work:

require 'curb'
Curl.get("http://bitly.com/ReeuYv") { |c| 
  c.follow_location = true 
}.body_str  #=>  works
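
Similarly, the Curl::Easy handle that Curl.get returns can report where it ended up; last_effective_url is the accessor I would expect here, but treat it as an assumption if your Curb version differs:

require 'curb'

c = Curl.get("http://bitly.com/ReeuYv") { |http| http.follow_location = true }
puts c.last_effective_url   # expected: http://www.google.com/?q=%3C%3E
puts c.body_str.length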
Answered 2012-11-06T19:51:01.473

This will work:

uri = URI.escape "http://www.google.com/?q=<>"
#=> "http://www.google.com/?q=%3C%3E"

URI.parse(uri)  #=> no error
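
To apply the same escaping idea to the original redirect problem, the Location header can be escaped before parsing while following redirects by hand with Net::HTTP. A minimal sketch; parse_uri and follow_redirects are made-up helper names, the escape fallback only kicks in when URI.parse rejects the raw string, and only http URLs are handled here:

require 'net/http'
require 'uri'

# Hypothetical helper: fall back to URI.escape only when URI.parse
# rejects the raw string, so already-valid URLs are left untouched.
def parse_uri(url)
  URI.parse(url)
rescue URI::InvalidURIError
  URI.parse(URI.escape(url))
end

# Hypothetical helper: follow Location headers by hand, up to `limit` hops.
def follow_redirects(url, limit = 5)
  raise 'too many redirects' if limit.zero?
  response = Net::HTTP.get_response(parse_uri(url))
  if response.is_a?(Net::HTTPRedirection)
    follow_redirects(response['location'], limit - 1)
  else
    response
  end
end

follow_redirects("http://bitly.com/ReeuYv").body  # final (non-redirect) response body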
Answered 2012-11-06T20:02:06.240