2

我正在尝试做一些网络抓取,但 WWW:Mechanize gem 似乎不喜欢编码和崩溃。
发布请求导致 302 重定向(随后是机械化,到目前为止一切都很好),结果页面似乎崩溃了。我用谷歌搜索了很多,但到目前为止还没有出现如何解决这个问题。大家有什么想法吗?

代码:

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new

agent.user_agent_alias = 'Mac Safari'
answer = agent.post('https://www.budget.de/de/reservierung/privatkunden/step1/schnellbuchung',
{"Country" => "Deutschland",
"Abholstation" => "Aalen",
"Abgabestation" => "Aalen",
"Abholdatum" => "26.02.2009",
"Abholzeit_stunde" => "13",
"Abholzeit_minute" => "30",
"Abgabedatum" => "28.02.2009",
"Abgabezeit_stunde" => "13",
"Abgabezeit_minute" => "30",
"CountryID" => "DE",
"AbholstationID"=>"AA1",
"AbgabestationID"=>"AA1"
}
)
puts answer.body

错误:

D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/util.rb:29:in `iconv': "\204nderungen vorbe"... (Iconv::IllegalSequence)
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/util.rb:29:in `to_native_charset'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/response_header_handler.rb:29:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:30:in `pass'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/handler.rb:6:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/response_body_parser.rb:35:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:30:in `pass'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/handler.rb:6:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/pre_connect_hook.rb:14:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:25:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:494:in `fetch_page'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:545:in `fetch_page'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:403:in `post_form'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:322:in `post'
from test.rb:7
4

2 回答 2

3

该页面肯定是 UTF-8,但是 Mechanize 使用 NKF(一个核心 Ruby 库)来猜测编码,并且由于某种原因它作为 Shift JIS 出现。解决此问题的最快方法是覆盖 Mechanize 的编码映射,这样当它尝试使用 Iconv 将正文转换为 UTF-8 时,它也会将源编码作为 UTF-8 传递。你可以这样做:

WWW::Mechanize::Util::CODE_DIC[:SJIS] = "UTF-8"

require将其放置在 Mechanize 库所在的行之后。您可能希望在找到问题的根本原因并在必要时提交补丁后立即重新设置该值,甚至更好。

注意:我解决这个问题的方法是使用回溯调试 Mechanize 库。问题所在的to_native_charset方法调用。detect_charset

于 2009-02-25T15:12:55.517 回答
0

在我的情况下, aMechanize::File是由根本不使用编码的 get 方法返回的。
我可以通过手动转换来修复它Iconv,但这只有在你已经知道编码的情况下才有效。

result = @agent.get uri
# Mechanize::File instead of Mechanize::Page is returned 
# so we have to convert manually
result = Iconv.conv("utf-8", "iso-8859-1", result.body)
于 2013-02-04T15:42:45.453 回答