ruby - Problem with text/csv Content-Encoding = UTF-8 in Ruby Mechanize

Question

When trying to load a UTF-8 encoded CSV page with Mechanize v2.5.1, I used the following code:

a.content_encoding_hooks << lambda{|httpagent, uri, response, body_io|
 response['Content-Encoding'] = 'none' if response['Content-Encoding'].to_s == 'UTF-8'
}
p4 = a.get(redirect_url, nil, ['accept-encoding' => 'UTF-8'])

But I find that the content encoding hook is not being called, and I get the following error and traceback:

/Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:787:in 'response_content_encoding': unsupported content-encoding: UTF-8 (Mechanize::Error)
    from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:274:in 'fetch'
    from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:949:in 'response_redirect'
    from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:299:in 'fetch'
    from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:949:in 'response_redirect'
    from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:299:in 'fetch'
    from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize.rb:407:in 'get'
    from prototype/test1.rb:307:in `<main>'

Does anyone know why the content encoding hook is not firing and why I am getting this error?

score 1 · Accepted Answer

"But I find that the content encoding hook is not being called"

What makes you think that?

The error message refers to this code:

  def response_content_encoding response, body_io
    ...
    ...

    out_io = case response['Content-Encoding']
             when nil, 'none', '7bit', "" then
               body_io
             when 'deflate' then
               content_encoding_inflate body_io
             when 'gzip', 'x-gzip' then
               content_encoding_gunzip body_io
             else
               raise Mechanize::Error,
                 "unsupported content-encoding: #{response['Content-Encoding']}"

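So mechanize only recognizes these content encodings: nil, 'none', '7bit', or an empty string (all passed through unchanged), plus 'deflate', 'gzip', and 'x-gzip'. Anything else, including 'UTF-8', raises the error you are seeing.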
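To make that dispatch concrete, here is a minimal standalone sketch of the same idea using only Ruby's standard library. The helper name decode_body is hypothetical and is not part of Mechanize, and the downcase call is my own addition (the HTTP spec says content-coding values are case-insensitive); treat it as an illustration of the case statement above, not as Mechanize's actual implementation.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper, NOT part of Mechanize: returns the decoded body for a
# given Content-Encoding value, or raises for a coding it does not know.
def decode_body(content_encoding, body)
  # nil.to_s is "", so a missing header falls into the pass-through branch.
  case content_encoding.to_s.downcase
  when '', 'none', '7bit'
    body
  when 'deflate'
    Zlib::Inflate.inflate(body)
  when 'gzip', 'x-gzip'
    Zlib::GzipReader.new(StringIO.new(body)).read
  else
    raise ArgumentError, "unsupported content-encoding: #{content_encoding}"
  end
end

# A gzip-compressed CSV body decodes back to the original text.
puts decode_body('gzip', Zlib.gzip("a,b\n1,2\n"))

# A character set name is not a content coding, so it is rejected.
begin
  decode_body('UTF-8', "a,b\n1,2\n")
rescue ArgumentError => e
  puts e.message
end
```

A value such as 'UTF-8' falls through to the else branch in the same way the server's header does in the traceback.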

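From the HTTP/1.1 spec:

14.11 Content-Encoding

The Content-Encoding entity-header field is used as a modifier to the media-type. When present, its value indicates what additional content codings have been applied to the entity-body, and thus what decoding mechanisms must be applied in order to obtain the media-type referenced by the Content-Type header field. Content-Encoding is primarily used to allow a document to be compressed without losing the identity of its underlying media type.
   Content-Encoding  = "Content-Encoding" ":" 1#content-coding
Content codings are defined in section 3.5. An example of its use is
   Content-Encoding: gzip
The content-coding is a characteristic of the entity identified by the Request-URI. Typically, the entity-body is stored with this encoding and is only decoded before rendering or analogous usage. However, a non-transparent proxy MAY modify the content-coding if the new coding is known to be acceptable to the recipient, unless the "no-transform" cache-control directive is present in the message.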

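...

3.5 Content Codings

Content coding values indicate an encoding transformation that has been or can be applied to an entity. Content codings are primarily used to allow a document to be compressed or otherwise usefully transformed without losing the identity of its underlying media type and without loss of information. Frequently, the entity is stored in coded form, transmitted directly, and only decoded by the recipient.
   content-coding   = token
All content-coding values are case-insensitive. HTTP/1.1 uses content-coding values in the Accept-Encoding (section 14.3) and Content-Encoding (section 14.11) header fields. Although the value describes the content-coding, what is more important is that it indicates what decoding mechanism will be required to remove the encoding.

The Internet Assigned Numbers Authority (IANA) acts as a registry for content-coding value tokens. Initially, the registry contains the following tokens:

gzip An encoding format produced by the file compression program "gzip" (GNU zip) as described in RFC 1952 [25]. This format is a Lempel-Ziv coding (LZ77) with a 32 bit CRC.

compress The encoding format produced by the common UNIX file compression program "compress". This format is an adaptive Lempel-Ziv-Welch coding (LZW).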
    Use of program names for the identification of encoding formats
    is not desirable and is discouraged for future encodings. Their
    use here is representative of historical practice, not good
    design. For compatibility with previous implementations of HTTP,
    applications SHOULD consider "x-gzip" and "x-compress" to be
    equivalent to "gzip" and "compress" respectively.
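deflate The "zlib" format defined in RFC 1950 [31] in combination with the "deflate" compression mechanism described in RFC 1951 [29].

identity The default (identity) encoding; the use of no transformation whatsoever. This content-coding is used only in the Accept-Encoding header, and SHOULD NOT be used in the Content-Encoding header.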
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.5

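In other words, an HTTP content encoding has nothing to do with ascii v. utf-8 v. latin-1. It describes a transformation such as compression, not a character set.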
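In HTTP, the character set is normally announced as the charset parameter of the Content-Type header (for example, text/csv; charset=UTF-8). The sketch below shows one way to pull that parameter out and label the raw bytes with it. Both helper names are hypothetical, and the regular expression is a simplification that ignores some legal header syntax.

```ruby
# Hypothetical helper: extracts the charset parameter from a Content-Type
# header value, or returns nil when no charset is given.
def charset_from_content_type(content_type)
  match = content_type.to_s.match(/;\s*charset\s*=\s*"?([^\s;"]+)"?/i)
  match && match[1]
end

# Hypothetical helper: labels the raw bytes with the declared charset,
# falling back to binary when the charset is missing or unknown to Ruby.
def label_body(body, content_type)
  charset = charset_from_content_type(content_type)
  encoding = charset ? Encoding.find(charset) : Encoding::BINARY
  body.dup.force_encoding(encoding)
rescue ArgumentError
  # Encoding.find raises ArgumentError for names Ruby does not know.
  body.dup.force_encoding(Encoding::BINARY)
end

raw = "caf\xC3\xA9,1\n".b  # bytes as they might arrive off the wire
csv = label_body(raw, 'text/csv; charset=UTF-8')
puts csv.encoding
puts csv.valid_encoding?
```

force_encoding only relabels the bytes; it does not transcode them, which is what you want when the server has already sent UTF-8.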

In addition, the source code for Mechanize::HTTP::Agent has this in it:

  # A list of hooks to call after retrieving a response.  Hooks are called with
  # the agent and the response returned.
  attr_reader :post_connect_hooks

  # A list of hooks to call before making a request.  Hooks are called with
  # the agent and the request to be performed.
  attr_reader :pre_connect_hooks

  # A list of hooks to call to handle the content-encoding of a request.
  attr_reader :content_encoding_hooks

So it looks like you aren't even calling the right hook.

Here is an example that I got to work:

require 'mechanize'

a = Mechanize.new

p a.content_encoding_hooks

func = lambda do |agent, uri, resp, body_io|
  puts body_io.read
  body_io.rewind  # reading drained the IO; rewind so Mechanize can still read the body
  puts "The Content-Encoding is: #{resp['Content-Encoding']}"

  if resp['Content-Encoding'].to_s == 'UTF-8'
    resp['Content-Encoding'] = 'none'
  end

  puts "The Content-Encoding is now: #{resp['Content-Encoding']}"
end

a.content_encoding_hooks << func

a.get(
  'http://localhost:8080/cgi-bin/myprog.rb',
  [],
  nil,
  "Accept-Encoding" => 'gzip, deflate'  #This is what Firefox always uses
)

myprog.rb:

#!/usr/bin/env ruby

require 'cgi'

cgi = CGI.new('html3')

headers = {
  "type" => 'text/html',
  "Content-Encoding" => "UTF-8",
}

cgi.out(headers) do
  cgi.html() do
    cgi.head{ cgi.title{"Content-Encoding Test"} } +
    cgi.body() do
      cgi.div(){ "The Accept-Encoding was: #{cgi.accept_encoding}" }
    end
  end
end

--output:--
[]
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"><HTML><HEAD><TITLE>Content-Encoding Test</TITLE></HEAD><BODY><DIV>The Accept-Encoding was: gzip, deflate</DIV></BODY></HTML>
The Content-Encoding is: UTF-8
The Content-Encoding is now: none
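Building on the working example, a slightly more general hook could reset any value that is not in the list handled by the case statement quoted earlier, rather than checking for 'UTF-8' alone. This is only a sketch: the names KNOWN_CODINGS and DROP_UNKNOWN_CODING are hypothetical, and it assumes the body is not actually compressed when the server sends a bogus coding. A plain Net::HTTPOK object stands in for the response so the hook can be tried without Mechanize or a server.

```ruby
require 'net/http'
require 'stringio'

# Content-Encoding values that the Mechanize snippet quoted earlier handles.
KNOWN_CODINGS = ['', 'none', '7bit', 'deflate', 'gzip', 'x-gzip'].freeze

# Hook taking the same four arguments as the working example above.
# It never reads body_io, so later decoding still sees the whole body.
DROP_UNKNOWN_CODING = lambda do |_agent, _uri, response, _body_io|
  coding = response['Content-Encoding'].to_s.downcase
  response['Content-Encoding'] = 'none' unless KNOWN_CODINGS.include?(coding)
end

# With Mechanize installed, registration would mirror the example above:
#   agent = Mechanize.new
#   agent.content_encoding_hooks << DROP_UNKNOWN_CODING

# Stand-in response object, used here only to exercise the hook.
response = Net::HTTPOK.new('1.1', '200', 'OK')
response['Content-Encoding'] = 'UTF-8'
DROP_UNKNOWN_CODING.call(nil, nil, response, StringIO.new("a,b\n1,2\n"))
puts response['Content-Encoding']
```

Legitimate values such as gzip are left alone, so compressed responses still go through the normal decoding path.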
