
We're cleaning up some errors on our site after migration from ruby 1.8.7 to 1.9.3, Rails 3.2.12. We have one encoding error left -- Bing is sending requests for URLs in the form

/search?q=author:\"Andr\xc3\xa1s%20Guttman\"

(This reads /search?q=author:"András Guttman", where the á is escaped).

In fairness to Bing, we were the ones that gave them those bogus URLs, but ruby 1.9.3 isn't happy with them any more.

Our server is currently returning a 500. Rails is returning the error "Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT"

I am unable to reproduce this error in a browser, or via curl or wget from OS X or Linux command line.

I want to send a 301 redirect back with a properly encoded URL.

I am guessing that I want to:

  1. detect whether the URL has the old, malformed UTF-8, and only then
  2. use String#encode to get from old to new UTF-8
  3. use CGI.escape() to %-encode the URL
  4. 301 redirect to the corrected URL

So I have read a lot and am not sure how (or if) I can detect this bogus URL. I need to detect because otherwise I would have to 301 everything!

When I try in irb I get these results:

1.9.3p392 :015 > foo = "/search?q=author:\"Andr\xc3\xa1s%20Guttman\""
=> "/search?q=author:\"András%20Guttman\""
1.9.3p392 :016 > "/search?q=author:\"Andr\xc3\xa1s%20Guttman\"".encoding
=> #<Encoding:UTF-8>
1.9.3p392 :017 > foo.encoding
=> #<Encoding:UTF-8>
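
One possible reason the irb session shows UTF-8 and no error: a literal typed into irb is tagged with the console's encoding, whereas the same bytes arriving from the web server may be tagged ASCII-8BIT. That is an assumption about the server setup, not something confirmed in this thread. A minimal sketch of how the two labels collide, with variable names chosen for illustration:

```ruby
# The same bytes, once tagged UTF-8 (as in irb) and once tagged ASCII-8BIT
# (assumed here to be how the server hands the query string to Rails).
literal      = "Andr\xC3\xA1s".dup.force_encoding(Encoding::UTF_8)
from_request = "Andr\xC3\xA1s".dup.force_encoding(Encoding::ASCII_8BIT)

puts literal.encoding
puts from_request.encoding

# Joining a UTF-8 string that contains non-ASCII characters with a binary
# string that contains high bytes is what raises the error.
error = begin
  "\u00E1 " + from_request
  nil
rescue Encoding::CompatibilityError => e
  e
end
puts error.class
```

If both strings were ASCII-only, the concatenation would succeed regardless of the labels, which may explain why most requests are unaffected.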

I have read this SO post but I am not sure if I have to go this far or even if this applies.

[Update: since posting, we have added a call to the code in the SO post linked above prior to all requests.]

So the question is: how can I detect the old-style encoding so that I can do the other steps?


2 Answers


First, let's look at the string-manipulation side of things. It looks like using the URI module to unescape and then re-escape the string will work:

2.0.0p0 :007 > foo = "/search?q=author:\"Andr\xc3\xa1s%20Guttman\""
=> "/search?q=author:\"András%20Guttman\""
2.0.0p0 :008 > URI.unescape foo
=> "/search?q=author:\"András Guttman\""
2.0.0p0 :009 > URI.escape URI.unescape foo
=> "/search?q=author:%22Andr%C3%A1s%20Guttman%22"

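So the next question is where to do this. I'd say the problem with trying to detect strings containing \x escape characters is that you can't guarantee those strings weren't meant to be a literal backslash-x rather than an escape (although in practice that is probably a safe assumption).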
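
One way to sidestep the \x question is to look at the bytes rather than the text: a correctly escaped URL is pure ASCII, so any byte above 0x7F in the query string means something was sent unescaped. A minimal sketch, where `raw_non_ascii?` is a hypothetical helper name and not part of any library:

```ruby
# Hypothetical helper: true when the string carries raw bytes above 0x7F,
# i.e. bytes that should have been percent-encoded before being sent.
def raw_non_ascii?(query)
  !query.ascii_only?
end

bogus = "q=author:\"Andr\xC3\xA1s%20Guttman\""
clean = 'q=author:%22Andr%C3%A1s%20Guttman%22'

puts raw_non_ascii?(bogus)  # true
puts raw_non_ascii?(clean)  # false
```

String#ascii_only? inspects the bytes regardless of the encoding label, so the check behaves the same whether the string is tagged UTF-8 or ASCII-8BIT.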

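You might consider just adding a small piece of Rack middleware to do this. See this Railscast for more on Rack. Assuming you only get these in the parameters (i.e., after the ? in the URL), your middleware would look something like this (untested, for illustration only; place it in your /lib folder as reescape_parameters.rb):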

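require 'uri' # possibly not needed?

class ReescapeParameters
  def initialize(app)
    @app = app
  end

  def call(env)
    # Normalize the query string before Rails parses it: decode any %XX
    # sequences, then percent-encode the whole string again.
    # Caveat: URI.escape leaves reserved characters alone, so an encoded
    # %26 or %3D comes back as a literal & or =.
    env['QUERY_STRING'] = URI.escape(URI.unescape(env['QUERY_STRING']))
    @app.call(env)
  end
end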

Then use the middleware by adding a line to your application config or to an initializer. For example, in /config/application.rb (or, alternatively, in an initializer):

config.middleware.use "ReescapeParameters"

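Note that you probably need to catch the parameters in question before Rails has processed any parameters. I'm not sure where in the Rack stack it needs to go, but you are more likely to need: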

config.middleware.insert_before ActionDispatch::ParamsParser, ReescapeParameters

This puts it in the stack before ActionDispatch::ParamsParser. You will need to work out the right module to place it before; this is only a guess. (FYI: there is also an insert_after.)

Update (revised)

If you have to detect these and then send a 301, you could try:

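  def call(env)
    query = env['QUERY_STRING']
    # Encoding#name returns 'ASCII-8BIT'; the constant is Encoding::ASCII_8BIT.
    if query.encoding.name == 'ASCII-8BIT'
      # Location needs the path as well as the re-escaped query string.
      location = "#{env['PATH_INFO']}?#{URI.escape(URI.unescape(query))}"
      # A Rack body must respond to #each, so use an empty array, not ''.
      [301, {'Content-Type' => 'text/plain', 'Location' => location}, []]
    else
      @app.call(env)
    end
  end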

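This is an experiment, and it might match everything. But hopefully "regular" strings are encoded as something else (so you would only get the error for ASCII-8BIT-encoded strings).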
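
If the encoding check does turn out to match every request, a narrower test is whether the query string contains raw non-ASCII bytes. The sketch below is untested against a real Rails stack and rests on assumptions: the class name `RedirectRawQueryBytes` is hypothetical, and it percent-encodes by hand rather than with URI.escape, which was removed in Ruby 3.0.

```ruby
# Sketch of a Rack-style middleware; the class name is hypothetical.
class RedirectRawQueryBytes
  # Matches everything except "!" and the printable ASCII range "#".."~":
  # control characters, spaces, double quotes and bytes above 0x7F.
  UNSAFE = /[^\x21\x23-\x7E]/

  def initialize(app)
    @app = app
  end

  def call(env)
    query = env['QUERY_STRING'].to_s
    # Properly escaped query strings are pure ASCII, so pass them through.
    return @app.call(env) if query.ascii_only?

    location = "#{env['SCRIPT_NAME']}#{env['PATH_INFO']}?#{reescape(query)}"
    [301, { 'Content-Type' => 'text/plain', 'Location' => location }, []]
  end

  private

  # Percent-encode only the offending bytes; existing %XX sequences are kept.
  def reescape(query)
    binary = query.dup.force_encoding(Encoding::BINARY)
    binary.gsub(UNSAFE) { |byte| format('%%%02X', byte.ord) }
  end
end

# Exercise it with a stand-in app and hand-built env hashes (no Rack needed).
app = ->(_env) { [200, { 'Content-Type' => 'text/plain' }, ['ok']] }
middleware = RedirectRawQueryBytes.new(app)

bogus_query = "q=author:\"Andr\xC3\xA1s%20Guttman\"".dup.force_encoding(Encoding::BINARY)
bogus = { 'PATH_INFO' => '/search', 'QUERY_STRING' => bogus_query }
clean = { 'PATH_INFO' => '/search', 'QUERY_STRING' => 'q=author:%22Andr%C3%A1s%20Guttman%22' }

status, headers, = middleware.call(bogus)
puts status                        # 301
puts headers['Location']           # /search?q=author:%22Andr%C3%A1s%20Guttman%22
puts middleware.call(clean).first  # 200
```

Because the redirect target is pure ASCII, the follow-up request passes straight through, so the middleware cannot redirect in a loop.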

Per one of the comments, you could also convert the encoding instead of unescaping and escaping:

location = env['QUERY_STRING'].encode('UTF-8')

But you may still need to URI-escape the resulting string (not sure; it depends on your situation).
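
One thing to watch with the conversion route: on a string tagged ASCII-8BIT that contains high bytes, encode('UTF-8') raises rather than converts, because it tries to transcode each byte. If the bytes are already valid UTF-8 and only the label is wrong, force_encoding is the call that relabels them. A small sketch, with variable names chosen for illustration:

```ruby
raw = "Andr\xC3\xA1s".dup.force_encoding(Encoding::ASCII_8BIT)

# encode tries to transcode and has no mapping for binary bytes above 0x7F.
transcode_error = begin
  raw.encode('UTF-8')
  nil
rescue Encoding::UndefinedConversionError => e
  e
end
puts transcode_error.class  # Encoding::UndefinedConversionError

# force_encoding only changes the label; the bytes stay the same.
relabelled = raw.dup.force_encoding(Encoding::UTF_8)
puts relabelled.valid_encoding?   # true
puts relabelled == "Andr\u00E1s"  # true
```

Checking valid_encoding? after relabelling is a cheap way to confirm the bytes really were UTF-8 before using the string further.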

Answered 2013-06-25T17:50:05.180

Please use CGI::unescapeHTML(string)

Answered 2013-07-01T16:26:36.380