0

我有一个输入字符串 tat 认为是 UTF-8,但不是,需要修复它。该代码在 ruby​​ 2 中,因此 iconv 不再存在,并且 encode 或 force_encode 无法按预期工作:

[5] pry(main)> a='zg\u0142oszeniem'
=> "zg\\u0142oszeniem"
[6] pry(main)> a.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
=> "zg\\u0142oszeniem"
[8] pry(main)> a.encode!(Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => "?")
=> "zg\\u0142oszeniem"
[10] pry(main)> a.force_encoding(Encoding::UTF_8)
=> "zg\\u0142oszeniem"

我该如何解决?

4

1 回答 1

1

这是使用正则表达式的解决方案:

a.gsub(/\\u([0-9a-fA-F]{1,5}|10[0-9a-fA-F]{4})/) { $1.hex.chr(Encoding::UTF_8) } 1

它应该适用于该特定字符串:

[1] pry(main)> before = 'zg\u0142oszeniem'
=> "zg\\u0142oszeniem"
[2] pry(main)> before.split('')
=> ["z", "g", "\\", "u", "0", "1", "4", "2", "o", "s", "z", "e", "n", "i", "e", "m"]
[3] pry(main)> after = before.gsub(/\\u([0-9a-fA-F]{1,5}|10[0-9a-fA-F]{4})/) { $1.hex.chr(Encoding::UTF_8) }
=> "zgłoszeniem"
[4] pry(main)> after.split('')
=> ["z", "g", "ł", "o", "s", "z", "e", "n", "i", "e", "m"]

[1] Unicode 代码点的范围可以从 0 到 10FFFF 16第 3.4 节,字符和编码中的定义 D9),这应该解释了为什么上面的正则表达式看起来像这样。

于 2018-03-17T22:46:52.970 回答