ruby - Ruby 1.9.3 Dir.glob does not return NFC UTF-8 strings, but NFD instead

Question

When reading file names in Ruby 1.9.3, I am seeing some odd results. For example, take the following test Ruby script, run in a folder containing a file named "Testé.txt":

#!encoding:UTF-8
def inspect_string s
    puts "Source encoding: #{"".encoding}"
    puts "External encoding: #{Encoding.default_external}"
    puts "Name: #{s.inspect}"
    puts "Encoding: #{s.encoding}"
    puts "Chars: #{s.chars.to_a.inspect}"
    puts "Codepoints: #{s.codepoints.to_a.inspect}"
    puts "Bytes: #{s.bytes.to_a.inspect}"
end

def transform_string s
   puts "Testing string #{s}"
   puts s.gsub(/é/u,'TEST')
end

Dir.glob("./*.txt").each do |f|  

   puts RUBY_VERSION + RUBY_PLATFORM

   puts "Inline string works as expected" 
   s = "./Testé.txt" 
   inspect_string s
   puts transform_string s

   puts "File name from Dir.glob does not" 
   inspect_string f
   puts transform_string f

end

On Mac OS X Lion, I see the following results:

1.9.3x86_64-darwin11.4.0
Inline string works as expected
Source encoding: UTF-8
External encoding: UTF-8
Name: "./Testé.txt"
Encoding: UTF-8
Chars: [".", "/", "T", "e", "s", "t", "é", ".", "t", "x", "t"]
Codepoints: [46, 47, 84, 101, 115, 116, 233, 46, 116, 120, 116]
Bytes: [46, 47, 84, 101, 115, 116, 195, 169, 46, 116, 120, 116]
Testing string ./Testé.txt
./TestTEST.txt

File name from Dir.glob does not
Source encoding: UTF-8
External encoding: UTF-8
Name: "./Testé.txt"
Encoding: UTF-8
Chars: [".", "/", "T", "e", "s", "t", "e", "́", ".", "t", "x", "t"]
Codepoints: [46, 47, 84, 101, 115, 116, 101, 769, 46, 116, 120, 116]
Bytes: [46, 47, 84, 101, 115, 116, 101, 204, 129, 46, 116, 120, 116]
Testing string ./Testé.txt
./Testé.txt

The expected last line is

./TestTEST.txt

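The reported encoding says this is a plain UTF-8 string, but regex transformations that involve Unicode characters are not applied correctly: the file name contains "e" followed by the combining accent U+0301 (codepoints 101, 769), so the pattern /é/ (codepoint 233) never matches.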

score 3 · Accepted Answer

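An update on this: Ruby 2.2.0 gained String#unicode_normalize.

f.unicode_normalize!

converts the decomposed NFD string returned by OS X's HFS+ file system into a composed NFC string. You can pass :nfd, :nfkc, or :nfkd if you need a different normalization form.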
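As a minimal sketch of the effect, assuming Ruby 2.2 or later: the string below is built by hand as a stand-in for a name returned by Dir.glob, and the variable names are only illustrative.

```ruby
# Stand-in for a file name as returned by Dir.glob on HFS+ (NFD form):
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
nfd_name = "./Teste\u0301.txt"

# unicode_normalize defaults to :nfc and returns a new string.
nfc_name = nfd_name.unicode_normalize

puts nfd_name.codepoints.length        # one extra code point for the accent
puts nfc_name.codepoints.length
puts nfc_name.gsub(/\u00E9/, 'TEST')   # the composed é (U+00E9) now matches
```

The non-bang form is used here so the original string stays available for comparison; unicode_normalize! modifies the receiver in place, as in the one-liner above.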

score 0 · Accepted Answer

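Posting this in case it is useful to anyone else who runs into the problem:

When you use UTF-8 encoding, Ruby 1.9 and 2.0 work with composed UTF-8 strings, but they do not modify strings received from the operating system. Mac OS X uses decomposed strings: many common accented characters, such as é, are stored as two code points (the base letter followed by a combining accent) that are combined for display. As a result, file system methods often return strings in an unexpected form that is, strictly speaking, valid UTF-8, but decomposed.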

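To fix this, you need to recompose them by converting from the "UTF8-MAC" encoding to UTF-8:

f.encode!('UTF-8','UTF8-MAC')

Do this before using the strings; otherwise you may end up comparing decomposed strings against composed native Ruby strings.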
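A small sketch of that conversion, using a hand-built decomposed string as a stand-in for a file name returned by the file system (the variable names are illustrative only):

```ruby
# Stand-in for a decomposed file name: "e" + U+0301 COMBINING ACUTE ACCENT.
nfd_name = "./Teste\u0301.txt"

# Treat the bytes as UTF8-MAC and transcode them to UTF-8, which composes
# the base letter and the combining accent into a single code point.
# encode returns a new string; encode! would convert in place.
composed = nfd_name.encode('UTF-8', 'UTF8-MAC')

puts composed.codepoints.inspect
puts composed.gsub(/\u00E9/, 'TEST')
```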

This behavior affects every file system call, such as glob, that returns names of files or folders containing Unicode characters.

Apple documentation:
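Since every such call is affected, one option is to funnel results through a single conversion point. The sketch below is only an illustration: compose_paths and composed_glob are hypothetical helper names, not part of Ruby, and the conversion assumes the names really are decomposed UTF-8.

```ruby
# Hypothetical helper: recompose each path by transcoding UTF8-MAC -> UTF-8.
def compose_paths(paths)
  paths.map { |path| path.encode('UTF-8', 'UTF8-MAC') }
end

# Hypothetical wrapper around Dir.glob that returns composed paths.
def composed_glob(pattern)
  compose_paths(Dir.glob(pattern))
end

composed_glob("./*.txt").each do |f|
  puts f.gsub(/\u00E9/, 'TEST')
end
```

The same compose_paths helper could be applied to the results of other calls such as Dir.entries.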

http://developer.apple.com/library/mac/#qa/qa1235/_index.html
