encoding - "Raw" conversion from double UTF-8 to UTF-8 (or from UTF-8 to ANSI)

Question

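I am working with a legacy file that was encoded in UTF-8 twice. For example, the code point ε (U+03B5) should be encoded as CE B5 but was instead encoded as C3 8E C2 B5 (C3 8E is the UTF-8 encoding of U+00CE, and C2 B5 is the UTF-8 encoding of U+00B5).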

The second encoding step was performed on the assumption that the data was encoded in CP-1252.
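
To make the corruption concrete, here is a minimal Ruby sketch of how such a file could come about. The helper names `double_encode` and `hex` are made up for illustration, and the sketch assumes the intermediate step really did treat the UTF-8 bytes as CP-1252, as described above.

```ruby
# Hypothetical helper: reproduce the double encoding described above.
def double_encode(text)
  utf8_bytes = text.encode(Encoding::UTF_8).b              # correct UTF-8 bytes, tagged as binary
  misread = utf8_bytes.force_encoding(Encoding::CP1252)    # wrongly treat each byte as a CP-1252 character
  misread.encode(Encoding::UTF_8)                          # encode those characters as UTF-8 a second time
end

# Hypothetical helper: format the bytes of a string as space-separated hex.
def hex(str)
  str.bytes.map { |b| format('%02X', b) }.join(' ')
end

puts hex("\u03B5")                 # CE B5
puts hex(double_encode("\u03B5"))  # C3 8E C2 B5
```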

To get back to plain UTF-8, I use the following (apparently wrong) command:

iconv --from utf8 --to cp1252 <file.double-utf8 >file.utf8

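My problem is that iconv seems unable to convert some characters back. More precisely, iconv cannot convert characters whose UTF-8 representation contains a byte that maps to a control character in CP-1252. One example is the code point ρ (U+03C1):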

its UTF-8 encoding is CF 81,
the first byte, CF, is re-encoded as C3 8F,
the second byte, 81, is re-encoded as C2 81.

iconv refuses to convert C2 81 back to 81, presumably because it does not know how to map that control character exactly.

echo -e -n '\xc3\x8f\xc2\x81' | iconv --from utf8 --to cp1252
�iconv: illegal input sequence at position 2
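
The same gap can be observed outside iconv. As a hedged illustration, the Ruby sketch below asks Ruby's own CP1252 converter which bytes in the range 0x80-0x9F it cannot decode. `unmapped_cp1252_bytes` is a hypothetical helper, and other implementations may treat these bytes differently.

```ruby
# Hypothetical helper: list the bytes in 0x80..0x9F that Ruby's CP1252
# converter reports as having no Unicode mapping.
def unmapped_cp1252_bytes
  (0x80..0x9F).select do |byte|
    begin
      [byte].pack('C').force_encoding(Encoding::CP1252).encode(Encoding::UTF_8)
      false  # decoding succeeded, so the byte is mapped
    rescue Encoding::UndefinedConversionError
      true   # no character assigned to this byte
    end
  end
end

puts unmapped_cp1252_bytes.map { |b| format('%02X', b) }.join(' ')
```

The byte 0x81 from the ρ example appears in that list, which is why the reverse conversion to CP-1252 cannot produce it.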

How can I tell iconv to perform only the mathematical UTF-8 conversion, without caring about the character mapping?

score 2 · Accepted Answer

echo -e -n '\xc3\x8f\xc2\x81' | iconv --from utf8 --to iso8859-1

Windows-1252 differs from ISO-8859-1 in the range 0x80-0x9F. In your case, for example, 0x81 is U+0081 in ISO 8859-1 but is invalid in Windows-1252.

Check whether the rest of your data was misinterpreted as Windows-1252 or as ISO 8859-1. In general, ISO 8859-1 is the more common case.
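
For comparison, the same round trip can be sketched in Ruby. This is only an illustration under the assumption that the intermediate charset was ISO 8859-1, where byte N corresponds to code point U+00NN; `undo_double_utf8_latin1` is a hypothetical name.

```ruby
# Hypothetical helper: strip one layer of UTF-8, assuming the data was
# misread as ISO 8859-1 before the second encoding.
def undo_double_utf8_latin1(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)                   # decode the outer UTF-8 layer
  text.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)  # code points back to bytes, then relabel
end

fixed = undo_double_utf8_latin1("\xC3\x8F\xC2\x81".b)
puts fixed.bytes.map { |b| format('%02X', b) }.join(' ')  # CF 81
puts fixed == "\u03C1"                                     # true
```

If the data was in fact misread as Windows-1252, characters such as U+20AC have no ISO 8859-1 equivalent and the `encode` call raises `Encoding::UndefinedConversionError`, which is a useful signal for the check suggested above.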

score 0 · Accepted Answer

The following code uses Ruby's low-level encoding functions to force doubly encoded UTF-8 (originating from CP1252) to be rewritten as normal UTF-8.

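#!/usr/bin/env ruby

# Converter that reverses the second encoding step: UTF-8 -> CP1252.
ec = Encoding::Converter.new(Encoding::UTF_8, Encoding::CP1252)

prev_byte = nil

orig_bytes = STDIN.binmode.read.bytes
real_utf8_bytes = String.new  # empty binary (ASCII-8BIT) output buffer

orig_bytes.each do |byte|
  b = byte.chr  # the current byte as a one-byte string

  # Feed one byte at a time; PARTIAL_INPUT lets the converter buffer
  # an incomplete multi-byte sequence until the next byte arrives.
  situation = ec.primitive_convert(b.dup, real_utf8_bytes, nil, nil, Encoding::Converter::PARTIAL_INPUT)

  if situation == :undefined_conversion
    # Only C2 xx sequences (U+0080..U+00BF) are passed through as raw bytes.
    # Compare integers: a "\xC2" literal in a UTF-8 source file never equals
    # a binary string, so the string comparison would always fail.
    if prev_byte != 0xC2
      $stderr.puts format('ERROR found byte 0x%02X in stream (prev %s)',
                          byte, prev_byte ? format('0x%02X', prev_byte) : 'none')
      exit 1
    end

    # Append the raw byte; switch to BINARY so the append cannot raise.
    real_utf8_bytes.force_encoding(Encoding::BINARY)
    real_utf8_bytes << b
    real_utf8_bytes.force_encoding(Encoding::CP1252)
  end

  prev_byte = byte
end

real_utf8_bytes.force_encoding(Encoding::BINARY)
# write (unlike puts) does not append a newline the input did not have.
$stdout.binmode
$stdout.write(real_utf8_bytes)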

It is meant to be used in a pipeline:

cat "$PROBLEMATIC_FILE" | ./fix-double-utf8-encoding.rb > "$CORRECTED_FILE"
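
To try the idea on the examples from the question without going through a file, the same fallback can be sketched as a function that works on a whole string. This is not the script above but a simplified variant: `undo_double_utf8` is a hypothetical name, and the sketch assumes that only code points U+0080..U+009F need to be passed through as raw bytes.

```ruby
# Hypothetical helper: convert each character to CP1252 and fall back to
# the raw byte value when CP1252 has no mapping for it.
def undo_double_utf8(bytes)
  out = String.new  # empty binary (ASCII-8BIT) buffer
  bytes.dup.force_encoding(Encoding::UTF_8).each_char do |ch|
    begin
      out << ch.encode(Encoding::CP1252).b
    rescue Encoding::UndefinedConversionError
      # Assumption: only U+0080..U+009F are passed through unchanged.
      raise unless (0x80..0x9F).cover?(ch.ord)
      out << [ch.ord].pack('C')
    end
  end
  out.force_encoding(Encoding::UTF_8)
end

puts undo_double_utf8("\xC3\x8F\xC2\x81".b) == "\u03C1"  # true
puts undo_double_utf8("\xC3\x8E\xC2\xB5".b) == "\u03B5"  # true
```

Unlike a plain ISO 8859-1 round trip, this variant also handles bytes that CP-1252 maps outside Latin-1, such as 0x82 (U+201A).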
