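I used the code below (Ruby 1.8.7). It tests every byte with a value of 128 or above (that is, outside ASCII) to check whether it is the start of a valid UTF-8 sequence. If it is not, the byte is assumed to be ISO-8859-1 and is converted to UTF-8.
Since your file is large, this may be slow!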
class String
  # Ensures each char in the returned string is valid UTF-8.
  # Based on http://php.net/manual/en/function.utf8-encode.php#39986
  # Written for Ruby 1.8.7, where a String is a plain sequence of bytes.
  def utf8
    ret = ''
    # Lead-byte masks: (b & m[n]) == m[n-1] means that b starts a
    # sequence with n continuation bytes.
    m = [0xc0, 0xe0, 0xf0, 0xf8, 0xfc, 0xfe]
    # Scan the string.
    # I'd use self.each_byte do |b|, but I need to be able to change i.
    a = self.unpack('C*')
    i = 0
    l = a.length
    while i < l
      b = a[i]
      i += 1
      # If it's ASCII, copy it unchanged.
      if b < 0x80
        ret << b.chr
        next
      end
      # Check whether it's the beginning of a valid UTF-8 sequence.
      # Stop at m.length so that m[n] is never nil.
      n = 1
      n += 1 until n >= m.length || (b & m[n]) == m[n-1]
      # If not, treat it as ISO-8859-1 and convert it to UTF-8.
      if n >= m.length
        ret << [b].pack('U')
        next
      end
      # If yes, check whether the rest of the sequence is UTF-8, too.
      r = [b]
      u = false
      # Do n bytes matching 10bbbbbb follow?
      n.times do
        u = i < l && (a[i] & 0xc0) == 0x80
        # Leave a non-matching byte unconsumed so the main loop
        # examines it on its own.
        break unless u
        r << a[i]
        i += 1
      end
      # If they do, keep the bytes as they are; if not, convert them.
      ret << r.pack(u ? 'C*' : 'U*')
    end
    ret
  end

  def utf8!
    replace utf8
  end
end
# let s be the string containing your file.
s2 = s.utf8
# or
s.utf8!
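
On Ruby 1.9 and later, strings carry an encoding, so as far as I can tell the byte-level code above may raise Encoding::CompatibilityError once binary and UTF-8 pieces are appended to the same string. Below is a minimal sketch of the same idea for newer Rubies, assuming Ruby 2.1+ (for String#scrub and Tempfile.create). The name latin1_fallback_utf8 is hypothetical and not part of the code above. Note that scrub is stricter than the hand-written check: it also rejects overlong forms, surrogates, and 5- and 6-byte sequences, so those bytes are converted as ISO-8859-1 instead of being kept. The second half shows one way to handle a large file line by line, which is safe because the newline byte 0x0A never appears inside a multi-byte UTF-8 sequence.

```ruby
require 'tempfile'

# Hypothetical helper (not part of the original code); assumes Ruby 2.1+.
def latin1_fallback_utf8(str)
  # Tag a copy as UTF-8 so that scrub can locate the invalid bytes.
  tagged = str.dup.force_encoding(Encoding::UTF_8)
  # scrub yields each run of invalid bytes; map every byte to the
  # ISO-8859-1 code point with the same value.
  tagged.scrub { |bytes| bytes.unpack('C*').pack('U*') }
end

# Demo input: an ISO-8859-1 "e acute" (0xE9) mixed with UTF-8 sequences.
mixed = "caf\xE9 d\xC3\xA9j\xC3\xA0\n".b
puts latin1_fallback_utf8(mixed)

# For a large file, convert one line at a time instead of loading it all.
# The temp file here only stands in for your real input file.
Tempfile.create('mixed') do |src|
  src.binmode
  src.write(mixed * 3)
  src.flush
  File.open(src.path, 'rb') do |f|
    f.each_line { |line| print latin1_fallback_utf8(line) }
  end
end
```

In a real script you would write each converted line to an output file rather than printing it.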