我需要将阿拉伯文本从 windows-1256 转换为 utf-8 我该怎么做?有什么帮助吗?
谢谢
试试lua-iconv,它将 iconv 绑定到 Lua。
local win2utf_list = [[
0x00 0x0000 #NULL
0x01 0x0001 #START OF HEADING
0x02 0x0002 #START OF TEXT
-- Download full text from
-- http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1256.TXT
0xFD 0x200E #LEFT-TO-RIGHT MARK
0xFE 0x200F #RIGHT-TO-LEFT MARK
0xFF 0x06D2 #ARABIC LETTER YEH BARREE
]]
local win2utf = {}
for w, u in win2utf_list:gmatch'0x(%x%x)%s+0x(%x+)' do
local c, t, h = tonumber(u,16), {}, 128
while c >= h do
t[#t+1] = 128 + c%64
c = math.floor(c/64)
h = h > 32 and 32 or h/2
end
t[#t+1] = 256 - 2*h + c
win2utf[w.char(tonumber(w,16))] =
w.char((table.unpack or unpack)(t)):reverse()
end
local function convert_to_utf8(win_string)
return win_string:gsub('.', win2utf)
end
Windows-1256 是设计为 8 位 ASCII 覆盖的字符集之一。因此它有 256 个字符,每个字符编码为一个字节。
UTF-8 是 Unicode 字符集的编码。作为“通用”,它是 Windows-1256 字符集的超集。因此,不必使用“替换字符”代替不是字符集成员的字符,不会丢失信息。
转换是将每个字符的 Windows-1256 字节转换为相应的 UTF-8 字节的简单问题。查找表是一种简单的方法。
local encoding = {
-- table maps the one byte Windows-1256 encoding for a character to a Lua string with the UTF-8 encoding for the character
"\000" , "\001" , "\002" , "\003" , "\004" , "\005" , "\006" , "\007" ,
"\008" , "\009" , "\010" , "\011" , "\012" , "\013" , "\014" , "\015" ,
"\016" , "\017" , "\018" , "\019" , "\020" , "\021" , "\022" , "\023" ,
"\024" , "\025" , "\026" , "\027" , "\028" , "\029" , "\030" , "\031" ,
"\032" , "\033" , "\034" , "\035" , "\036" , "\037" , "\038" , "\039" ,
"\040" , "\041" , "\042" , "\043" , "\044" , "\045" , "\046" , "\047" ,
"\048" , "\049" , "\050" , "\051" , "\052" , "\053" , "\054" , "\055" ,
"\056" , "\057" , "\058" , "\059" , "\060" , "\061" , "\062" , "\063" ,
"\064" , "\065" , "\066" , "\067" , "\068" , "\069" , "\070" , "\071" ,
"\072" , "\073" , "\074" , "\075" , "\076" , "\077" , "\078" , "\079" ,
"\080" , "\081" , "\082" , "\083" , "\084" , "\085" , "\086" , "\087" ,
"\088" , "\089" , "\090" , "\091" , "\092" , "\093" , "\094" , "\095" ,
"\096" , "\097" , "\098" , "\099" , "\100" , "\101" , "\102" , "\103" ,
"\104" , "\105" , "\106" , "\107" , "\108" , "\109" , "\110" , "\111" ,
"\112" , "\113" , "\114" , "\115" , "\116" , "\117" , "\118" , "\119" ,
"\120" , "\121" , "\122" , "\123" , "\124" , "\125" , "\126" , "\127" ,
"\226\130\172", "\217\190" , "\226\128\154", "\198\146" , "\226\128\158", "\226\128\166", "\226\128\160", "\226\128\161",
"\203\134" , "\226\128\176", "\217\185" , "\226\128\185", "\197\146" , "\218\134" , "\218\152" , "\218\136" ,
"\218\175" , "\226\128\152", "\226\128\153", "\226\128\156", "\226\128\157", "\226\128\162", "\226\128\147", "\226\128\148",
"\218\169" , "\226\132\162", "\218\145" , "\226\128\186", "\197\147" , "\226\128\140", "\226\128\141", "\218\186" ,
"\194\160" , "\216\140" , "\194\162" , "\194\163" , "\194\164" , "\194\165" , "\194\166" , "\194\167" ,
"\194\168" , "\194\169" , "\218\190" , "\194\171" , "\194\172" , "\194\173" , "\194\174" , "\194\175" ,
"\194\176" , "\194\177" , "\194\178" , "\194\179" , "\194\180" , "\194\181" , "\194\182" , "\194\183" ,
"\194\184" , "\194\185" , "\216\155" , "\194\187" , "\194\188" , "\194\189" , "\194\190" , "\216\159" ,
"\219\129" , "\216\161" , "\216\162" , "\216\163" , "\216\164" , "\216\165" , "\216\166" , "\216\167" ,
"\216\168" , "\216\169" , "\216\170" , "\216\171" , "\216\172" , "\216\173" , "\216\174" , "\216\175" ,
"\216\176" , "\216\177" , "\216\178" , "\216\179" , "\216\180" , "\216\181" , "\216\182" , "\195\151" ,
"\216\183" , "\216\184" , "\216\185" , "\216\186" , "\217\128" , "\217\129" , "\217\130" , "\217\131" ,
"\195\160" , "\217\132" , "\195\162" , "\217\133" , "\217\134" , "\217\135" , "\217\136" , "\195\167" ,
"\195\168" , "\195\169" , "\195\170" , "\195\171" , "\217\137" , "\217\138" , "\195\174" , "\195\175" ,
"\217\139" , "\217\140" , "\217\141" , "\217\142" , "\195\180" , "\217\143" , "\217\144" , "\195\183" ,
"\217\145" , "\195\185" , "\217\146" , "\195\187" , "\195\188" , "\226\128\142", "\226\128\143", "\219\146"
}
--
encoding.convert = function(str)
assert(type(str) == "string", "Parameter 1 must be a string")
local result = {}
for i = 1, string.len(str) do
table.insert(result, encoding[string.byte(str,i)+1])
end
return table.concat(result)
end
assert(encoding.convert("test1") == "test1", "test1 failed")
参考:
Joel Spolsky,每个软件开发人员绝对、绝对必须了解 Unicode 和字符集的绝对最低要求(没有借口!)
Roberto Ierusalimschy,逐段创作弦乐
通常,从一个代码页(字符集)转换为另一个。必须使用映射表。
比如:http ://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1256.TXT ,从 CP1256 到 Unicode。
然后从 Unicode 转换为 Utf8(编码/解码方法在 Unicode 和 UTF-8 之间工作,没有大图)。