
I need to convert Arabic text from windows-1256 to utf-8. How can I do that? Any help?

Thanks

4 Answers

3

Try lua-iconv, which binds iconv to Lua.
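A minimal sketch of what that could look like, assuming the lua-iconv module is installed and that the underlying iconv recognizes the charset name "WINDOWS-1256" (some builds may only know it as "CP1256"). The wrapper name `win1256_to_utf8` is made up for illustration and is not part of lua-iconv.

```lua
-- Assumption: lua-iconv is installed; pcall keeps the script alive if it is not.
local ok, iconv = pcall(require, "iconv")

-- Hypothetical helper: returns the UTF-8 string, or nil plus a message.
local function win1256_to_utf8(text)
   if not ok then
      return nil, "lua-iconv is not available"
   end
   -- iconv.new takes the target charset first, then the source charset.
   local cd = iconv.new("UTF-8", "WINDOWS-1256")
   if not cd then
      return nil, "conversion not supported by this iconv"
   end
   local out, err = cd:iconv(text)
   if err then
      return nil, "conversion failed (error code " .. tostring(err) .. ")"
   end
   return out
end
```

Plain ASCII input should come back unchanged, since Windows-1256 and UTF-8 agree on the first 128 code points.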

Answered 2013-05-18T12:57:31.133
2
local win2utf_list = [[
0x00    0x0000  #NULL
0x01    0x0001  #START OF HEADING
0x02    0x0002  #START OF TEXT
-- Download full text from 
-- http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1256.TXT
0xFD    0x200E  #LEFT-TO-RIGHT MARK
0xFE    0x200F  #RIGHT-TO-LEFT MARK
0xFF    0x06D2  #ARABIC LETTER YEH BARREE
]]

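-- Lookup table: one Windows-1256 byte (as a 1-char string) -> its UTF-8 byte sequence
local win2utf = {}
local unpack = table.unpack or unpack  -- Lua 5.2+ / Lua 5.1

for w, u in win2utf_list:gmatch'0x(%x%x)%s+0x(%x+)' do
   -- c: code point still to encode; t: UTF-8 bytes, last byte first;
   -- h: smallest value that no longer fits in the lead byte
   local c, t, h = tonumber(u,16), {}, 128
   while c >= h do
      t[#t+1] = 128 + c%64      -- continuation byte 10xxxxxx
      c = math.floor(c/64)
      h = h > 32 and 32 or h/2
   end
   t[#t+1] = 256 - 2*h + c      -- lead byte (plain ASCII when h is still 128)
   win2utf[string.char(tonumber(w,16))] =
      string.char(unpack(t)):reverse()
end

local function convert_to_utf8(win_string)
   -- parentheses drop gsub's second result (the substitution count)
   return (win_string:gsub('.', win2utf))
end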
Answered 2013-05-18T18:47:23.577
0

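Windows-1256 is one of the character sets designed as an 8-bit overlay of ASCII. It therefore has 256 characters, each encoded as a single byte.

UTF-8 is an encoding of the Unicode character set. Being "universal", Unicode is a superset of the Windows-1256 character set, so no "substitution characters" are needed for characters that are not members of the set, and no information is lost.

Conversion is a simple matter of translating the Windows-1256 byte of each character into the corresponding UTF-8 bytes. A lookup table is an easy way to do it.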

local encoding = {
-- table maps the one byte Windows-1256 encoding for a character to a Lua string with the UTF-8 encoding for the character

"\000"        , "\001"        , "\002"        , "\003"        , "\004"        , "\005"        , "\006"        , "\007"        ,
"\008"        , "\009"        , "\010"        , "\011"        , "\012"        , "\013"        , "\014"        , "\015"        ,
"\016"        , "\017"        , "\018"        , "\019"        , "\020"        , "\021"        , "\022"        , "\023"        ,
"\024"        , "\025"        , "\026"        , "\027"        , "\028"        , "\029"        , "\030"        , "\031"        ,
"\032"        , "\033"        , "\034"        , "\035"        , "\036"        , "\037"        , "\038"        , "\039"        ,
"\040"        , "\041"        , "\042"        , "\043"        , "\044"        , "\045"        , "\046"        , "\047"        ,
"\048"        , "\049"        , "\050"        , "\051"        , "\052"        , "\053"        , "\054"        , "\055"        ,
"\056"        , "\057"        , "\058"        , "\059"        , "\060"        , "\061"        , "\062"        , "\063"        ,
"\064"        , "\065"        , "\066"        , "\067"        , "\068"        , "\069"        , "\070"        , "\071"        ,
"\072"        , "\073"        , "\074"        , "\075"        , "\076"        , "\077"        , "\078"        , "\079"        ,
"\080"        , "\081"        , "\082"        , "\083"        , "\084"        , "\085"        , "\086"        , "\087"        ,
"\088"        , "\089"        , "\090"        , "\091"        , "\092"        , "\093"        , "\094"        , "\095"        ,
"\096"        , "\097"        , "\098"        , "\099"        , "\100"        , "\101"        , "\102"        , "\103"        ,
"\104"        , "\105"        , "\106"        , "\107"        , "\108"        , "\109"        , "\110"        , "\111"        ,
"\112"        , "\113"        , "\114"        , "\115"        , "\116"        , "\117"        , "\118"        , "\119"        ,
"\120"        , "\121"        , "\122"        , "\123"        , "\124"        , "\125"        , "\126"        , "\127"        ,
"\226\130\172", "\217\190"    , "\226\128\154", "\198\146"    , "\226\128\158", "\226\128\166", "\226\128\160", "\226\128\161",
"\203\134"    , "\226\128\176", "\217\185"    , "\226\128\185", "\197\146"    , "\218\134"    , "\218\152"    , "\218\136"    ,
"\218\175"    , "\226\128\152", "\226\128\153", "\226\128\156", "\226\128\157", "\226\128\162", "\226\128\147", "\226\128\148",
"\218\169"    , "\226\132\162", "\218\145"    , "\226\128\186", "\197\147"    , "\226\128\140", "\226\128\141", "\218\186"    ,
"\194\160"    , "\216\140"    , "\194\162"    , "\194\163"    , "\194\164"    , "\194\165"    , "\194\166"    , "\194\167"    ,
"\194\168"    , "\194\169"    , "\218\190"    , "\194\171"    , "\194\172"    , "\194\173"    , "\194\174"    , "\194\175"    ,
"\194\176"    , "\194\177"    , "\194\178"    , "\194\179"    , "\194\180"    , "\194\181"    , "\194\182"    , "\194\183"    ,
"\194\184"    , "\194\185"    , "\216\155"    , "\194\187"    , "\194\188"    , "\194\189"    , "\194\190"    , "\216\159"    ,
"\219\129"    , "\216\161"    , "\216\162"    , "\216\163"    , "\216\164"    , "\216\165"    , "\216\166"    , "\216\167"    ,
"\216\168"    , "\216\169"    , "\216\170"    , "\216\171"    , "\216\172"    , "\216\173"    , "\216\174"    , "\216\175"    ,
"\216\176"    , "\216\177"    , "\216\178"    , "\216\179"    , "\216\180"    , "\216\181"    , "\216\182"    , "\195\151"    ,
"\216\183"    , "\216\184"    , "\216\185"    , "\216\186"    , "\217\128"    , "\217\129"    , "\217\130"    , "\217\131"    ,
"\195\160"    , "\217\132"    , "\195\162"    , "\217\133"    , "\217\134"    , "\217\135"    , "\217\136"    , "\195\167"    ,
"\195\168"    , "\195\169"    , "\195\170"    , "\195\171"    , "\217\137"    , "\217\138"    , "\195\174"    , "\195\175"    ,
"\217\139"    , "\217\140"    , "\217\141"    , "\217\142"    , "\195\180"    , "\217\143"    , "\217\144"    , "\195\183"    ,
"\217\145"    , "\195\185"    , "\217\146"    , "\195\187"    , "\195\188"    , "\226\128\142", "\226\128\143", "\219\146"   
}

--

encoding.convert = function(str)
    assert(type(str) == "string", "Parameter 1 must be a string")
    local result = {}
    for i = 1, string.len(str) do
        table.insert(result, encoding[string.byte(str,i)+1])
    end
    return table.concat(result)
end
assert(encoding.convert("test1") == "test1", "test1 failed")

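References:

Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Roberto Ierusalimschy, Creating Strings Piece by Piece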

Answered 2013-05-18T19:49:02.397
0

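In general, converting from one code page (character set) to another requires a mapping table.

For example, http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1256.TXT maps CP1256 to Unicode.

Then convert from Unicode to UTF-8. That step is a fixed encode/decode method between Unicode code points and UTF-8 bytes, so it needs no large table.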
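The two steps can be sketched as follows, written to run on Lua 5.1 and later. The names `codepoint_to_utf8`, `cp1256_to_unicode` and `cp1256_to_utf8` are made up for illustration. The mapping table holds only three entries taken from CP1256.TXT, so it is an excerpt and not the full table, and the "?" fallback for unmapped bytes is an assumption of this sketch.

```lua
-- Step 2: encode one Unicode code point (0..0x10FFFF) as UTF-8 bytes.
local function codepoint_to_utf8(cp)
   local floor = math.floor
   if cp < 0x80 then
      return string.char(cp)
   elseif cp < 0x800 then
      return string.char(0xC0 + floor(cp / 0x40), 0x80 + cp % 0x40)
   elseif cp < 0x10000 then
      return string.char(0xE0 + floor(cp / 0x1000),
                         0x80 + floor(cp / 0x40) % 0x40,
                         0x80 + cp % 0x40)
   else
      return string.char(0xF0 + floor(cp / 0x40000),
                         0x80 + floor(cp / 0x1000) % 0x40,
                         0x80 + floor(cp / 0x40) % 0x40,
                         0x80 + cp % 0x40)
   end
end

-- Step 1: CP1256 byte -> Unicode code point.
-- Excerpt only (assumption: a real table would list all 128 high bytes).
local cp1256_to_unicode = {
   [0x80] = 0x20AC,  -- EURO SIGN
   [0xC7] = 0x0627,  -- ARABIC LETTER ALEF
   [0xE1] = 0x0644,  -- ARABIC LETTER LAM
}

-- Bytes below 0x80 are ASCII and pass through unchanged.
local function cp1256_to_utf8(s)
   return (s:gsub("[\128-\255]", function(ch)
      local cp = cp1256_to_unicode[ch:byte()]
      return cp and codepoint_to_utf8(cp) or "?"  -- "?" for bytes missing from the excerpt
   end))
end
```

On Lua 5.3 and later, the built-in `utf8.char` could replace the hand-written encoder.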

Answered 2020-11-17T17:22:58.650