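I have a CSV file supplied by a client that I have to parse with PHP and insert into a database.

Before inserting the data into the database I want to convert it to UTF-8, but I can't seem to find out how.

This is what I get when I try to detect the file's encoding: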

$ enca -d -L zh ./artigos.txt 
    ./artigos.txt: Universal character set 2 bytes; UCS-2; BMP
    CRLF line terminators
    Byte order reversed in pairs (1,2 -> 2,1)

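I tried the iconv function, but it garbles the conversion and the output shows characters that differ from the originals.

The first line of the file (base64-encoded):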

IgAwADMAMQAxADkAIgAsACIANwAzADEAMwA0ADYAMgA2ADQAMAAwADEANQAiACwAIgBBAGcAcgBhAGYAYQBkAG8AcgAgAFIAYQBwAGkAZAAgADkAIABIAGUAYQB2AHkAIABEAHUAdAB5ACIALAAiAEEAZwByAGEAZgBvACAAOQAvADgALAAgADkALwAxADAALAAgADkALwAxADIALAAgADkALwAxADQAIgAsACIAMQAxADAAZgBsAHMAIgAsACIAIgAsACIAIgAsACIAIgAsACIAMAAzADEAMQA5AC4AagBwAGcAIgAsACIAIgAsACIAMQAsADIAMAAiACwAIgA1ADkALAA5ADAAIgAsACIAMgAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIARgBhAGwAcwBlACIADQAK

3 Answers

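Microsoft Excel CSV files are usually little-endian encoded (UCS-2LE); it took me a long time to find that out. If you want to use them with fgetcsv or similar functions, you should convert the file to UTF-8 first.

I do the following: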

$str = file_get_contents($file);
// Convert the raw bytes from UCS-2 little-endian to UTF-8
$str = mb_convert_encoding($str, 'UTF-8', 'UCS-2LE');
file_put_contents("converted_".$file, $str);
Answered 2016-09-02T12:43:02.213

This seems to work (little-endian), although you did not include any non-ASCII characters:

$s='IgAwADMAMQAxADkAIgAsACIANwAzADEAMwA0ADYAMgA2ADQAMAAwADEANQAiACwAIgBBAGcAcgBhAGYAYQBkAG8AcgAgAFIAYQBwAGkAZAAgADkAIABIAGUAYQB2AHkAIABEAHUAdAB5ACIALAAiAEEAZwByAGEAZgBvACAAOQAvADgALAAgADkALwAxADAALAAgADkALwAxADIALAAgADkALwAxADQAIgAsACIAMQAxADAAZgBsAHMAIgAsACIAIgAsACIAIgAsACIAIgAsACIAMAAzADEAMQA5AC4AagBwAGcAIgAsACIAIgAsACIAMQAsADIAMAAiACwAIgA1ADkALAA5ADAAIgAsACIAMgAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIARgBhAGwAcwBlACIADQAK';
$t=base64_decode($s);
echo iconv('UCS-2LE', 'UTF-8', substr($t, 0, -1)); // drop the last byte: the sample ends with a lone 0x0A, half of a 2-byte character
Answered 2012-06-08T05:07:14.497

Python:

One way to encode is:

text -> utf-16-be -> hex

To convert back:

hex to binary, then decode from utf-16-be to text

Note: ucs-2be is deprecated and has been replaced by utf-16-be.

Decoder:

import binascii
code = '098 ... '
decoded_text = binascii.unhexlify(code).decode('utf-16-be')
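
The round trip described above (text -> utf-16-be -> hex, then back) might be sketched as follows. This is only an illustration: the helper names `encode_to_hex` and `decode_from_hex` and the sample string are assumptions added here, not part of the original answer.

```python
import binascii


def encode_to_hex(text: str) -> str:
    """Text -> UTF-16-BE bytes -> hex string (hypothetical helper)."""
    return binascii.hexlify(text.encode('utf-16-be')).decode('ascii')


def decode_from_hex(code: str) -> str:
    """Hex string -> raw bytes -> text decoded as UTF-16-BE (hypothetical helper)."""
    return binascii.unhexlify(code).decode('utf-16-be')


# Sample word taken from the CSV row in the question; any text would do.
sample = 'Agrafador'
hex_code = encode_to_hex(sample)
round_trip = decode_from_hex(hex_code)
print(hex_code)
print(round_trip == sample)
```

For a file like the one in the question, which is little-endian, the same idea would use 'utf-16-le' in place of 'utf-16-be'.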
Answered 2018-04-16T11:10:38.847