python - Remove the first two bytes from a string in Python

Question

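I need to remove the byte order mark (BOM) from a string. I already have code that finds the BOM, but now I need to remove it from the actual string.

To give you an example: the BOM is feff and is 2 bytes long, which means the first two bytes of the string should not appear in the final string. However, when I slice the string in Python, too much is stripped from it.

Code snippet: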

print len(bom)
print as_hex(bom)
print string
print as_hex(string)
string = string[len(bom):]
print string
print as_hex(string)

Output:

2
feff
Organ
feff4f7267616e
rgan
7267616e

What I was hoping to get is:

2
feff
Organ
feff4f7267616e
Organ
4f7267616e

The as_hex() function simply prints the characters as hex ("".join('%02x' % ord(c) for c in bytes)).

score 4 · Accepted Answer

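I think you have a unicode string object. (If you are using Python 3 you certainly do, since it is the only kind of string.) Your as_hex function is not printing "fe" for the first character and "ff" for the second. It is printing "feff" for the first unicode character in the string. For example (Python 3):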

>>> mystr = "\ufeffHello world."
>>> mystr[0]
'\ufeff'
>>> '%02x' % ord(mystr[0])
'feff'

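You either need to remove just one unicode character, or store the string in a bytes object and remove two bytes.

(This does not explain why len(bom) is 2, and I can't tell without seeing more of your code. I'm guessing bom is a list or a bytes object rather than a unicode string.)

My answer above assumes Python 3, but I realize from your print statements that you are using Python 2. Based on that, I'd guess that bom is an ASCII (byte) string while string is a unicode string. If you use print repr(x) instead of print x, you can tell unicode and ASCII strings apart.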
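The two options can be sketched in Python 3 as follows. This is only an illustration: the sample value "\ufeffOrgan" is taken from the question's hex dump, and the names text, raw, clean_text and clean_raw are made up for the example.

```python
import codecs

text = "\ufeffOrgan"            # str: the BOM is a single character, U+FEFF
raw = text.encode("utf-16-be")  # bytes: the same BOM occupies two bytes, FE FF

# Option 1 (text): drop exactly one character if the string starts with the BOM.
clean_text = text[1:] if text.startswith("\ufeff") else text

# Option 2 (bytes): drop len(bom) == 2 bytes if the data starts with the BOM.
bom = codecs.BOM_UTF16_BE       # b'\xfe\xff'
clean_raw = raw[len(bom):] if raw.startswith(bom) else raw

print(repr(clean_text))               # 'Organ'
print(clean_raw.decode("utf-16-be"))  # Organ
```

Either way the slice length has to be counted in the same unit as the object being sliced: characters for str, bytes for bytes.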
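The mismatch guessed at here can be reproduced in Python 3. The sketch below assumes, as the answer does, that bom is a byte string and string is text; the question does not show how either was created, so that part is a guess.

```python
import codecs

bom = codecs.BOM_UTF16_BE   # assumption: a byte string, b'\xfe\xff', so len(bom) == 2
string = "\ufeffOrgan"      # assumption: text, where the BOM is one character

# repr() makes the types visible: b'...' for bytes, '...' for text.
print(repr(bom), len(bom))
print(repr(string), len(string))

# Slicing text by a byte count removes two characters, not two bytes.
too_much = string[len(bom):]
print(repr(too_much))       # 'rgan'

# Removing only the BOM character gives the expected result.
fixed = string[1:] if string.startswith("\ufeff") else string
print(repr(fixed))          # 'Organ'
```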

score 0 · Accepted Answer

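Use the correct codec and the BOM will be handled for you. Decoding with utf-8-sig or utf16 removes a leading BOM if one is present. Encoding with them adds a BOM. If you do not want a BOM, use utf-8, utf-16le, or utf-16be.

In general, decode to Unicode when reading text data into your program, and encode to bytes when writing to a file, console, socket, and so on.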

unicode_str = u'test'
utf8_w_bom = unicode_str.encode('utf-8-sig')
utf16_w_bom = unicode_str.encode('utf16')
utf8_wo_bom = unicode_str.encode('utf-8')
utf16_wo_bom = unicode_str.encode('utf-16le')
print repr(utf8_w_bom)
print repr(utf16_w_bom)
print repr(utf8_wo_bom)
print repr(utf16_wo_bom)
print repr(utf8_w_bom.decode('utf-8-sig'))
print repr(utf16_w_bom.decode('utf16'))
print repr(utf8_wo_bom.decode('utf-8-sig'))
print repr(utf16_wo_bom.decode('utf16'))

Output:

'\xef\xbb\xbftest'
'\xff\xfet\x00e\x00s\x00t\x00'
'test'
't\x00e\x00s\x00t\x00'
u'test'
u'test'
u'test'
u'test'

Note that utf16 assumes the native byte order when decoding if no BOM is present.
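For readers on Python 3, the same behaviour can be checked with a short sketch. The variable names are illustrative only, and the native-order fallback shown is the behaviour described in the note above.

```python
import sys

# Decoding with a BOM-aware codec drops a leading BOM if one is present.
from_utf8_bom = b"\xef\xbb\xbftest".decode("utf-8-sig")
from_utf8_plain = b"test".decode("utf-8-sig")   # no BOM: text is unchanged
from_utf16_bom = b"\xff\xfet\x00e\x00s\x00t\x00".decode("utf-16")

# Without a BOM, "utf-16" falls back to the machine's native byte order.
native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
no_bom = "test".encode(native)                  # explicit-endian codecs add no BOM
from_native = no_bom.decode("utf-16")

# Naming the byte order explicitly avoids depending on the platform.
from_explicit = "test".encode("utf-16-be").decode("utf-16-be")

print(from_utf8_bom, from_utf8_plain, from_utf16_bom, from_native, from_explicit)
```

If the byte order of BOM-less UTF-16 data is known, decoding with utf-16le or utf-16be is the safer choice, since the result then does not vary between machines.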
