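Just make sure you use Unicode for all comparisons. Of course, you have to know the original encoding of the data. Here are four different encodings of the same Unicode characters: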
#!python3
s1 = b'\xce\xd2\xca\xc7\xc3\xc0\xb9\xfa\xc8\xcb'
s2 = b'\xe6\x88\x91\xe6\x98\xaf\xe7\xbe\x8e\xe5\x9b\xbd\xe4\xba\xba'
s3 = b'\x11b/f\x8e\x7f\xfdV\xbaN'
s4 = b'\x11b\x00\x00/f\x00\x00\x8e\x7f\x00\x00\xfdV\x00\x00\xbaN\x00\x00'
u1 = s1.decode('chinese')
u2 = s2.decode('utf8')
u3 = s3.decode('utf-16le')
u4 = s4.decode('utf-32le')
assert(u1==u2==u3==u4)
Convert every text string to Unicode as early as possible. When you write the data back out, encode it in whatever encoding you prefer.
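As a minimal sketch of that "decode early, encode late" advice: the helper name `to_text` and the two input variables below are hypothetical, and the byte strings are the GB2312 and UTF-8 forms of the same text used in the example above.

```python
# Hypothetical inputs: the same text arriving from two sources
# in two different encodings.
from_gb2312_source = b'\xce\xd2\xca\xc7\xc3\xc0\xb9\xfa\xc8\xcb'
from_utf8_source = b'\xe6\x88\x91\xe6\x98\xaf\xe7\xbe\x8e\xe5\x9b\xbd\xe4\xba\xba'

def to_text(raw: bytes, encoding: str) -> str:
    """Decode at the input boundary; the caller must know the encoding."""
    return raw.decode(encoding)

a = to_text(from_gb2312_source, 'gb2312')
b = to_text(from_utf8_source, 'utf-8')

# The raw bytes differ, but the decoded text compares equal.
same_bytes = from_gb2312_source == from_utf8_source
same_text = a == b

# Encode only at the output boundary, in whichever encoding is wanted.
encoded_out = a.encode('utf-16le')
```

Here `same_bytes` is `False` and `same_text` is `True`, which is why the comparison has to happen on decoded strings rather than on the raw bytes.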
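For files marked as deleted with \xE5, first inspect the raw data to determine whether the entry is deleted. There is no need to decode deleted entries to Unicode: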
if rawdata[0] == 0xE5:
    print('deleted')
else:
    print(rawdata.decode('utf-16le'))
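A small self-contained variant of that check might look like the following. The function name `describe` and the sample byte strings are assumptions made for illustration. It relies on the fact that indexing a `bytes` object in Python 3 yields an `int`, so the first byte is compared against `0xE5` rather than against `b'\xe5'`.

```python
DELETED_MARKER = 0xE5  # first byte of a deleted entry

def describe(rawdata: bytes) -> str:
    """Hypothetical helper: report deleted entries without decoding them."""
    # rawdata[0] is an int in Python 3; the slice rawdata[:1] would be bytes.
    if rawdata[0] == DELETED_MARKER:
        return 'deleted'
    return rawdata.decode('utf-16le')

# Sample data (assumed for this sketch): a live name, and the same bytes
# with the first byte overwritten by the deleted marker.
live = '我是美国人'.encode('utf-16le')
deleted = b'\xe5' + live[1:]

live_result = describe(live)
deleted_result = describe(deleted)
```

Testing the marker on the raw bytes first keeps the decoder away from data whose first byte has been overwritten and no longer represents the original text.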
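Edit
I was bored this afternoon, so here is a short FAT32 parser. It does not strictly follow the FAT32 specification; it does just enough to illustrate the decoding: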
#!python3
import binascii
import struct
# struct module unpacking formats
SHORT_ENTRY = '<11s3B7HL' # 12 fields described in FAT32 spec
LONG_ENTRY = '<B10s3B12sH4s' # 8 fields described in FAT32 spec
# attribute bit values (byte offset 11)
ATTR_READ_ONLY = 0x01
ATTR_HIDDEN = 0x02
ATTR_SYSTEM = 0x04
ATTR_VOLUME_ID = 0x08
ATTR_DIRECTORY = 0x10
ATTR_ARCHIVE = 0x20
LAST_LONG_ENTRY = 0x40
ATTR_LONG_NAME = ATTR_READ_ONLY | ATTR_HIDDEN | ATTR_SYSTEM | ATTR_VOLUME_ID
ATTR_LONG_NAME_MASK = ATTR_READ_ONLY | ATTR_HIDDEN | ATTR_SYSTEM | ATTR_VOLUME_ID | ATTR_DIRECTORY | ATTR_ARCHIVE
# A few entries from a FAT32 root directory (32 bytes per row)
data = binascii.unhexlify('''
42 FC 00 69 00 6E 00 6F 00 2E 00 0F 00 D9 6A 00 70 00 67 00 00 00 FF FF FF FF 00 00 FF FF FF FF
01 6C 9A 4B 51 6D 00 61 00 F1 00 0F 00 D9 61 00 6E 00 61 00 20 00 70 00 65 00 00 00 6E 00 67 00
4D 41 A5 41 4E 41 7E 31 4A 50 47 20 00 89 6D 8B FE 40 69 43 00 00 C7 7D 8B 3F 03 00 04 06 7D 00
41 11 62 2F 66 8E 7F FD 56 BA 4E 0F 00 DC 2E 00 74 00 78 00 74 00 00 00 FF FF 00 00 FF FF FF FF
46 32 33 33 7E 31 20 20 54 58 54 20 00 4B BA 7B 69 43 69 43 00 00 BB 7B 69 43 00 00 00 00 00 00
'''.strip().replace(' ','').replace('\n',''))
# Long names are built up from multiple entries, so start empty
raw_long = b''
# Iterate through the 32-byte entries in the data
for offset in range(0,len(data),32):
    raw_entry = data[offset:offset+32]
    # Entries that start with 0xE5 are deleted.
    # An entry that starts with zero indicates no more entries.
    if raw_entry[0] == 0xE5: continue
    if raw_entry[0] == 0: break
    if (raw_entry[11] & ATTR_LONG_NAME_MASK) == ATTR_LONG_NAME:
        # Long entries are found last-to-first and are in three parts
        # per entry. Concatenate the parts and prepend to entries
        # found so far.
        entry = struct.unpack_from(LONG_ENTRY,data,offset)
        raw_long = entry[1] + entry[5] + entry[7] + raw_long
    else:
        entry = struct.unpack_from(SHORT_ENTRY,data,offset)
        # If the short entry is a volume ID, skip it.
        # The attribute byte (offset 11) is field 1 of SHORT_ENTRY.
        if entry[1] & ATTR_VOLUME_ID: continue
        # Unpack and decode 8.3 filename in OEM
        # character set.
        basename = entry[0][:8].decode('cp437').rstrip(' ')
        ext = entry[0][8:].decode('cp437').rstrip(' ')
        # Decode and strip the current long name value of padding.
        long_name = raw_long.decode('utf-16le').rstrip('\uffff').rstrip('\0')
        print('{:8}.{:3} - {}'.format(basename,ext,long_name))
        raw_long = b'' # Reset the long name to empty
Output from an IDE that supports UTF-8 (not the Windows console):
MAÑANA~1.JPG - 马克mañana pengüino.jpg
F233~1  .TXT - 我是美国人.txt
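To illustrate why the output device matters, here is a rough sketch of what happens when one of those long names is encoded for a legacy code page. It assumes the console uses cp437, the same OEM code page the parser uses for the 8.3 names; the actual console encoding depends on the system.

```python
name = '我是美国人.txt'

# Assumption: the console encoding is cp437, which has no Chinese characters.
try:
    name.encode('cp437')
    encodable = True
except UnicodeEncodeError:
    encodable = False

# Two lossy fallbacks using standard codec error handlers:
# 'replace' substitutes '?' for each unencodable character,
# 'backslashreplace' emits an escape sequence instead.
fallback = name.encode('cp437', errors='replace').decode('cp437')
escaped = name.encode('ascii', errors='backslashreplace').decode('ascii')
```

With these inputs `encodable` is `False` and `fallback` is `'?????.txt'`, so printing to such a console either fails or loses the name, while a UTF-8 capable output shows it intact.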