I read an mp3 tag (ID3 v1) with eyeD3 and want to work out its encoding. Here is what I tried:

>>> print(type(mp3artist_v1))
<type 'unicode'>

>>> print(type(mp3artist_v1.encode('utf-8')))
<type 'str'>

>>> print(mp3artist_v1)
Zåìôèðà

>>> print(mp3artist_v1.encode('utf-8').decode('cp1252'))
ZÃ¥Ã¬Ã´Ã¨Ã°Ã 

>>> print(u'Zемфира'.encode('utf-8').decode('cp1252'))
ZÐµÐ¼Ñ„Ð¸Ñ€Ð°

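If I use an online tool to decode the value, it says that the value ZÐµÐ¼Ñ„Ð¸Ñ€Ð° can be converted to the correct value by changing encodings CP1252 → UTF-8, and the value Zåìôèðà by changing encodings CP1252 → CP1251.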

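What should I do to get Zемфира from mp3artist_v1? mp3artist_v1.encode('cp1252').decode('cp1251') works well, but how can I detect the possible encoding automatically (only 3 encodings are possible: cp1251, cp1252, utf-8)? I was planning to use the following code: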

def forceDecode(string, codecs=('utf-8', 'cp1251', 'cp1252')):
    # Try each codec in turn and return the first successful decode.
    for i in codecs:
        try:
            print(i)
            return string.decode(i)
        except UnicodeError:
            pass
    print "cannot decode url %s" % ([string])

But this does not help, because I need to encode with one charset first and then decode with another.

1 Answer

This works:

s = u'Zåìôèðà'
print s.encode('latin1').decode('cp1251')
# Zемфира

Explanation: Zåìôèðà is mistakenly treated as a unicode string, but it is really a sequence of bytes that spell Zемфира in cp1251. encode('latin1') converts this "unicode" string back to bytes, using each code point number as the byte value, and decode('cp1251') then converts those bytes back to unicode, this time with the right codec.
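To make the round trip concrete, here is a minimal sketch in Python 3 (the snippets in this thread are Python 2). The helper name `fix_mojibake` and its parameters `wrong` and `right` are made up for illustration. It assumes the tag bytes were cp1251 and were decoded with latin-1 by mistake.

```python
def fix_mojibake(text, wrong="latin-1", right="cp1251"):
    """Re-encode with the codec that was wrongly used, then decode with the right one."""
    return text.encode(wrong).decode(right)


# Reproduce the damage: cp1251 bytes read with the wrong codec.
raw = "Zемфира".encode("cp1251")   # the bytes as assumed to be stored in the tag
garbled = raw.decode("latin-1")    # 'Zåìôèðà'

# Undo it by reversing the wrong decode and applying the right one.
repaired = fix_mojibake(garbled)
print(garbled, "->", repaired)
```

latin-1 maps every byte value 0-255 to the code point with the same number, so the encode step cannot fail for such strings. cp1252 differs from it in the 0x80-0x9F range, so if the wrong decode was really cp1252, passing `wrong="cp1252"` may be the safer choice.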

As for automatic decoding, the following brute-force approach seems to work with your examples:

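import re, itertools

def guess_decode(s):
    encodings = ['cp1251', 'cp1252', 'utf8']

    # Try chains of 2, 4, 6 and 8 alternating encode/decode steps.
    for steps in range(2, 10, 2):
        for encs in itertools.product(encodings, repeat=steps):
            r = s
            try:
                for enc in encs:
                    # unicode -> bytes, then bytes -> unicode, and so on
                    r = r.encode(enc) if isinstance(r, unicode) else r.decode(enc)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue
            # Accept the first result made only of ASCII word characters,
            # whitespace and Russian letters.
            if re.match(ur'^[\w\sа-яА-Я]+$', r):
                print 'debug', encs, r
                return r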

print guess_decode(u'ZÐµÐ¼Ñ„Ð¸Ñ€Ð°')
print guess_decode(u'Zåìôèðà')
print guess_decode(u'ZÃ¥Ã¬Ã´Ã¨Ã°Ã\xA0')

Results:

debug ('cp1252', 'utf8') Zемфира
Zемфира
debug ('cp1252', 'cp1251') Zемфира
Zемфира
debug ('cp1252', 'utf8', 'cp1252', 'cp1251') Zемфира
Zемфира
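For readers on Python 3, the same brute-force idea could be sketched roughly as follows. This is an adaptation, not the code above: the names `ENCODINGS`, `PLAUSIBLE` and `max_pairs` are invented here, the function returns the chain together with the result instead of printing a debug line, and the character class is spelled out explicitly because `\w` is Unicode-aware by default for `str` patterns in Python 3.

```python
import itertools
import re

ENCODINGS = ("cp1251", "cp1252", "utf-8")

# Assumption: a "plausible" result contains only ASCII word characters,
# whitespace and Russian letters.
PLAUSIBLE = re.compile(r"[A-Za-z0-9_\sа-яА-ЯёЁ]+")


def guess_decode(text, encodings=ENCODINGS, max_pairs=2):
    """Try chains of encode/decode pairs until the result looks plausible.

    Returns (chain, decoded_text), or None if no chain produces a match.
    """
    for pairs in range(1, max_pairs + 1):
        for chain in itertools.product(encodings, repeat=2 * pairs):
            value = text
            try:
                for step, enc in enumerate(chain):
                    # Even steps go str -> bytes, odd steps go bytes -> str.
                    value = value.encode(enc) if step % 2 == 0 else value.decode(enc)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue
            if PLAUSIBLE.fullmatch(value):
                return chain, value
    return None


print(guess_decode("Zåìôèðà"))
```

The search cost grows as 3 to the power of the chain length, so `max_pairs` is kept small here. A result is only as trustworthy as the plausibility pattern, which is tuned to this one example.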
answered 2014-04-27T19:06:02.887