python - De-mojibaking with Python and mutagen

Question

I'm reading mojibaked ID3 tags with mutagen. My goal is to fix the mojibake while learning about encodings and how Python handles them.

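The file I'm working with has an ID3v2 tag, and I'm looking at its album (TALB) frame. According to the encoding byte in the TALB frame, the text is encoded in Latin-1 (ISO-8859-1). However, I know that the bytes in this frame are actually encoded in cp1251 (Cyrillic).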

Here's my code so far:

 >>> from mutagen.mp3 import MP3
 >>> mp3 = MP3(paths[0])
 >>> mp3['TALB']
 TALB(encoding=0, text=[u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'])

As you can see, mp3['TALB'].text[0] is represented here as a Unicode string. However, it's mojibaked:

 >>> print mp3['TALB'].text[0]
 Áóðæóéñêèå ïëÿñêè

I'm not having much luck transcoding these cp1251 bytes into their correct Unicode code points. My best result so far is rather inelegant:

>>> st = ''.join([chr(ord(x)) for x in mp3['TALB'].text[0]]); st
'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
>>> print st.decode('cp1251')
Буржуйские пляски <-- **this is the correct, demojibaked text!**

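As I understand this approach, it works because I end up converting the Unicode string into an 8-bit string, which I can then decode into Unicode while specifying the encoding I'm decoding from.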

The problem is that I can't call decode('cp1251') on the Unicode string directly:

>>> st = mp3['TALB'].text[0]; st
u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
>>> print st.decode('cp1251')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/dmitry/dev/mp3_tag_encode_convert/lib/python2.7/encodings/cp1251.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)

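Can somebody explain this? When I operate directly on the u'' string, I can't figure out how to keep Python from falling back to the 7-bit ascii codec.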

score 5 · Accepted Answer

First, encode it with the encoding you know it was already decoded with (Latin-1 here). This recovers the original bytes.

>>> tag = u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
>>> raw = tag.encode('latin-1'); raw
'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
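This step is lossless because Latin-1 maps every byte value 0..255 to the Unicode code point with the same number, so encoding a Latin-1-decoded string gives back exactly the bytes that were read from the file. A small check of that property, written for Python 3 (the transcripts in this thread are Python 2; the variable names are only illustrative):

```python
# Latin-1 is a one-to-one mapping between bytes 0..255 and code points 0..255,
# so decode followed by encode returns the original bytes unchanged.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")

# Every character has the same number as the byte it came from.
assert [ord(ch) for ch in as_text] == list(range(256))

# Encoding back to Latin-1 recovers the original byte string exactly.
round_tripped = as_text.encode("latin-1")
assert round_tripped == all_bytes
```

This is also why the chr(ord(x)) loop in the question works: it performs the same code-point-to-byte mapping by hand.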

Then you can decode it with the correct encoding.

>>> fixed = raw.decode('cp1251'); print fixed
Буржуйские пляски
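As for the traceback in the question: in Python 2, calling .decode() on a unicode object first implicitly encodes it to a byte string with the default ascii codec, and that hidden step is what raises UnicodeEncodeError. In Python 3, str has no .decode() at all, so the two steps must always be spelled out. Below is a minimal Python 3 sketch of the same round trip. The helper names fix_mojibake and repair_text_frame are hypothetical, and the frame helper only assumes an object with .encoding and .text attributes like the TALB frame shown in the question; it has not been run against mutagen here.

```python
LATIN1 = 0  # ID3v2 text-encoding byte for ISO-8859-1
UTF8 = 3    # ID3v2.4 text-encoding byte for UTF-8 (assumed target encoding)


def fix_mojibake(text, wrong="latin-1", right="cp1251"):
    """Undo a decode done with `wrong` when the bytes were really `right`."""
    return text.encode(wrong).decode(right)


def repair_text_frame(frame, right="cp1251"):
    """Repair a text frame in place (hypothetical helper).

    `frame` is any object with `.encoding` and `.text` attributes.
    Returns True if the frame was changed.
    """
    if frame.encoding != LATIN1:
        return False  # only frames labelled Latin-1 are suspected here
    frame.text = [fix_mojibake(t, "latin-1", right) for t in frame.text]
    frame.encoding = UTF8  # Cyrillic text cannot be stored as Latin-1
    return True


tag = "\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8"
fixed = fix_mojibake(tag)
print(fixed)
```

With mutagen, the usage would presumably be along the lines of loading the tag with mutagen.id3.ID3(path), passing tags['TALB'] to repair_text_frame, and then calling tags.save(). Treat that as an untested outline, and try it on a copy of the file first.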
