python - Using unencodable mp4 tag names in utf-8 Python code

Question

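For reasons I don't understand, some of the fields that mp4 files use as tag names contain non-printable characters, at least the way mutagen sees them. The one giving me trouble is '\xa9wrt', which is the tag name for the composer field (!?).

If I run '\xa9wrt'.encode('utf-8') from a Python console, I get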

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: invalid start byte

I'm trying to access this value from a Python file that uses some future-proofing, including:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

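I don't even know how to get the string '\xa9wrt' into my code file, since everything in that file is interpreted as utf-8 and the string I'm interested in evidently can't be written in utf-8. Also, once I have the string '\xa9wrt' in a variable (say, from mutagen), it's hard to work with. For example, "{}".format(the_variable) fails because "{}" is interpreted as u"{}", which once again tries to encode the string as utf-8.

Just naively entering '\xa9wrt' gives me u'\xa9wrt', which is not the same, and nothing else I've tried has worked either: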

>>> u'\xa9wrt' == '\xa9wrt'
False
>>> str(u'\xa9wrt')
'\xc2\xa9wrt'
>>> str(u'\xa9wrt') == '\xa9wrt'
False

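Note that this output is from the console, where it seems I can enter non-Unicode literals. I'm using Spyder on Mac OS, with sys.version = 2.7.6 |Anaconda 1.8.0 (x86_64)| (default, Nov 11 2013, 10:49:09)\n[GCC 4.0.1 (Apple Inc. build 5493)].

How can I work with this string in a Unicode world? Is utf-8 unable to do this?

Update: Thanks, @tsroten, for the answer. It sharpened my understanding, but I still can't get the effect I'm after. Here is a sharper form of the question: how can I reach the two lines marked '??' without resorting to the kinds of tricks I'm using?

Note that the str I'm working with is handed to me by a library. I have to accept it as that type.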

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

tagname = 'a9777274'.decode('hex') # This value comes from a library as a str, not a unicode
if u'\xa9wrt' == tagname:
    # ??: What test could I run that would get me here without resorting to writing my string in hex?
    print("You found the tag you're looking for!")
else:
    print("Keep looking!")

print(str("This will work: {}").format(tagname))
try:
    print("This will throw an exception: {}".format(tagname))
    # ??: Can I reach this line without resorting to converting my format string to a str?
except UnicodeDecodeError:
    print("Threw exception")

Update 2:

I don't think any of the strings you (@tsroten) construct are equal to the one I get from mutagen. That string still seems to cause problems:

>>> u = u'\xa9wrt'
>>> s = u.encode('utf-8')
>>> s2 = '\xa9wrt'
>>> s3 = 'a9777274'.decode('hex')
>>> s2 == s
False
>>> s2 == s3
True
>>> match_tag(s)
We have a match! tagname == ©wrt
Look! We printed tagname and no exception was raised.
>>> match_tag(s2)
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: invalid start byte

score 1 · Accepted Answer

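\xa9 is the copyright symbol. For more information, see C1 Controls and Latin-1 Supplement in the Unicode Standard.

Perhaps the tag ©wrt means "copyright" rather than "composer"?

When you run '\xa9wrt'.encode('utf-8'), you get a UnicodeDecodeError because encode() expects unicode, but you gave it a str. So Python first converts it to unicode, assuming the str is encoded as 'ascii' (or some other default). That is why you get a decoding error while encoding. You should use unicode to fix this: u'\xa9wrt'.encode('utf-8').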
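The hidden step can be made visible by spelling it out. This is only a minimal sketch: it assumes the default codec is ascii (the asker's traceback names utf8 instead), implicit_encode is a hypothetical helper name, and explicit b'' and u'' prefixes are used so the snippet behaves the same on Python 2.7 and Python 3.

```python
def implicit_encode(raw, default_codec='ascii'):
    """Mimic what Python 2 does for str.encode('utf-8'):
    decode with the default codec first, then encode."""
    text = raw.decode(default_codec)   # the hidden step that can fail
    return text.encode('utf-8')


raw_tag = b'\xa9wrt'  # a byte string like the one the asker gets from mutagen

try:
    implicit_encode(raw_tag)
    outcome = 'encoded'
except UnicodeDecodeError:
    # 0xa9 is not valid ASCII, so the hidden decode raises
    outcome = 'UnicodeDecodeError'

print(outcome)

# Starting from text instead of bytes skips the hidden decode entirely.
print(u'\xa9wrt'.encode('utf-8') == b'\xc2\xa9wrt')
```

The first print shows the decode error being raised by the intermediate step; the second shows that encoding a real unicode value produces the two-byte UTF-8 form of the copyright sign.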

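In the Python interpreter, type('') should return <type 'str'> by default. If you first type from __future__ import unicode_literals in the interpreter, type('') should return <type 'unicode'>. You said, "Just naively entering '\xa9wrt' gives me u'\xa9wrt', which is not the same." However, that statement is sometimes right and sometimes wrong. Whether u'\xa9wrt' == '\xa9wrt' evaluates to True or False depends on whether you have imported unicode_literals.

Copy, paste, and save the following to a file (e.g. test.py), then run python test.py from the command line.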

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

tag1 = u'\xa9wrt'
tag2 = '\xa9wrt'
print("tag1 = u'\\xa9wrt'")
print("tag2 = '\\xa9wrt'")
print("tag1: %s" % tag1)
print("tag2: %s" % tag2)
print("type(tag1): %s" % type(tag1))
print("type(tag2): %s" % type(tag2))
print("tag1 == tag2: %s" % (tag1 == tag2))
try:
    print("str(tag1): %s" % str(tag1))
except UnicodeEncodeError:
    print("str(tag1): raises UnicodeEncodeError")
print("tag1.encode('utf-8'): ".encode('utf-8') + tag1.encode('utf-8'))

After copying and pasting the code above into a file and running it with Python 2.7, I get the following output:

tag1 = u'\xa9wrt'
tag2 = '\xa9wrt'
tag1: ©wrt
tag2: ©wrt
type(tag1): <type 'unicode'>
type(tag2): <type 'unicode'>
tag1 == tag2: True
str(tag1): raises UnicodeEncodeError
tag1.encode('utf-8'): ©wrt

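Edit:

Your life will be much easier if your code uses unicode internally. That means that when you receive input, you convert it to unicode, and when you produce output, you convert to str (if needed). So when you receive a str tag name from somewhere, convert it to unicode first.

For example, here is test.py: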

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

def match_tag(tagname):
    if isinstance(tagname, str):
        # tagname comes in as str, so let's convert it
        tagname = tagname.decode('utf-8')  # enter the correct encoding here

    # Now that we have a unicode tag, we can deal with it easily:
    if tagname == '\xa9wrt':
        print("We have a match! tagname == %s" % tagname)
        print("Look! We printed tagname and no exception was raised.")

Then, we run it:

>>> from test import match_tag
>>> u = u'\xa9wrt'
>>> s = u.encode('utf-8')
>>> type(u)
<type 'unicode'>
>>> type(s)
<type 'str'>
>>> match_tag(u)
We have a match! tagname == ©wrt
Look! We printed tagname and no exception was raised.
>>> match_tag(s)
We have a match! tagname == ©wrt
Look! We printed tagname and no exception was raised.

So you need to find out which encoding the input string uses. Then you will be able to convert it from str to unicode, and your code will flow better.
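As a rough sketch of that "decode at the boundary" idea: the helper names to_text and is_composer_tag are made up for illustration, and latin-1 is only an assumed encoding for the incoming bytes, so substitute whatever your data source really uses.

```python
def to_text(tagname, encoding='latin-1'):
    """Return tagname as text, decoding only if it arrived as bytes."""
    if isinstance(tagname, bytes):  # bytes is an alias of str on Python 2.7
        return tagname.decode(encoding)
    return tagname


def is_composer_tag(tagname):
    # Compare text with text, never bytes with text.
    return to_text(tagname) == u'\xa9wrt'


print(is_composer_tag(b'\xa9wrt'))   # raw bytes from a library
print(is_composer_tag(u'\xa9wrt'))   # already text
print(is_composer_tag(b'\xa9nam'))   # a different tag
```

This prints True, True, False. Every comparison happens after the value has been normalized to text, so the type the library hands over no longer matters to the rest of the code.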

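Edit 2:

If you just want to get s2 = '\xa9wrt' working, then you need to decode it properly first. s2 is a str with the default encoding (check sys.getdefaultencoding() to see which one, probably ascii). However, \xa9 is not an ASCII character, so Python automatically escapes it. That is the problem with s2. Try this when feeding it to match_tag():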

>>> s2 = '\xa9wrt'
>>> s2_decoded = s2.decode('unicode_escape')
>>> type(s2_decoded)  # This is unicode, just like we want.
<type 'unicode'>
>>> match_tag(s2_decoded)
We have a match! tagname == ©wrt
Look! We printed tagname and no exception was raised.
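A side note on why this decode succeeds, offered as a sketch rather than a recommendation: the unicode_escape codec maps bytes outside ASCII to the code points of the same value, as latin-1 does, and additionally interprets backslash sequences. The second sample value below is purely hypothetical and only illustrates where the two codecs diverge.

```python
tag = b'\xa9wrt'
# For this tag name the two codecs agree.
print(tag.decode('unicode_escape') == tag.decode('latin-1'))

# A hypothetical value holding a literal backslash followed by 'n'.
tricky = b'a\\nb'
print(len(tricky.decode('latin-1')))         # 4 characters: a, backslash, n, b
print(len(tricky.decode('unicode_escape')))  # 3 characters: a, newline, b
```

If the bytes can ever contain a backslash, unicode_escape will silently change them, so a plain latin-1 decode may be the safer choice for data coming from a library.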

score 1 · Accepted Answer

The string is encoded in Latin-1, so if you want to store it in a UTF-8 file or compare it with a UTF-8 string, just do:

>>> '\xa9wrt'.decode('latin-1').encode('utf-8')
'\xc2\xa9wrt'

Or, if you want to compare it with a Unicode string:

>>> '\xa9wrt'.decode('latin-1') == u'©wrt'
True
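The same decode also covers the formatting case from the question. A small sketch, assuming (as this answer does) that the bytes are Latin-1; the variable names are illustrative only.

```python
raw = b'\xa9wrt'                 # bytes as handed over by the library
text = raw.decode('latin-1')     # now a unicode string, u'\xa9wrt'

# Latin-1 maps each byte 0x00-0xff to the code point of the same value,
# so decoding cannot fail and encoding gives the original bytes back.
assert text.encode('latin-1') == raw
all_bytes = bytes(bytearray(range(256)))
assert all_bytes.decode('latin-1').encode('latin-1') == all_bytes

# A unicode format string now works without any implicit conversion.
message = u"Found tag: {}".format(text)
print(message == u'Found tag: \xa9wrt')
```

This prints True. Because the round trip is lossless, the original byte string can always be recovered with text.encode('latin-1') when it has to be passed back to the library.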

score 0 · Accepted Answer

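I finally found a way to express the string in question in a utf-8 file with unicode_literals: I convert the string to hex and back. Specifically, in the console (which apparently is not in unicode_literals mode), I run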

"".join(["{0:x}".format(ord(c)) for c in '\xa9wrt'])

Then, in my source file, I can create the string I want with

'a9777274'.decode('hex')

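But this can't be the right way to do it, can it? For one thing, if my console ran in full unicode mode, I don't know whether I could even enter the string '\xa9wrt' in the first place to have Python tell me the hex sequence that represents the byte string.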
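A possibly simpler route, offered as a sketch rather than something verified against the asker's setup: from __future__ import unicode_literals only changes unprefixed literals, so a b'' literal stays a byte string and b'\xa9wrt' can be written directly in a utf-8 source file. The snippet below checks that the literal matches the hex round trip; binascii.unhexlify stands in for 'a9777274'.decode('hex') so that it also runs on Python 3.

```python
import binascii

from_hex = binascii.unhexlify('a9777274')   # same bytes as the hex round trip
literal = b'\xa9wrt'                        # byte-string literal, no hex detour

print(from_hex == literal)

# Hex form of the bytes, obtained without the interactive console.
print(binascii.hexlify(literal).decode('ascii'))
```

This prints True and a9777274. With the byte literal available, a test such as tagname == b'\xa9wrt' compares str with str on Python 2.7, which is the comparison the first '??' line was after.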
