python - Converting the upper 128 characters from subprocess.Popen() output in Python for use in JSON

Question

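I am using subprocess.Popen() to call pdfinfo, a program from the xpdf suite, and the text it returns includes some characters from the upper half of the 8-bit character set.

I pass the result to the JSON serializer, which complains when it reaches the character \xae (the ® symbol):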

>>> import json
>>> json.dumps({'a':'Adobe\xae'})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\app\python\2.7.3\lib\json\__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "c:\app\python\2.7.3\lib\json\encoder.py", line 201, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "c:\app\python\2.7.3\lib\json\encoder.py", line 264, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 5: invalid start byte

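How can I fix this? I am quite confused about codecs and about where I should supply the right information so that Python can work out how to handle it.

Edit: if the input came from me (or at least from my own source code) rather than from another process, I could just use a Unicode string literal: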

>>> json.dumps({'a':u'Adobe\u00ae'})
'{"a": "Adobe\\u00ae"}'

and Python would handle it just fine.

But I don't know what hint to give Python so that it decodes the output of pdfinfo into Unicode.

score 2 · Accepted Answer

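First, you need to figure out which character encoding the returned data uses. My guess is Windows-1252, which has the "®" symbol at code point 0xAE. To decode it, you would use the str.decode method: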

raw_data = 'Adobe\xae'
decoded = raw_data.decode('Windows-1252')
print decoded  # Prints "Adobe®"
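The same decode step, followed by the JSON call from the question, can be sketched as below. This is a minimal illustration written for Python 3, where subprocess output arrives as bytes; in Python 2.7, as used in this thread, the same decode call applies to a plain str. The sample bytes are hypothetical, and Windows-1252 is still only a guess about what pdfinfo emits.

```python
import json

# Hypothetical sample of the bytes the child process might return.
raw_data = b'Adobe\xae'

# Assumption: the output is Windows-1252, where byte 0xAE is the (R) sign.
decoded = raw_data.decode('windows-1252')

# Once the value is text rather than raw bytes, json.dumps no longer has
# to guess an encoding and escapes the character itself.
encoded = json.dumps({'a': decoded})
print(encoded)
```

If the guess about the encoding is wrong, the decode call either raises UnicodeDecodeError or silently produces the wrong characters, so it is worth checking the result against a known value.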

score 1 · Accepted Answer

json.dumps has an ensure_ascii parameter.

>>> json.dumps({'a':u'Adobe\u00ae'}, ensure_ascii=False)
u'{"a": "Adobe\xae"}'
>>> print json.dumps({'a':u'Adobe\u00ae'}, ensure_ascii=False)
{"a": "Adobe®"}

If ensure_ascii is False, the result may contain non-ASCII characters, and the return value may be a unicode instance.
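To make the difference concrete, here is a small sketch comparing the two settings. It is written for Python 3, where json.dumps always returns text, and the payload dictionary is just an illustrative value. Note that it starts from an already decoded string; ensure_ascii only changes how the output is written, not how undecoded input bytes are interpreted.

```python
import json

# Illustrative value that is already Unicode text (not raw bytes).
payload = {'a': 'Adobe\u00ae'}

# Default: every non-ASCII character is written as a \uXXXX escape.
escaped = json.dumps(payload)

# ensure_ascii=False: the character is written literally.
literal = json.dumps(payload, ensure_ascii=False)

# Both forms parse back to the same data.
same = json.loads(escaped) == json.loads(literal) == payload
print(escaped, literal, same)
```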

score 0 · Accepted Answer

@Adam's answer about str.decode gave me the hint. On top of that, the pdfinfo program thankfully accepts an encoding argument (-enc [encoding]), so I can pass -enc UTF-8 and then use

raw_data.decode('UTF-8')

in Python.
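Putting the pieces together, the whole pipeline might look roughly like the sketch below. The function names parse_pdfinfo and pdfinfo_json are hypothetical helpers, not part of any library, and the parsing assumes that pdfinfo prints one "Key: value" pair per line. Only the -enc UTF-8 flag and the decode call come from the approach described above.

```python
import json
import subprocess


def parse_pdfinfo(raw_bytes):
    """Decode pdfinfo output (assumed UTF-8) into a dict of text values."""
    text = raw_bytes.decode('utf-8')
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing colons
        # (such as timestamps) stay intact.
        key, sep, value = line.partition(':')
        if sep:
            info[key.strip()] = value.strip()
    return info


def pdfinfo_json(pdf_path):
    """Run pdfinfo with -enc UTF-8 and return its fields as a JSON string."""
    proc = subprocess.Popen(
        ['pdfinfo', '-enc', 'UTF-8', pdf_path],
        stdout=subprocess.PIPE,
    )
    raw_bytes, _ = proc.communicate()
    return json.dumps(parse_pdfinfo(raw_bytes))
```

Calling pdfinfo_json('some.pdf') requires pdfinfo to be on the PATH; parse_pdfinfo can be tried on its own with a bytes literal such as b'Producer: Adobe\xc2\xae Acrobat\n'.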
