python - Convert bytes to a string

Question

I'm using this code to get standard output from an external program:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]

The communicate() method returns an array of bytes:

>>> command_stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

However, I'd like to work with the output as a normal Python string, so that I could print it like this:

>>> print(command_stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

I thought that's what the binascii.b2a_qp() method is for, but when I tried it, I got the same byte array again:

>>> binascii.b2a_qp(command_stdout)
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

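How do I convert the bytes value back to a string? I mean, using the "batteries" instead of doing it manually. And I'd like it to be OK with Python 3.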

score 5012 · Accepted Answer

You need to decode the bytes object to produce a string:

>>> b"abcde"
b'abcde'

# utf-8 is used here because it is a very common encoding, but you
# need to use the encoding your data is actually in.
>>> b"abcde".decode("utf-8") 
'abcde'

See: https://docs.python.org/3/library/stdtypes.html#bytes.decode
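As a minimal sketch of the same idea applied to subprocess output: the child command below is a stand-in (an assumption, chosen only so the example runs wherever Python does) that writes known UTF-8 bytes, and the parent decodes them explicitly.

```python
import subprocess
import sys

# Hypothetical child process: writes the UTF-8 bytes for "cafe" with an
# accented e, plus a newline, straight to its binary stdout.
child_code = r"import sys; sys.stdout.buffer.write(b'caf\xc3\xa9\n')"
raw = subprocess.check_output([sys.executable, "-c", child_code])

# raw is bytes; decode it with the encoding the data is actually in.
text = raw.decode("utf-8")

print(type(raw).__name__, type(text).__name__)
print(ascii(text))
```

If the child used a different encoding, the name passed to decode() would have to change accordingly.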

score 343 · Accepted Answer

You need to decode the byte string and turn it into a character (Unicode) string.

On Python 2

encoding = 'utf-8'
'hello'.decode(encoding)

or

unicode('hello', encoding)

On Python 3

encoding = 'utf-8'
b'hello'.decode(encoding)

or

str(b'hello', encoding)
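A short sketch (Python 3 only) suggesting that the two spellings are interchangeable; the sample bytes are an arbitrary choice for illustration.

```python
data = b"hello \xe2\x9c\x93"  # UTF-8 bytes; the last three encode U+2713

via_method = data.decode("utf-8")
via_constructor = str(data, "utf-8")

# Both spellings yield the same text.
same = via_method == via_constructor
print(same)
```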

score 243 · Accepted Answer

I think this way is easy:

>>> bytes_data = [112, 52, 52]
>>> "".join(map(chr, bytes_data))
'p44'
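It may help to see what this approach implies. The sketch below (Python 3, sample values assumed) shows that joining chr() of each byte gives the same result as decoding with latin-1, so it only yields the intended text when the data really is ASCII or latin-1.

```python
bytes_data = [112, 52, 52, 233]  # 233 is 0xE9

joined = "".join(map(chr, bytes_data))
latin1 = bytes(bytes_data).decode("latin-1")

# chr() maps each integer to the code point with the same number,
# which is what the latin-1 codec does for byte values 0-255.
print(joined == latin1)

# If the bytes are really UTF-8, the per-byte approach splits one
# character into two.
utf8_bytes = "\u00e9".encode("utf-8")   # b'\xc3\xa9'
per_byte = "".join(map(chr, utf8_bytes))
print(len(per_byte), len(utf8_bytes.decode("utf-8")))
```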

score 118 · Accepted Answer

If you don't know the encoding, then to read binary input into a string in a way that is compatible with both Python 3 and Python 2, use the ancient MS-DOS CP437 encoding:

import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))

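Because the encoding is unknown, expect non-English symbols to be translated to characters of cp437 (English characters are not translated, because they match in most single-byte encodings and UTF-8).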
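A minimal sketch of why cp437 is usable as a catch-all here, assuming Python 3: the codec assigns a character to every byte value, so decoding does not raise and the text can be encoded back to the original bytes.

```python
all_bytes = bytes(range(256))

# Every one of the 256 byte values has a cp437 character, so this
# decode cannot fail, and encoding the result restores the input.
text = all_bytes.decode("cp437")
round_tripped = text.encode("cp437")

print(len(text), round_tripped == all_bytes)
```

The decoded characters are only meaningful if the data really was cp437; otherwise they are just a lossless placeholder representation.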

Decoding arbitrary binary input as UTF-8 is unsafe, because you may get this:

>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte

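The same applies to latin-1, which was popular (the default?) for Python 2. See the missing points in Codepage Layout: that is where Python chokes with the infamous "ordinal not in range".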

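UPDATE 20150604: There are rumors that Python 3 has the surrogateescape error strategy for encoding content into binary data without data loss or crashes, but it needs conversion tests, [binary] -> [str] -> [binary], to validate both performance and reliability.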
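The [binary] -> [str] -> [binary] round trip mentioned in that update can be sketched as follows (Python 3 only; this checks correctness on one sample, not performance).

```python
raw = b"\x00\x01\xffsd"

# The undecodable byte 0xff is smuggled into the string as the lone
# surrogate U+DCFF instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogate back into 0xff.
restored = text.encode("utf-8", errors="surrogateescape")

print(restored == raw)
```

Note that a string containing such surrogates cannot be encoded with the default strict handler, so it has to be written back out with surrogateescape as well.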

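UPDATE 20170116: Thanks to a comment by Nearoo, there is also the possibility to slash-escape all unknown bytes with the backslashreplace error handler. That works only for Python 3, so even with this workaround you will still get inconsistent output from different Python versions: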

import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))

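See Python's Unicode Support for details.

UPDATE 20170119: I decided to implement slash-escaping decoding that works for both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.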

# --- preparation

import codecs

def slashescape(err):
    """ codecs error handler. err is UnicodeDecodeError instance. return
    a tuple with a replacement for the undecodable part of the input
    and a position where decoding should continue"""
    #print err, dir(err), err.start, err.end, err.object[:err.start]
    thebytes = bytearray(err.object[err.start:err.end])
    # escape every offending byte; the reported span may cover more than one
    repl = u''.join(u'\\x%02x' % b for b in thebytes)
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))

score 113 · Accepted Answer

In Python 3, the default encoding is "utf-8", so you can directly use:

b'hello'.decode()

This is equivalent to

b'hello'.decode(encoding="utf-8")

On the other hand, in Python 2, encoding defaults to the default string encoding. Thus, you should use:

b'hello'.decode(encoding)

where encoding is the encoding you want.

Note: support for keyword arguments was added in Python 2.7.

score 49 · Accepted Answer

I think you actually want this:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')

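Aaron's answer was correct, except that you need to know which encoding to use. And I believe that Windows uses 'windows-1252'. It only matters if you have some unusual (non-ASCII) characters in your content, but then it will make a difference.

By the way, the fact that it does matter is the reason Python moved to using two different types for binary and text data: it can't convert magically between them, because it doesn't know the encoding unless you tell it! The only way you would know is to read the Windows documentation (or read it here).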

score 40 · Accepted Answer

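Since this question is actually asking about subprocess output, more direct approaches are available. The most modern one is to use subprocess.check_output and pass text=True (Python 3.7+) to automatically decode stdout using the system default encoding: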

text = subprocess.check_output(["ls", "-l"], text=True)

For Python 3.6, Popen accepts an encoding keyword:

>>> from subprocess import Popen, PIPE
>>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
>>> type(text)
str
>>> print(text)
total 0
-rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt

If you're not dealing with subprocess output, the general answer to the question in the title is to decode bytes to text:

>>> b'abcde'.decode()
'abcde'

With no argument, sys.getdefaultencoding() will be used. If your data is not in sys.getdefaultencoding(), then you must specify the encoding explicitly in the decode call:

>>> b'caf\xe9'.decode('cp1250')
'café'
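A small sketch of that behaviour on Python 3, where sys.getdefaultencoding() reports utf-8: the cp1250 bytes from the example above fail under the default and succeed once the encoding is named.

```python
import sys

default = sys.getdefaultencoding()
print(default)

data = "caf\u00e9".encode("cp1250")   # b'caf\xe9'

try:
    data.decode()                      # no argument: UTF-8 is used
    default_ok = True
except UnicodeDecodeError:
    default_ok = False

explicit = data.decode("cp1250")
print(default_ok, ascii(explicit))
```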

score 38 · Accepted Answer

Set universal_newlines to True, i.e.

command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]

score 34 · Accepted Answer

To interpret a byte sequence as text, you have to know the corresponding character encoding:

unicode_text = bytestring.decode(character_encoding)

Example:

>>> b'\xc2\xb5'.decode('utf-8')
'µ'

The ls command may produce output that can't be interpreted as text. File names on Unix may be any sequence of bytes except the slash b'/' and zero b'\0':

>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()

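Trying to decode such byte soup using the utf-8 encoding raises UnicodeDecodeError.

It can be worse. If you use a wrong, incompatible encoding, the decoding may fail silently and produce mojibake: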

>>> '—'.encode('utf-8').decode('cp1252')
'â€”'

The data is corrupted, but your program remains unaware that a failure has occurred.
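As a hedged illustration, this kind of mojibake can sometimes be undone by reversing the mistaken step, provided you know (or correctly guess) which wrong encoding was applied. The sketch assumes the cp1252 mix-up from the example above.

```python
original = "\u2014"   # EM DASH, escaped to keep the source ASCII-only

# The mistake: UTF-8 bytes decoded as cp1252.
garbled = original.encode("utf-8").decode("cp1252")

# The repair: re-encode with the wrong codec, then decode correctly.
# This only works when every garbled character maps back to a byte.
repaired = garbled.encode("cp1252").decode("utf-8")

print(len(garbled), repaired == original)
```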

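In general, which character encoding to use is not embedded in the byte sequence itself. You have to communicate this information out-of-band. Some outcomes are more likely than others, which is why the chardet module exists to guess the character encoding. A single Python script may use multiple character encodings in different places.

ls output can be converted to a Python string using the os.fsdecode() function, which succeeds even for undecodable filenames (on Unix it uses sys.getfilesystemencoding() and the surrogateescape error handler):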

import os
import subprocess

output = os.fsdecode(subprocess.check_output('ls'))

To get the original bytes back, you can use os.fsencode().
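A minimal round-trip sketch, assuming a POSIX system (on Windows the filesystem encoding uses a different error handler and the decode step may fail); the file name is made up for illustration and no file is created.

```python
import os

raw_name = b"report-\xff.txt"   # hypothetical name, not valid UTF-8

name = os.fsdecode(raw_name)    # str; the bad byte is kept as a surrogate
back = os.fsencode(name)        # the original bytes again

print(type(name).__name__, back == raw_name)
```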

If you pass the universal_newlines=True parameter, then subprocess uses locale.getpreferredencoding(False) to decode bytes, e.g., it can be cp1252 on Windows.

To decode the byte stream on the fly, io.TextIOWrapper() can be used.
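A sketch of on-the-fly decoding with io.TextIOWrapper. Here io.BytesIO stands in for a real binary stream such as a pipe, and UTF-8 is an assumed encoding.

```python
import io

# BytesIO is a stand-in for a binary stream like Popen(...).stdout.
binary_stream = io.BytesIO("line one\nli\u00f1e two\n".encode("utf-8"))

# The wrapper decodes incrementally as lines are read.
text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8")
lines = [line.rstrip("\n") for line in text_stream]

print(len(lines))
```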

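Different commands may use different character encodings for their output, e.g., the dir internal command (cmd) may use cp437. To decode its output, you can pass the encoding explicitly (Python 3.6+):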

output = subprocess.check_output('dir', shell=True, encoding='cp437')

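The filenames may differ from os.listdir() (which uses the Windows Unicode API), e.g., '\xb6' can be substituted with '\x14': Python's cp437 codec maps b'\x14' to the control character U+0014 instead of U+00B6 (¶). To support filenames with arbitrary Unicode characters, see "Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string".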

score 28 · Accepted Answer

While @Aaron Maenpaa's answer just works, a user recently asked:

Is there any simpler way? 'fhand.read().decode("ASCII")' [...] It's so long!

You can use:

command_stdout.decode()

decode() has default arguments:

codecs.decode(obj, encoding='utf-8', errors='strict')

score 17 · Accepted Answer

If you get the following when trying decode():

AttributeError: 'str' object has no attribute 'decode'

You can also specify the encoding type directly in a cast:

>>> my_byte_str
b'Hello World'

>>> str(my_byte_str, 'utf-8')
'Hello World'
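One pitfall worth sketching here: calling str() on bytes without an encoding does not decode, it returns the repr of the bytes object.

```python
data = b"Hello World"

as_repr = str(data)            # no encoding: the repr, quotes and prefix included
as_text = str(data, "utf-8")   # encoding given: an actual decode

print(as_repr)   # b'Hello World'
print(as_text)   # Hello World
```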

score 12 · Accepted Answer

If you have had this error:

'utf-8 codec can't decode byte 0x8a'

then it is better to use the following code to convert bytes to a string:

data = b"abcdefg"
text = data.decode("utf-8", "ignore")

Enjoy!
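For comparison, a sketch of how three built-in error handlers treat the same stray byte on Python 3. Note that "ignore" silently drops data, which may or may not be acceptable for your use case.

```python
data = b"abc\x8adef"   # 0x8a is not valid here as UTF-8

ignored = data.decode("utf-8", "ignore")             # byte dropped
replaced = data.decode("utf-8", "replace")           # U+FFFD inserted
escaped = data.decode("utf-8", "backslashreplace")   # kept as a \x8a escape

print(ascii(ignored), ascii(replaced), ascii(escaped))
```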

score 9 · Accepted Answer

I made a function to clean a list

def cleanLists(self, lista):
    lista = [x.strip() for x in lista]
    lista = [x.replace('\n', '') for x in lista]
    lista = [x.replace('\b', '') for x in lista]
    lista = [x.encode('utf8') for x in lista]
    lista = [x.decode('utf8') for x in lista]

    return lista

score 9 · Accepted Answer

For Python 3, this is a much safer and more Pythonic approach to convert from byte to string:

def byte_to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes): # Check if it's in bytes
        print(bytes_or_str.decode('utf-8'))
    else:
        print("Object not of byte type")

byte_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n')

Output:

total 0
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

score 9 · Accepted Answer

When working with data from Windows systems (with \r\n line endings), my answer is

String = Bytes.decode("utf-8").replace("\r\n", "\n")

Why? Try this with a multiline Input.txt:

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8")
open("Output.txt", "w").write(String)

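All your line endings will be doubled (to \r\r\n), leading to extra empty lines. Python's text-read functions usually normalize line endings so that strings use only \n. If you receive binary data from a Windows system, Python does not get a chance to do that. Thus,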

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8").replace("\r\n", "\n")
open("Output.txt", "w").write(String)

will replicate your original file.
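As an alternative sketch, a text wrapper can do the same normalization while decoding. The example assumes UTF-8 data and uses io.BytesIO in place of a real file or pipe.

```python
import io

windows_bytes = b"first\r\nsecond\r\n"

# Manual normalization, as in the answer above.
manual = windows_bytes.decode("utf-8").replace("\r\n", "\n")

# newline=None enables universal newlines when reading, so "\r\n"
# is translated to "\n" during decoding.
wrapper = io.TextIOWrapper(io.BytesIO(windows_bytes), encoding="utf-8", newline=None)
wrapped = wrapper.read()

print(manual == wrapped)
```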

score 5 · Accepted Answer

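From sys - System-specific parameters and functions:

To write or read binary data from/to the standard streams, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').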

score 5 · Accepted Answer

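For your specific case of "run a shell command and get its output as text instead of bytes", on Python 3.7, you should use subprocess.run and pass in text=True (as well as capture_output=True to capture the output)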

command_result = subprocess.run(["ls", "-l"], capture_output=True, text=True)
command_result.stdout  # is a `str` containing your program's stdout

text used to be called universal_newlines, and was changed (well, aliased) in Python 3.7. If you want to support Python versions before 3.7, pass in universal_newlines=True instead of text=True
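A self-contained sketch of this pattern for Python 3.7+. The child command is a stand-in (an assumption so the example does not depend on ls being available), and check=True is an optional extra that raises if the child exits with an error.

```python
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-c", "print('hello from child')"],  # stand-in command
    capture_output=True,  # collect stdout and stderr
    text=True,            # decode them to str instead of bytes
    check=True,           # raise CalledProcessError on a non-zero exit
)

print(type(result.stdout).__name__)
print(result.stdout.strip())
```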

score 4 · Accepted Answer

Decode with .decode(). This will decode the string. Pass in 'utf-8' as the value inside.

Answered 2021-07-09T02:09:41.513

score 3 · Accepted Answer

def toString(string):
    try:
        return string.decode("utf-8")
    except (AttributeError, ValueError):
        # str has no decode() on Python 3; bad bytes raise UnicodeDecodeError
        return string

b = b'97.080.500'
s = '97.080.500'
print(toString(b))
print(toString(s))

score 3 · Accepted Answer

If you want to convert any bytes, not just a string converted to bytes:

import base64
import json

with open("bytesfile", "rb") as infile:
    str1 = base64.b85encode(infile.read())  # ASCII-only bytes

with open("bytesfile", "rb") as infile:
    str2 = json.dumps(list(infile.read()))  # JSON list of integers

This is not very efficient, however. It will turn a 2 MB picture into 9 MB.
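To make the size trade-off concrete, here is a sketch on an in-memory payload (an arbitrary stand-in for file contents) showing both representations and that each one can be reversed.

```python
import base64
import json

payload = bytes(range(256)) * 4   # 1024 arbitrary bytes

# Base85 output is ASCII, so decoding it to str is always safe.
b85_text = base64.b85encode(payload).decode("ascii")
json_text = json.dumps(list(payload))

print(len(payload), len(b85_text), len(json_text))

# Both representations are reversible.
b85_back = base64.b85decode(b85_text)
json_back = bytes(json.loads(json_text))
```

Base85 grows the data by a quarter, while the JSON list of integers costs several characters per byte.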

score 3 · Accepted Answer

Try this

bytes.fromhex('c3a9').decode('utf-8')

Answered 2020-01-19T08:19:02.080
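A brief sketch of the hex route in both directions on Python 3: bytes.fromhex() builds the bytes and bytes.hex() turns them back into a hex string.

```python
raw = bytes.fromhex("c3a9")   # b'\xc3\xa9'
text = raw.decode("utf-8")    # the single character U+00E9

hex_again = text.encode("utf-8").hex()
print(hex_again)   # c3a9
```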

score 2 · Accepted Answer

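Try using this one; this function will ignore all binary data that is not valid in the character set (such as utf-8) and return a clean string. It is tested for python3.6 and above.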

def bin2str(text, encoding = 'utf-8'):
    """Converts a binary to Unicode string by removing all non Unicode char
    text: binary string to work on
    encoding: output encoding *utf-8"""

    return text.decode(encoding, 'ignore')

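Here, the function takes the binary and decodes it (it converts binary data to characters using a Python predefined character set, and the ignore argument ignores all data in the binary that is not valid in that character set), and finally returns your desired string value.

If you are not sure about the encoding, use sys.getdefaultencoding() to get the default encoding of your device.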

score 0 · Accepted Answer

We can decode the bytes object to produce a string using bytes.decode(encoding='utf-8', errors='strict'). See the documentation for details.

Python 3 example:

byte_value = b"abcde"
print("Initial value = {}".format(byte_value))
print("Initial value type = {}".format(type(byte_value)))
string_value = byte_value.decode("utf-8")
# utf-8 is used here because it is a very common encoding, but you need to use the encoding your data is actually in.
print("------------")
print("Converted value = {}".format(string_value))
print("Converted value type = {}".format(type(string_value)))

Output:

Initial value = b'abcde'
Initial value type = <class 'bytes'>
------------
Converted value = abcde
Converted value type = <class 'str'>

Note: In Python 3, the default encoding type is utf-8. So, <byte_string>.decode("utf-8") can also be written as <byte_string>.decode()
