python - Unexpected Python behavior when I don't decode to UTF-8

Question

I have the following function:

import urllib.request

def seek():
    web = urllib.request.urlopen("http://wecloudforyou.com/")
    text = web.read().decode("utf8")
    return text
texto = seek()
print(texto)

When I decode as UTF-8, I get the HTML code with its indentation and line breaks, just as it appears on the actual website.

<!DOCTYPE html>
<html>
    <head>
       <title>We Cloud for You |

If I remove .decode('utf8'), I still get the code, but the layout is gone: the line breaks are replaced by a literal \n.

<!DOCTYPE html>\n<html>\n    <head>\n       <title>We Cloud for You

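So why does this happen? As far as I know, when you decode, you are basically converting some encoded string into Unicode.

My sys.stdout.encoding is CP1252 (the Windows-1252 encoding).

According to this thread: Why does Python print unicode characters when the default encoding is ASCII?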

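- Python outputs non-Unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data.
- Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
- Python gets that setting from the shell's environment.
- The terminal displays output according to its own encoding settings.
- The terminal's encoding is independent of the shell's.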

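So it seems that Python needs to read the text as Unicode first, before it can convert it to CP1252 and print it on the terminal. But I don't understand why, if the text is not decoded, the line breaks are replaced with \n.

sys.getdefaultencoding() returns utf8.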

score 2 · Accepted Answer

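In Python 3, when you pass a bytes value (the raw, undecoded bytes from the network) to print(), you see its representation as a Python bytes literal. That includes showing newline characters as \n escape sequences.

By decoding, you now have a Unicode string value, which print() can handle directly: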

>>> print(b'Newline\nAnother line')
b'Newline\nAnother line'
>>> print(b'Newline\nAnother line'.decode('utf8'))
Newline
Another line

This is perfectly normal behaviour.
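
To make the difference easier to inspect, here is a minimal sketch that captures what print() actually writes in each case. The `raw` value is a made-up stand-in for the result of web.read() (nothing is fetched from the site), and `printed` is a hypothetical helper introduced only for this illustration.

```python
import io
from contextlib import redirect_stdout

# Stand-in for web.read(); an assumption, not data fetched from the real site.
raw = b"<html>\n    <head>"


def printed(value):
    """Return exactly what print(value) would write to stdout."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        print(value)
    return buf.getvalue()


# print() calls str() on its argument; for bytes, str() gives the literal form.
as_bytes = printed(raw)

# After decoding, print() writes the characters themselves, real newline included.
as_text = printed(raw.decode("utf8"))

print(repr(as_bytes))
print(repr(as_text))
```

In the first case the output is a single line containing the two characters backslash and n. In the second, the newline is a real line break, so the terminal shows two lines. Note that the spaces used for indentation are still present in both cases; only the line breaks are displayed differently.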
