python - Unicode error when redirecting Python script output to a file

Question

Here is the code:

print '"' + title.decode('utf-8', errors='ignore') + '",' \
      ' "' + title.decode('utf-8', errors='ignore') + '", ' \
      '"' + desc.decode('utf-8', errors='ignore') + '")'

title and desc are returned by Beautiful Soup 3 (p[0].text and p[0].prettify) and, as far as I can tell from the BeautifulSoup3 documentation, they are UTF-8 encoded.

If I run

python.exe script.py > out.txt

I get the following error:

Traceback (most recent call last):
  File "script.py", line 70, in <module>
    '"' + desc.decode('utf-8', errors='ignore') + '")'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 264
: ordinal not in range(128)

However, if I run

python.exe script.py

I get no error. It only happens when an output file is specified.

How can I get good UTF-8 data in the output file?

score 12 · Accepted Answer

You can use the codecs module to write unicode data to a file:

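import codecs

# codecs.open returns a file object that encodes unicode to UTF-8 on write;
# "something" stands for the unicode string you want to save
with codecs.open("out.txt", "w", "utf-8") as out:
    out.write(something)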

'print' writes to standard output, and if your console does not support utf-8, that can cause this kind of error even when you pipe standard output to a file.
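For readers on Python 3 (an assumption, since the question itself is about Python 2), a rough counterpart is the built-in open with an explicit encoding. This is only a sketch: the helper name write_utf8 and the demo file name are made up for illustration.

```python
import os
import tempfile


def write_utf8(path, text):
    """Write text to path as UTF-8, independent of the console code page."""
    # Passing encoding explicitly means the locale default is never consulted.
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)


if __name__ == "__main__":
    # Hypothetical demo: write one non-ASCII word and show the raw bytes.
    demo_path = os.path.join(tempfile.mkdtemp(), "out.txt")
    write_utf8(demo_path, "sm\u00f8rrebr\u00f8d")
    with open(demo_path, "rb") as raw:
        print(repr(raw.read()))
```

Because the file is opened by the script rather than by shell redirection, the result does not depend on what stdout happens to be connected to.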

score 7 · Accepted Answer

Windows behaviour is a bit complicated in this case. You should follow the other advice: use unicode for strings internally and decode on input.

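For your problem, you need to print encoded strings when stdout is redirected (only you know which encoding!), but you have to print unicode strings for plain screen output (and Python or the Windows console handles the conversion to the proper encoding).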

I recommend structuring your script this way:

# -*- coding: utf-8 -*- 
import sys, codecs
# set up output encoding
if not sys.stdout.isatty():
    # here you can set encoding for your 'out.txt' file
    sys.stdout = codecs.getwriter('utf8')(sys.stdout)

# next, you will print all strings in unicode
print u"Unicode string ěščřžý"
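To see what the codecs.getwriter wrapper does without touching the real stdout, here is a small sketch that wraps an in-memory byte buffer instead. It is written for Python 3, where the rough equivalent of the line above would wrap sys.stdout.buffer rather than sys.stdout; the function name encode_through_writer is hypothetical.

```python
import codecs
import io


def encode_through_writer(text, encoding="utf-8"):
    """Push text through a codecs writer and return the bytes it produced."""
    sink = io.BytesIO()  # stands in for a redirected stdout
    writer = codecs.getwriter(encoding)(sink)  # same wrapper as in the answer
    writer.write(text)  # unicode in, encoded bytes out
    return sink.getvalue()


if __name__ == "__main__":
    print(repr(encode_through_writer("Unicode string \u011b\u0161\u010d")))
```

The wrapper accepts unicode text and hands encoded bytes to the underlying stream, which is why print no longer has to fall back to the default codec.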

Update: see also this similar question: Setting the correct encoding when piping stdout in Python

score 1 · Accepted Answer

It makes no sense to convert text to unicode in order to print it. Work with your data in unicode, and convert it to some encoding for output.

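What your code does: you are on Python 2, so your default string type (str) is a bytestring. In your statement, you start with some UTF-8 encoded byte strings, convert them to unicode, and surround them with quotes (regular str that are coerced to unicode in order to combine into one string). You then pass this unicode string to print, which pushes it to sys.stdout. To do so, it needs to turn it into bytes. If you are writing to the Windows console, it can negotiate the encoding somehow, but if you redirect to a regular dumb file, it falls back to ascii and complains because there is no lossless way to do that.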
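The failing step can be reproduced in isolation. The sketch below uses Python 3 syntax (an assumption, the question uses Python 2) and a hypothetical helper try_encode; it encodes the same character from the traceback, u'\xf8', with two codecs.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or the error message if encoding fails."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return str(exc)


if __name__ == "__main__":
    for codec in ("utf-8", "ascii"):
        print(codec, "->", repr(try_encode("\xf8", codec)))
```

UTF-8 has a byte sequence for the character, while ascii has none, which is the same complaint as in the traceback from the question.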

Solution: don't give print a unicode string. "Encode" it yourself to the representation of your choice:

# -*- coding: utf-8 -*-
# The coding line above is needed because the literals contain non-ASCII bytes.
print "Latin-1:", "unicode über alles!".decode('utf-8').encode('latin-1')
print "Utf-8:", "unicode über alles!".decode('utf-8').encode('utf-8')
print "Windows:", "unicode über alles!".decode('utf-8').encode('cp1252')

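All of these should work without complaint when you redirect. The output may not look right on your screen, but open the output file with Notepad or some other tool and check whether your editor is set to view that format. (Utf-8 is the only one that has a hope of being detected. cp1252 is a likely Windows default.)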
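To make the difference between the three outputs concrete, this sketch (Python 3 syntax, hypothetical helper name encoded_forms) returns the raw bytes each encoding produces for the same text.

```python
def encoded_forms(text):
    """Map each encoding name to the bytes it produces for text."""
    return {name: text.encode(name) for name in ("latin-1", "utf-8", "cp1252")}


if __name__ == "__main__":
    for name, data in encoded_forms("\u00fcber").items():
        print(name, repr(data))
```

For this particular string, latin-1 and cp1252 produce identical bytes and only UTF-8 uses a two-byte sequence for the ü, which is part of why an editor cannot reliably guess between the single-byte code pages.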

Once you have that working, clean up your code and avoid using print for file output. Use the codecs module, and open files with codecs.open instead of the plain open.

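PS. If you are decoding a utf-8 string, conversion to unicode should be lossless: you don't need the errors=ignore flag. That flag is appropriate when you are converting to ascii or Latin-2 or the like, and you just want to drop characters that don't exist in the target codepage.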
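A short sketch of the two directions, in Python 3 syntax with a hypothetical helper strip_to_ascii: decoding valid UTF-8 loses nothing, while encoding to ascii with errors="ignore" silently drops what does not fit.

```python
def strip_to_ascii(text):
    """Encode to ASCII, dropping every character ASCII cannot represent."""
    return text.encode("ascii", errors="ignore")


if __name__ == "__main__":
    original = "s\u00f8k"
    # Round trip through UTF-8 returns the same text.
    print(original.encode("utf-8").decode("utf-8") == original)
    # The lossy direction: the non-ASCII character is removed.
    print(repr(strip_to_ascii(original)))
```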

score 0 · Accepted Answer

The problem: if you run this on Windows:

python.exe script.py

the following will be in effect:

sys.stdout.encoding: utf-8
sys.stdout.isatty(): True

However, if you run:

python.exe script.py > out.txt

you will effectively have this:

sys.stdout.encoding: cp1252
sys.stdout.isatty(): False

So, a possible solution (in Python 3.7 or later, where sys.stdout.reconfigure is available):

import sys
if not sys.stdout.isatty():
    sys.stdout.reconfigure(encoding='utf-8')

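# print is a function in Python 3; the .decode() calls assume title and desc
# are bytes, so drop them if the values are already str
print('"' + title.decode('utf-8', errors='ignore') + '",'
      ' "' + title.decode('utf-8', errors='ignore') + '", '
      '"' + desc.decode('utf-8', errors='ignore') + '")')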

See also: How to set sys.stdout encoding in Python 3?
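The effect of reconfigure can be tried without redirecting anything. The sketch below assumes Python 3.7+ and uses an in-memory text stream that starts out as cp1252, standing in for a redirected stdout; the function name write_after_reconfigure is hypothetical.

```python
import io


def write_after_reconfigure(text):
    """Switch a cp1252 text stream to UTF-8, write text, return the bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="cp1252")  # like redirected stdout
    stream.reconfigure(encoding="utf-8")  # same call as on sys.stdout
    stream.write(text)
    stream.flush()  # push buffered text down to the byte buffer
    return raw.getvalue()


if __name__ == "__main__":
    # U+011B is not representable in cp1252, so this would fail without reconfigure.
    print(repr(write_after_reconfigure("\u011b")))
```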
