python - Handling non-ASCII strings in Python

Question

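Handling non-ASCII characters in Python is really confusing. Can anyone explain?

I am trying to read a plain text file and replace every non-alphabetic character with a space.

I have a list of characters: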

ignorelist = ('!', '-', '_', '(', ')', ',', '.', ':', ';', '"', '\'', '?', '#', '@', '$', '^', '&', '*', '+', '=', '{', '}', '[', ']', '\\', '|', '<', '>', '/', u'—')

For each token I get, I replace any of those characters in the token with a space by calling:

    for punc in ignorelist:
        token = token.replace(punc, ' ')

Note that there is a non-ASCII character at the end of ignorelist: u'—'

Every time my code encounters that character, it crashes with:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position

I tried to declare the encoding by adding # -*- coding: utf-8 -*- at the top of the file, but it still does not work. Does anyone know why? Thanks!

score 4 · Accepted Answer

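Your file input is not being decoded as UTF-8, so it arrives as a plain byte str. When the comparison reaches that unicode character, Python treats your input as ASCII and fails.

Try reading the file like this instead: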

import codecs
f = codecs.open("test", "r", "utf-8")
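
As a minimal sketch of the same idea, io.open also accepts an encoding argument (on both Python 2 and Python 3) and returns decoded text, so the replacement loop only ever sees unicode. The sample file content, the shortened ignorelist, and the clean_line helper are assumptions made for illustration, not part of the original code.

```python
import io
import os
import tempfile

# Shortened list for illustration; u'\u2014' is the em dash from the question.
ignorelist = (u'!', u',', u'.', u'\u2014')

def clean_line(line):
    """Replace every character in ignorelist with a space (hypothetical helper)."""
    for punc in ignorelist:
        line = line.replace(punc, u' ')
    return line

# Create a small UTF-8 sample file (hypothetical content).
path = os.path.join(tempfile.mkdtemp(), 'test')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'hello\u2014world, again!\n')

# Reading with an explicit encoding yields unicode text, not raw bytes.
with io.open(path, 'r', encoding='utf-8') as f:
    cleaned = [clean_line(line) for line in f]
```

Because every line is already unicode when it reaches clean_line, no implicit ASCII decoding takes place.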

score 2 · Accepted Answer

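You are using Python 2.x, which tries to convert automatically between unicode and plain str objects, but that conversion often fails on non-ASCII characters.

You should not mix unicode and str objects. You can stick to unicode: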
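
To see where the error in the question comes from, the sketch below performs explicitly the ASCII decode that Python 2 attempts implicitly when a str meets a unicode. It is written so that it also runs on Python 3, where bytes and text are never mixed implicitly; the variable names are illustrative.

```python
# The UTF-8 encoding of the em dash is three bytes, the first of which is 0xe2.
dash_bytes = u'\u2014'.encode('utf-8')

# Python 2 decodes a byte str with the 'ascii' codec before combining it with unicode.
# Doing the same decode by hand raises the error reported in the question.
try:
    dash_bytes.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the matching codec succeeds and returns the original character.
roundtrip = dash_bytes.decode('utf-8')
```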

ignorelist = (u'!', u'-', u'_', u'(', u')', u',', u'.', u':', u';', u'"', u'\'', u'?', u'#', u'@', u'$', u'^', u'&', u'*', u'+', u'=', u'{', u'}', u'[', u']', u'\\', u'|', u'<', u'>', u'/', u'—')

if not isinstance(token, unicode):
    token = token.decode('utf-8') # assumes you are using UTF-8
for punc in ignorelist:
    token = token.replace(punc, u' ')

Or use only plain str objects (note the last item):

ignorelist = ('!', '-', '_', '(', ')', ',', '.', ':', ';', '"', '\'', '?', '#', '@', '$', '^', '&', '*', '+', '=', '{', '}', '[', ']', '\\', '|', '<', '>', '/', u'—'.encode('utf-8'))
# and other parts do not need to change

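By encoding u'—' into a str manually, you save Python from attempting the conversion itself.

I suggest using unicode everywhere in your program to avoid this kind of error. If that is too much work, you can use the latter method. Either way, be careful when calling functions from the standard library or third-party modules, since they may expect or return one string type or the other.

# -*- coding: utf-8 -*- only tells Python that your source code is written in UTF-8 (without it you would get a SyntaxError). It does not change how data read from a file is decoded.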
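
One way to follow the "unicode everywhere" advice is to convert every incoming value to text at the boundary, so the rest of the program never sees a byte string. The following is a rough sketch under that assumption; to_text and strip_punctuation are hypothetical helper names, and UTF-8 is assumed as the input encoding.

```python
# The text type is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u'')

def to_text(value, encoding='utf-8'):
    """Return value as text, decoding byte strings with the given encoding."""
    if isinstance(value, text_type):
        return value
    return value.decode(encoding)

def strip_punctuation(token, ignorelist):
    """Replace each character in ignorelist with a space, working on text only."""
    token = to_text(token)
    for punc in ignorelist:
        token = token.replace(to_text(punc), u' ')
    return token

# A deliberately mixed list: byte strings and text are both normalised first.
ignorelist = (u'!', b'-', u'\u2014')
result = strip_punctuation(b'a\xe2\x80\x94b-c!', ignorelist)
```

Normalising both the token and the list entries means the replace call always compares text with text, whichever type the caller passed in.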
