python - Find non-ASCII byte strings in Python source code

Question

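All of my Python source files are encoded in UTF-8, and each declares this encoding at the top of the file.

But sometimes the u prefix in front of a unicode string is missing.

Example: Umlauts = "üöä"

The above is a byte string containing non-ASCII characters, and this causes trouble (UnicodeDecodeError).

I tried pylint and python -3, but neither gave me a warning.

I am looking for an automated way to find non-ASCII characters in byte strings.

My source code needs to support Python 2.6 and Python 2.7.

I get this well-known error: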

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)
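
For readers who have not hit this before, here is a minimal sketch of where the error comes from. It assumes the literal was stored as UTF-8 bytes and is then decoded with the ASCII codec, which is what Python 2 does implicitly when a byte string is mixed with unicode. The helper name implicit_decode is made up for illustration.

```python
# -*- coding: utf-8 -*-
# The UTF-8 bytes that Python 2 stores for the un-prefixed literal "üöä".
raw = b"\xc3\xbc\xc3\xb6\xc3\xa4"


def implicit_decode(data):
    """Hypothetical helper: mimic the implicit ASCII decode of Python 2."""
    return data.decode("ascii")


try:
    implicit_decode(raw)
except UnicodeDecodeError as exc:
    # Same kind of failure as the error quoted above.
    message = str(exc)
    print(message)

# Decoding explicitly with the declared source encoding works.
text = raw.decode("utf-8")
```

The first byte, 0xc3, is already outside the ASCII range, so the ASCII decode fails, while the explicit UTF-8 decode yields three characters.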

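By the way: this question is only about Python source code, not about strings read from files or sockets.

Solution

For projects that need to support Python 2.6+, I will use __future__.unicode_literals.
For projects that need to support 2.5, I will use thg435's solution (the ast module).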
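
A minimal sketch of the first option, with illustrative variable names that are not from the original post. With the import at the top of the module, un-prefixed literals are unicode on Python 2.6+, and real byte strings have to be marked with b.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# No u prefix needed any more: this literal is unicode text.
umlauts = "üöä"

# Byte strings must now be requested explicitly.
raw = b"abc"

# type(u"") is unicode on Python 2 and str on Python 3.
is_text = isinstance(umlauts, type(u""))
print(is_text)
```

The import only affects the module it appears in, so it has to be added to every file, and any literal that really must stay a byte string needs a b prefix.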

score 2 · Accepted Answer

Of course, you want to use Python for this!

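import ast, re

# Parse the script into an AST (run this under Python 2, where
# un-prefixed literals are byte strings).
with open("your_script.py") as fp:
    tree = ast.parse(fp.read())

for node in ast.walk(tree):
    # ast.Str covers both str and unicode literals; report only byte
    # strings that contain at least one byte outside the ASCII range.
    if (isinstance(node, ast.Str)
            and isinstance(node.s, str)
            and re.search(r'[\x80-\xFF]', node.s)):
        print 'bad string %r line %d col %d' % (node.s, node.lineno, node.col_offset)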

Note that this does not distinguish between bare and escaped non-ASCII characters (fuß and fu\xdf).
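
If that distinction matters, one possible approach is to look at the raw token text instead of the evaluated string. This is a rough sketch, not part of the answer above: it uses the standard tokenize module, assumes the source has already been decoded to text, and assumes the scanner itself may run on a newer interpreter than the code it checks. The name find_bare_non_ascii is hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import re
import tokenize

NON_ASCII = re.compile(r"[^\x00-\x7f]")
PREFIX = re.compile(r"^[A-Za-z]*")


def find_bare_non_ascii(source):
    """Yield (line, col, token_text) for string literals that lack a
    u prefix and contain raw (unescaped) non-ASCII characters."""
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        tok_type, text, (line, col) = tok[0], tok[1], tok[2]
        if tok_type != tokenize.STRING:
            continue
        prefix = PREFIX.match(text).group(0).lower()
        if "u" in prefix:
            continue  # already a unicode literal
        if NON_ASCII.search(text):
            yield line, col, text


# Line 1: bare character, line 2: escaped, line 3: bare but u-prefixed.
sample = u'a = "fu\u00df"\nb = "fu\\xdf"\nc = u"fu\u00df"\n'
hits = list(find_bare_non_ascii(sample))
```

Only the literal on line 1 of the sample is reported. The escaped form on line 2 is pure ASCII in the source text, so this check skips it, which is the opposite trade-off from the ast version.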
