python - Why does Python insist on using ascii?

Question

When parsing an HTML file with Requests and Beautiful Soup, the following line throws an exception on some web pages:

if 'var' in str(tag.string):

Here is the context:

response = requests.get(url)  
soup = bs4.BeautifulSoup(response.text.encode('utf-8'))

for tag in soup.findAll('script'):
    if 'var' in str(tag.string):    # This is the line throwing the exception
        print(tag.string)

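Here is the exception:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15: ordinal not in range(128)

I have tried with and without the encode('utf-8') call on the BeautifulSoup line, and it makes no difference. I did notice that for the pages that throw the exception, there is an Ã character in a comment in the JavaScript, even though the encoding reported by response.encoding is ISO-8859-1. I realize I could remove the offending characters with unicodedata.normalize, but I would prefer to convert the tag variable to utf-8 and keep the characters. None of the following approaches help to change the variable to utf-8: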

tag.encode('utf-8')
tag.decode('ISO-8859-1').encode('utf-8')
tag.decode(response.encoding).encode('utf-8')

What must I do to this string to turn it into usable utf-8?

score 3 · Accepted Answer

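OK, so basically you are getting an HTTP response encoded in Latin-1. The character giving you trouble is indeed Ã: if you look at a Latin-1 table, you will see that 0xC3 is exactly that character in Latin-1.

I think you blindly tested every combination of decoding and encoding you could imagine. First of all, if you do: if 'var' in str(tag.string): Python will complain whenever tag.string contains non-ASCII characters, because Python 2 falls back to the ascii codec for that implicit conversion.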
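To make that failure concrete, here is a minimal sketch. It is written for Python 3, where the conversion has to be spelled out, so the explicit decode with 'ascii' stands in for the implicit conversion Python 2 performs. The script content in raw is made up for illustration.

```python
# Python 3 sketch: the explicit 'ascii' decode imitates the implicit
# conversion Python 2 applies when byte strings and unicode are mixed.
# The bytes below are a made-up script comment with a non-ASCII
# byte (0xc3) at index 15.
raw = b"/* comment by: \xc3\x83 */"

try:
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    # The ascii codec only accepts bytes in range(128)
    message = str(exc)

print(message)
```

The point is that the error comes from the ascii codec being chosen for you, not from the data being unusable.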

Looking at the code you shared with us, the right approach, IMHO, would be:

import bs4
import requests

response = requests.get(url)
# response.text is already unicode, so from_encoding would be ignored for it;
# pass the raw bytes and let BeautifulSoup decode them with the reported encoding
soup = bs4.BeautifulSoup(response.content, from_encoding=response.encoding)

for tag in soup.findAll('script'):
    # soup now holds unicode strings, so compare against a unicode literal
    # instead of calling str(); tag.string is None for empty script tags
    if tag.string and u'var' in tag.string:
        # encode only when you want utf-8 bytes as output
        print(tag.string.encode('utf-8'))

Edit: It will be useful for you to take a look at the encodings section of the BeautifulSoup 4 documentation.

Basically, the logic is:

You get some bytes encoded in encoding X
You decode X by doing bytes.decode('X'), and this returns a unicode string
You work with unicode
You encode the unicode into some encoding Y for output with ubytes.encode('Y')
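The steps above can be sketched as follows. This is a Python 3 illustration, not code from the page in question: the transcode helper and the sample script text are hypothetical, and Latin-1 and UTF-8 play the roles of X and Y.

```python
def transcode(raw, source_encoding, target_encoding):
    """Decode bytes from encoding X, then encode the text to encoding Y.

    Hypothetical helper for illustration; returns both the unicode text
    and the re-encoded bytes so the middle step stays visible.
    """
    text = raw.decode(source_encoding)       # bytes in X -> unicode
    return text, text.encode(target_encoding)  # unicode -> bytes in Y


# Made-up input: a script line containing U+00C3 (the character Ã),
# stored as Latin-1 bytes, where it is the single byte 0xc3.
raw = "var name = '\u00c3';".encode("latin-1")

text, out = transcode(raw, "latin-1", "utf-8")

# Work with unicode in the middle: no str() call, no ascii codec involved
has_var = "var" in text
```

Keeping the text as unicode between the decode and the encode is what avoids the ascii fallback; the character survives and only its byte representation changes.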

Hope this sheds some light on the problem.

score 2 · Accepted Answer

You can also try using the Unicode Dammit library (it is part of BS4) to parse the page. A detailed description is here: http://scriptcult.com/subcategory_176/article_852-use-beautifulsoup-unicodedammit-with-lxml-html.html
