python - How to get a tag in BeautifulSoup when the attribute value is Chinese

Question

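I'm not familiar with how BeautifulSoup handles encodings.

Some of the pages I process have attribute values in Chinese, and I want to use such a Chinese attribute value to extract tags.

For example, given HTML like this: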

<P class=img_s>
<A href="/pic/93/b67793.jpg" target="_blank" title="查看大图">
<IMG src="/pic/93/s67793.jpg">
</A>
</P>

I want to extract '/pic/93/b67793.jpg', so what I did was:

img_urls = form_soup.findAll('a',title='查看大图')

and I got:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb2 in position 0: ordinal not in range(128)

To solve this I tried two approaches, both of which failed. The first was:

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

The other was:

response = unicode(response, 'gb2312','ignore').encode('utf-8','ignore')

score 6 · Accepted Answer

You need to pass unicode to the findAll method:

# -*- coding: utf-8 -*-
... 
img_urls = form_soup.findAll('a', title=u'查看大图')

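Note the u unicode literal marker in front of the title value. For this to work you do need to declare the encoding of your source file (the coding comment at the top of the file), or use unicode escape codes instead: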

img_urls = form_soup.findAll('a', title=u'\u67e5\u770b\u5927\u56fe')
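
If you would rather not type the escape codes by hand, one way to produce them is Python's built-in unicode_escape codec. This is only a minimal sketch: the variable names are illustrative, and the file is assumed to be saved as UTF-8 so that it matches the coding comment.

```python
# -*- coding: utf-8 -*-
# Assumption: this file is saved as UTF-8, matching the declaration above.
title = u'查看大图'

# unicode_escape produces the ASCII-only \uXXXX spelling of the text
escaped = title.encode('unicode_escape').decode('ascii')
print(escaped)

# The readable literal and the escaped literal are the same unicode string
same = (title == u'\u67e5\u770b\u5927\u56fe')
print(same)
```

Either spelling can then be passed as the title argument to findAll.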

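Internally, BeautifulSoup uses unicode, but you passed it a byte string containing non-ASCII characters. BeautifulSoup tries to decode it to unicode and fails, because it does not know which encoding you used. By handing it ready-made unicode you sidestep the problem.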
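
The same failure can be reproduced without BeautifulSoup. The sketch below assumes the byte string was GB2312-encoded. That is a guess based on the 0xb2 byte in the traceback and the 'gb2312' in the question, not something the question states. It shows an ASCII decode of those bytes failing and an explicit decode with the right codec succeeding.

```python
# -*- coding: utf-8 -*-
expected = u'\u67e5\u770b\u5927\u56fe'

# Assumption: the bytes in the question were GB2312-encoded
raw = expected.encode('gb2312')
first_byte = bytearray(raw)[0]

# Decoding as ASCII fails on the first non-ASCII byte
try:
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)
print(error)

# Naming the real encoding turns the bytes back into unicode
decoded = raw.decode('gb2312')
print(decoded == expected)
```

Decoding at the boundary like this, then working only with unicode objects, is the same idea as passing a u'' literal to findAll.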

Working example:

>>> from BeautifulSoup import BeautifulSoup
>>> example = u'<P class=img_s>\n<A href="/pic/93/b67793.jpg" target="_blank" title="\u67e5\u770b\u5927\u56fe"><IMG src="/pic/93/s67793.jpg"></A></P>'
>>> soup = BeautifulSoup(example)
>>> soup.findAll('a', title=u'\u67e5\u770b\u5927\u56fe')
[<a href="/pic/93/b67793.jpg" target="_blank" title="查看大图"><img src="/pic/93/s67793.jpg" /></a>]

score 1 · Accepted Answer

Beautiful Soup 4.1.0 automatically converts attribute values from UTF-8, which solves this problem.

Answered 2012-06-23T17:42:38.647
