python - BeautifulSoup not working with malformed UTF-8 HTML

Question

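I've started playing with BeautifulSoup, but it isn't working. I just tried to get all the links with find_all('a'), and the result is always [] or null. The problem could be caused by ISO/UTF-8 encoding or by malformed HTML, right?

I found that if I use only the code between the <body></body> tags, it works fine, so we can rule out encoding.

So what should I do? Is there a built-in feature that fixes malformed HTML? Maybe use a regex to grab the <body> content? Any clues? This is probably a common problem...

By the way, I'm working with Portuguese (pt_BR), Win64, Python 2.7, and an example page that doesn't work is http://www.tudogostoso.com.br/

Edit: what I've done so far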

import mechanize
from bs4 import BeautifulSoup

# I'm using mechanize
br = mechanize.Browser()
site = 'http://www.tudogostoso.com.br/'
r = br.open(site)

# The returned HTML is OK; I printed it and tested it a lot
html = r.read()

soup = BeautifulSoup(html)

for a in soup.find_all('a', href=True):
    print "Found the URL:", a['href']

# Nothing happens,
# but if html = <body>...</body> (cropped manually) it works and prints all the links

score 0 · Accepted Answer

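Solved, thanks to @abarnert.

html5lib can handle malformed markup. Besides, HTML5 has some new features that can look malformed to people like me, and even to older parsers such as the one BeautifulSoup uses by default. They aren't really malformed.

So, in the end, using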

soup = BeautifulSoup(html, "html5lib")

instead of just

soup = BeautifulSoup(html)

did the trick!
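
For anyone who wants to see the parser choice spelled out, here is a minimal sketch that names the parser explicitly rather than letting BeautifulSoup pick one. It assumes bs4 is installed; html5lib and lxml are optional third-party packages, so the code falls back to the built-in "html.parser" when they are missing. The helper name extract_links, the PARSERS tuple, and the SAMPLE string are made up for illustration and are not from the original post.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Parsers to try, html5lib first. "html5lib" and "lxml" are optional
# third-party packages; "html.parser" ships with Python.
PARSERS = ("html5lib", "lxml", "html.parser")


def extract_links(html, parsers=PARSERS):
    """Return (hrefs, parser_name) using the first parser that is installed.

    Hypothetical helper, written only to illustrate explicit parser selection.
    """
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
        hrefs = [a["href"] for a in soup.find_all("a", href=True)]
        return hrefs, name
    raise RuntimeError("no usable parser among: %s" % ", ".join(parsers))


# Deliberately sloppy markup: the <p> tags are never closed.
SAMPLE = '<body><p><a href="/a">A</a><p><a href="/b">B</a>'
links, used = extract_links(SAMPLE)
print(used, links)
```

Which parser ends up being used depends on what is installed, so the printed parser name will vary between machines. Passing a single-element tuple such as parsers=("html5lib",) forces one specific parser, which is handy when comparing how different parsers treat the same page.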

score -1 · Accepted Answer

To download the page, use a module like requests or urllib2.

The requests module:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.tudogostoso.com.br/')
html = r.content
soup = BeautifulSoup(html)
for a in soup.find_all('a', href=True):
    print "Found the URL:", a['href']

urllib2:

import urllib2
from bs4 import BeautifulSoup

r = urllib2.urlopen('http://www.tudogostoso.com.br/')
html = r.read()
soup = BeautifulSoup(html)
for a in soup.find_all('a', href=True):
    print "Found the URL:", a['href']
