Is there a difference between the capabilities of the lxml and html5lib parsers in the context of BeautifulSoup? I am trying to learn BS4 and am using the following code:

import requests
from bs4 import BeautifulSoup

ret = requests.get('http://www.olivegarden.com')
soup = BeautifulSoup(ret.text, 'html5lib')
for item in soup.find_all('a', href=True):  # skip anchors that have no href
    print(item['href'])

I started out using lxml as the parser, but noticed that for some websites the for loop is never entered even though the page contains valid links. The same page works with the html5lib parser. Are there specific types of pages that might not work with lxml?

I am on Ubuntu, using python-lxml 2.3.2-1 with libxml2 2.7.8.dfsg-5.1ubunt and html5lib-1.0b3.

EDIT: I updated to lxml 3.1.2 and still see the same issue. On a Mac running 3.0.x, though, the same page is parsed properly. The website in question is www.olivegarden.com

1 Answer

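html5lib uses the HTML parsing algorithm that is defined in the HTML specification and implemented in all major browsers. lxml uses libxml2's HTML parser, which is ultimately based on libxml2's XML parser and does not follow the error handling for invalid HTML that is used anywhere else.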

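Most web developers test only with web browsers, standards be damned, so if you want to recover what the page author intended, you will probably need something like html5lib, which matches the behavior of current browsers.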
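One way to check whether a given page is affected is to feed the same markup to each parser and compare how many links come back. The following is only a minimal sketch: `link_counts` is a hypothetical helper name, the sample markup is made up for illustration, and the counts for a real page depend on the installed lxml/libxml2 and html5lib versions.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def link_counts(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <a href=...> tags found}.

    A value of None means that parser is not installed.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser is unavailable
            counts[name] = None
            continue
        counts[name] = len(soup.find_all("a", href=True))
    return counts


if __name__ == "__main__":
    # Deliberately sloppy markup (unclosed tags), made up for illustration
    sample = "<p>Menu<a href='/menu'>See the menu<p>Find us<a href='/locations'>Locations"
    for parser, count in link_counts(sample).items():
        print(parser, count)
```

If the count for lxml is zero or much lower than the count for html5lib on the same `ret.text`, the page most likely contains markup that libxml2 recovers from differently than browsers do, and html5lib is the safer choice for that site.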

Answered 2013-09-04T17:11:16.270