python - BeautifulSoup / lxml: Are there problems with large elements?

Question

import os, re, sys, urllib2
from bs4 import BeautifulSoup
import lxml

html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/")
soup = BeautifulSoup(html, "lxml")
divs = soup.find_all("div", {"class":"block"})
print len(divs)

Output:

ActivePython 2.7.2.5 (ActiveState Software Inc.) based on
Python 2.7.2 (default, Jun 24 2011, 12:21:10) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, re, sys, urllib2
>>> from bs4 import BeautifulSoup
>>> import lxml
>>>
>>> html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/")
>>> soup = BeautifulSoup(html, "lxml")
>>> divs = soup.find_all("div", {"class":"block"})
>>> print len(divs)
2

I also tried:

divs = soup.find_all(class_="block")

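with the same result.

But there are 11 elements on the page that match this condition. Is there some limit, such as a maximum element size? How can I get all of the elements?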

score 4 · Accepted Answer

The simplest fix is probably to use "html.parser" instead of "lxml":

import os, re, sys, urllib2
from bs4 import BeautifulSoup
import lxml

html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/")
soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", {"class":"block"})
print len(divs)

Your original code (with lxml) printed 1 for me, whereas this version prints 11. lxml is lenient, but not as lenient as html.parser for this page.
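
To see how much the result depends on the parser, the same markup can be run through each parser that happens to be installed. The following is only a minimal sketch, written for Python 3 rather than the Python 2 used above. The helper name count_blocks, the sample string, and the inclusion of "html5lib" are assumptions for illustration; they are not part of the original question or answer.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_blocks(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number of div.block elements found}."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # The parser library is not installed; skip it.
            continue
        counts[name] = len(soup.find_all("div", class_="block"))
    return counts


# Hypothetical stand-in for the downloaded page; the last <div> is unclosed.
sample = '<div class="block">one</div><div class="block">two<div class="block">three'
print(count_blocks(sample))
```

On a real page, differing counts between parsers usually point to invalid markup that each parser repairs in its own way, rather than to a size limit.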

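Note that the page produces many warnings if you run it through tidy, including invalid character codes, unclosed <div>s, and characters like < and / in positions where they are not allowed.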
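
Because the markup is this broken, it can help to count the matching start tags in the raw source without building a tree at all, so that no parser's repair strategy affects the number. This is a rough sketch using only the Python 3 standard library; BlockCounter and count_raw_blocks are hypothetical names, and the sample string merely stands in for the downloaded page.

```python
from html.parser import HTMLParser


class BlockCounter(HTMLParser):
    """Counts <div> start tags whose class attribute contains a given name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_raw_blocks(markup, class_name="block"):
    """Count opening <div class="..."> tags, ignoring how they nest or close."""
    counter = BlockCounter(class_name)
    counter.feed(markup)
    counter.close()
    return counter.count


# Hypothetical sample with unclosed tags, standing in for the real page.
sample = '<div class="block">a<div class="block wide">b</div><div class="other">c'
print(count_raw_blocks(sample))
```

If this raw count is higher than what soup.find_all returns, the elements are present in the source and are being lost during tree construction.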
