python - Information missing when parsing an HTML page with BeautifulSoup

Question

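I am writing a web spider to collect some information from a website. When I parse this page, http://www.tripadvisor.com/Hotels-g294265-oa120-Singapore-Hotels.html#ACCOM_OVERVIEW, I find that some information is missing. I printed the parsed document with soup.prettify(), and it differs from the document I get from urllib2.urlopen(): some content is missing. The code is as follows: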

    htmlDoc = urllib2.urlopen(sourceUrl).read()
    soup = BeautifulSoup(htmlDoc)

    subHotelUrlTags = soup.findAll(name='a', attrs={'class' : 'property_title'})
    print len(subHotelUrlTags)
    #if len(subHotelUrlTags) != 30:
    #   print soup.prettify()
    for hotelUrlTag in subHotelUrlTags:
        hotelUrls.append(website + hotelUrlTag['href'])

I tried doing the same thing with HTMLParser, and it printed the following error:

    Traceback (most recent call last):
      File "./spider_new.py", line 47, in <module>
        hotelUrls = getHotelUrls()
      File "./spider_new.py", line 40, in getHotelUrls
        hotelParser.close()
      File "/usr/lib/python2.6/HTMLParser.py", line 112, in close
        self.goahead(1)
      File "/usr/lib/python2.6/HTMLParser.py", line 164, in goahead
        self.error("EOF in middle of construct")
      File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParser.HTMLParseError: EOF in middle of construct, at line 3286, column 1

score 1 · Accepted Answer

Download and install lxml.

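It can parse "broken" pages like this one. (The HTML is probably malformed in some odd way, and Python's built-in HTML parser is not good at coping with that kind of thing, even with bs4's help.)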
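One way to check whether the parser, rather than the download, is dropping content is to count a known marker in the raw text and compare it with what the parser reports. The following is only a minimal sketch, assuming Python 3 and the standard library; the class and function names and the sample string are made up for illustration, and the regex count is a rough heuristic, not a real HTML check.

```python
import re
from html.parser import HTMLParser


class PropertyTitleCounter(HTMLParser):
    """Counts <a> tags whose class attribute includes "property_title"."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "property_title" in classes:
            self.count += 1


def count_raw(html):
    """Rough count of the marker in the raw text, independent of any parser."""
    return len(re.findall(r'class="[^"]*\bproperty_title\b', html))


def count_parsed(html):
    """Count of matching tags that the parser actually reported."""
    parser = PropertyTitleCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical sample; in practice this would be the downloaded page text.
sample = ('<div><a class="property_title" href="/h1">A</a>'
          '<a class="property_title" href="/h2">B</a></div>')
print(count_raw(sample), count_parsed(sample))
```

If the two numbers disagree on the real page, the parser is likely giving up partway through the document, which is consistent with the missing tags described in the question.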

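Also, once lxml is installed you do not need to change your code: when no parser is named explicitly, BeautifulSoup (bs4) picks up lxml automatically and uses it to parse the HTML.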
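If you would rather not depend on automatic selection, the parser can also be named explicitly. The sketch below assumes Python 3 and the bs4 package (both are assumptions, since the question uses Python 2); `pick_parser` and `hotel_urls` are hypothetical helper names, and the `property_title` class is taken from the question's code.

```python
import importlib.util
from urllib.parse import urljoin


def pick_parser():
    """Return "lxml" if it is importable, else the stdlib "html.parser"."""
    return "lxml" if importlib.util.find_spec("lxml") else "html.parser"


def hotel_urls(html, base_url):
    """Collect absolute URLs from <a class="property_title"> tags."""
    # Third-party dependency (pip install beautifulsoup4); imported here
    # so that pick_parser() stays usable without it.
    from bs4 import BeautifulSoup

    # Naming the parser makes the result the same on every machine,
    # instead of depending on which parsers happen to be installed.
    soup = BeautifulSoup(html, pick_parser())
    return [
        urljoin(base_url, a["href"])
        for a in soup.find_all("a", class_="property_title")
        if a.has_attr("href")
    ]
```

Calling `hotel_urls(page_text, "http://www.tripadvisor.com")` on the downloaded page would then return the list that the question builds by hand. Different parsers can repair the same malformed HTML in different ways, so pinning one makes the spider's behaviour easier to reproduce.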
