python-2.7 - How do I filter and extract information with Python BeautifulSoup?

Question

I have this code:

import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/categories.html'
pageHtml = urllib.urlopen(url).read()
soup = BeautifulSoup(pageHtml)

for a in soup.select('div.brLeft a[href]'):
    suburl = "http://www.brothersoft.com"+ a['href'].encode('utf-8', 'replace')

    content = urllib.urlopen(suburl).read()
    soup = BeautifulSoup(content)
    for a in soup.select('div.coLeft.cate.mBottom dd a[href]'):
        print "http://www.brothersoft.com"+a['href'].encode('utf-8', 'replace')
        suburl = "http://www.brothersoft.com"+a['href'].encode('utf-8', 'replace')

        content = urllib.urlopen(suburl).read()
        soup = BeautifulSoup(content)
        for a in soup.select('div.freeText dl a[href]'):
            print "http://www.brothersoft.com"+ a['href'].encode('utf-8', 'replace')
            suburl2 = "http://www.brothersoft.com"+ a['href'].encode('utf-8', 'replace')

            content = urllib.urlopen(suburl2).read()
            soup = BeautifulSoup(content)
            for li in soup.select('div.Updated.coLeft li'):
                    print ' '.join(li.stripped_strings).encode('utf-8', 'replace')

When I run this code, it keeps running until it hits this error:

C:\Python27\lib\site-packages\bs4\builder\_htmlparser.py:155: RuntimeWarning: Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.
  "Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))
Traceback (most recent call last):
  File "C:\Documents and Settings\Fairuz\Desktop\soup7.py", line 26, in <module>
    soup = BeautifulSoup(content)
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 183, in __init__
    self._feed()
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 197, in _feed
    self.builder.feed(self.markup)
  File "C:\Python27\lib\site-packages\bs4\builder\_htmlparser.py", line 156, in feed
    raise e
HTMLParser.HTMLParseError: malformed start tag, at line 1, column 18498

What is wrong with this code? It runs at first, getting as far as http://www.brothersoft.com/windows/photo_image/other_image_tools/ and http://www.brothersoft.com/microsoft-office-visio-60485.html, and then I get the error message.
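
The RuntimeWarning in the output already points at a likely cause: the page that fails contains markup that Python's built-in HTMLParser rejects, and the warning recommends installing lxml or html5lib and telling Beautiful Soup to use it. The following is only a minimal sketch of that suggestion, not a verified fix for this particular site. It assumes bs4 is installed and that lxml or html5lib may have been added with pip; the names `make_soup` and `PARSER_PREFERENCE` are made up for illustration. It is written to run on both Python 2.7 and Python 3.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Parsers to try, in order. "html.parser" is the built-in parser that
# produced the HTMLParseError in the traceback, so it is kept only as a
# last resort. (Hypothetical constant name, chosen for this sketch.)
PARSER_PREFERENCE = ("lxml", "html5lib", "html.parser")


def make_soup(markup, parsers=PARSER_PREFERENCE):
    """Parse markup with the first installed parser that accepts it.

    Returns None when no parser could handle the document, so that a
    crawl loop can skip the page instead of crashing.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser is not installed; fall through to the next one.
            continue
        except Exception as exc:
            # The parser is installed but failed on this document.
            print("parser %s failed: %s" % (name, exc))
    return None


if __name__ == "__main__":
    html = '<div class="brLeft"><a href="/windows/">Windows</a></div>'
    soup = make_soup(html)
    if soup is not None:
        for link in soup.select("div.brLeft a[href]"):
            print(link["href"])
```

In the crawl above, each `BeautifulSoup(content)` call could be swapped for `make_soup(content)`, followed by a `continue` when the result is `None`. Giving the nested loops their own variable names instead of reusing `soup` and `a` at every level would also make it easier to see which page a failure came from.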
