I have this code:
import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/categories.html'
pageHtml = urllib.urlopen(url).read()
soup = BeautifulSoup(pageHtml)
for a in soup.select('div.brLeft a[href]'):
    suburl = "http://www.brothersoft.com" + a['href'].encode('utf-8', 'replace')
    content = urllib.urlopen(suburl).read()
    soup = BeautifulSoup(content)
    for a in soup.select('div.coLeft.cate.mBottom dd a[href]'):
        print "http://www.brothersoft.com" + a['href'].encode('utf-8', 'replace')
        suburl = "http://www.brothersoft.com" + a['href'].encode('utf-8', 'replace')
        content = urllib.urlopen(suburl).read()
        soup = BeautifulSoup(content)
        for a in soup.select('div.freeText dl a[href]'):
            print "http://www.brothersoft.com" + a['href'].encode('utf-8', 'replace')
            suburl2 = "http://www.brothersoft.com" + a['href'].encode('utf-8', 'replace')
            content = urllib.urlopen(suburl2).read()
            soup = BeautifulSoup(content)
            for li in soup.select('div.Updated.coLeft li'):
                print ' '.join(li.stripped_strings).encode('utf-8', 'replace')
When I run this code, it works for a while and then stops with this error:
C:\Python27\lib\site-packages\bs4\builder\_htmlparser.py:155: RuntimeWarning: Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.
  "Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))
Traceback (most recent call last):
  File "C:\Documents and Settings\Fairuz\Desktop\soup7.py", line 26, in <module>
    soup = BeautifulSoup(content)
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 183, in __init__
    self._feed()
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 197, in _feed
    self.builder.feed(self.markup)
  File "C:\Python27\lib\site-packages\bs4\builder\_htmlparser.py", line 156, in feed
    raise e
HTMLParser.HTMLParseError: malformed start tag, at line 1, column 18498
What is wrong with this code? It runs fine at first, getting as far as http://www.brothersoft.com/windows/photo_image/other_image_tools/ and http://www.brothersoft.com/microsoft-office-visio-60485.html, and then fails with the error message.
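
The warning itself hints at one thing to try: naming an external parser when building the soup instead of relying on Python's built-in HTMLParser. Below is a minimal sketch of that idea, assuming lxml or html5lib has been installed separately (for example with pip). `pick_parser` and `make_soup` are hypothetical helper names invented for illustration, not part of Beautiful Soup, and whether this gets past the malformed tag on that particular page is untested.

```python
import importlib


def pick_parser(candidates=("lxml", "html5lib")):
    """Return the first installed external parser name, else 'html.parser'.

    Hypothetical helper: the names returned are the parser names that
    BeautifulSoup accepts as its second argument.
    """
    for name in candidates:
        try:
            importlib.import_module(name)
        except ImportError:
            continue  # this parser is not installed, try the next one
        return name
    # Fall back to the built-in parser (the one that failed in the traceback).
    return "html.parser"


def make_soup(markup, parser=None):
    """Build a BeautifulSoup object with an explicitly named parser."""
    # Imported here so that pick_parser() can be used on its own
    # even on a machine where bs4 is not installed.
    from bs4 import BeautifulSoup
    return BeautifulSoup(markup, parser or pick_parser())
```

With something like this in place, each `soup = BeautifulSoup(content)` call in the loops could be replaced by `soup = make_soup(content)`, or simply by `BeautifulSoup(content, 'html5lib')` if that parser is known to be installed.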