I am writing a web crawler in Python, but it sometimes raises an HTMLParseError:
junk characters in start tag: u'\u201dTPL_password_1\u201d\r\n\t\t', at line 21285, column 6
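The message says the error is at line 21285. Does that mean line 21285 of the HTML source? If not, how can I find out which piece of HTML triggered the error, and which URL was being parsed at the time?
My parser class, simplified, looks like this: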
    from HTMLParser import HTMLParser, HTMLParseError

    class ParsePage(HTMLParser):
        """Parse the given page content using HTMLParser."""

        def __init__(self):
            HTMLParser.__init__(self)

        def handle_starttag(self, tag, attrs):
            # I tried adding `try...except` here to inspect the current tag and
            # attrs, but Python never enters the except branch. Why? The error
            # message says the problem was found in a start tag.
            try:
                pass  # some code that processes the start tag...
            except HTMLParseError, e:
                print "e: ", e, '\n'
                print 'tag: ', tag, '\n'
                print 'attrs: ', attrs, '\n'
                exit(1)

        def handle_endtag(self, tag):
            pass  # some code that processes end tags...

    geturl = ParsePage()
    # I can catch the HTMLParseError if I wrap the following line in
    # `try...except`, but I don't know how to get useful information out of
    # the exception once I have caught it.
    geturl.feed(cur_page)
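
To make the second approach concrete, here is a rough sketch of the kind of wrapper I have in mind around `feed()`. It is only a sketch under assumptions: `feed_with_context`, `DemoParser`, and the example.com URLs are names I made up for illustration, and I am assuming that the reported line number (the exception's `lineno` if it has one, otherwise the first element of `getpos()`) counts newline-separated lines of the page that was fed in. `DemoParser` just simulates a failure so the wrapper can be tried without the real page.

```python
try:
    from HTMLParser import HTMLParser  # Python 2, as used in the question
except ImportError:
    from html.parser import HTMLParser  # Python 3


def feed_with_context(parser, page, url):
    """Feed `page` to `parser`; on failure return (url, lineno, source line).

    Hypothetical helper, not part of HTMLParser. Returns None on success.
    """
    try:
        parser.feed(page)
    except Exception as exc:  # HTMLParseError on Python 2; kept broad here
        # Use the exception's `lineno` when present, otherwise fall back to
        # the parser's own position, which getpos() gives as (line, offset).
        lineno = getattr(exc, "lineno", None) or parser.getpos()[0]
        # Assumption: lines are counted by "\n", so split the same way.
        lines = page.split("\n")
        line = lines[lineno - 1].rstrip("\r") if 0 < lineno <= len(lines) else ""
        return url, lineno, line
    return None


class DemoParser(HTMLParser):
    """Stand-in that fails on a <bad> tag, only to exercise the helper."""

    def handle_starttag(self, tag, attrs):
        if tag == "bad":
            raise ValueError("simulated failure in start tag")


demo_page = "<p>\n<bad>\n</p>"
report = feed_with_context(DemoParser(), demo_page, "http://example.com/page")
ok_report = feed_with_context(DemoParser(), "<p>fine</p>", "http://example.com/ok")
```

Passing the URL in alongside the page content is the only way I can think of to know which page was being parsed, since the parser itself only ever sees the text. I am not sure whether this is the intended way to get at the failing line.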
Thanks for your help.