python - BeautifulSoup fails to parse when it encounters unescaped angle brackets

Question

I'm having trouble loading a page that contains literal (unescaped) email addresses in angle brackets, for example:

<html>
    <head>
            <title>Testing</title>
    </head>
    <body>
            <p>Testing testing.</p>
            <p>This is an email address for <joe@somewhere.com></p>
    </body>
</html>

Parsing fails when it reaches that block:

File "/tools/oss/packages/x86_64-rhel5/python/2.7.1/lib/python2.7/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParseError: malformed start tag, at line 748, column 82

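I can't believe I'm the first person to run into this, but I couldn't immediately find any help or useful documentation. Am I missing something obvious?

Thanks,

-- Paul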

score 0 · Accepted Answer

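It never fails: as soon as you post a question, you suddenly find the answer.

It looks like I hit the bug described in http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=516824 - updating to a later version of BeautifulSoup does fix the problem.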

score -1 · Accepted Answer

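This is a common problem with BeautifulSoup. It doesn't handle malformed tags, because it uses regular expressions to detect tags.

Try lxml for Python. It's worth the effort, since its interface is similar to BeautifulSoup's.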

http://lxml.de/elementsoup.html

PS: Updating BeautifulSoup may also help.
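If neither upgrading nor switching parsers is an option, one possible workaround is to escape the stray brackets before the markup reaches the parser. The following is only a minimal sketch using the standard library: it assumes the malformed "tags" are always email addresses wrapped in angle brackets, and both the pattern and the `escape_email_brackets` helper are hypothetical names introduced here for illustration, not part of BeautifulSoup or lxml.

```python
import re

# Hypothetical pattern (an assumption, not from the original thread):
# an email address wrapped in literal angle brackets, e.g. <joe@somewhere.com>.
# Valid HTML tag names do not contain "@", so ordinary tags are left alone.
EMAIL_IN_BRACKETS = re.compile(r"<([^<>\s@]+@[^<>\s@]+\.[^<>\s@]+)>")


def escape_email_brackets(markup):
    """Rewrite <user@host> as &lt;user@host&gt; so a parser sees plain text."""
    return EMAIL_IN_BRACKETS.sub(r"&lt;\1&gt;", markup)


html = "<p>This is an email address for <joe@somewhere.com></p>"
cleaned = escape_email_brackets(html)
print(cleaned)
```

The cleaned string can then be handed to whichever parser is in use. This only covers the email-address case shown in the question; other kinds of malformed markup would still need a more tolerant parser.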
