python - BeautifulSoup is not reading malformed html

Translated from: https://stackoverflow.com/questions/15290991 2013-03-08T09:46:05.567

Viewed 210 times

I am learning BeautifulSoup. It does not read some websites correctly. I found that the cause is that some HTML attributes are malformed. For example:

from bs4 import BeautifulSoup

html = """
        <html>
        <head><title>Test</title></head>
        <body>
        <p id="paraone"align="center">some content <b>para1</b>.<!--there is no space before 'align' attribute -->
        <p id="paratwo" align="blah">some content <b>para2</b>
        </html>
    """
# No parser is named here, so bs4 picks the best one that is installed.
soup = BeautifulSoup(html)
print("soup:", soup)

I think BeautifulSoup is designed not to read malformed HTML. If so, is there any other module that can read the HTML given above? Is there no way to parse malformed websites?
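One way to narrow this down is to check what the underlying parser actually sees for the tag with the missing space. The sketch below uses only the standard library's `html.parser` module, which is the module that bs4's `"html.parser"` backend builds on. The class name `ParagraphAttrs` and the variable `collector` are hypothetical names chosen for this illustration, and the behavior is assumed to be that of a recent Python 3; older interpreters may tokenize the first `<p>` differently.

```python
from html.parser import HTMLParser

html = """
<html>
<head><title>Test</title></head>
<body>
<p id="paraone"align="center">some content <b>para1</b>.
<p id="paratwo" align="blah">some content <b>para2</b>
</html>
"""


class ParagraphAttrs(HTMLParser):
    """Hypothetical helper: record the attributes of every <p> start tag."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "p":
            self.found.append(dict(attrs))


collector = ParagraphAttrs()
collector.feed(html)
collector.close()

# Print the attributes seen for each paragraph, in document order
for attrs in collector.found:
    print(attrs)
```

If both paragraphs are printed with their attributes, the tokenizer itself copes with the missing space, and the problem more likely lies in which parser backend bs4 selected on the machine in question or in the interpreter version.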
