python - Scraping forms on malformed web pages

Question

I am trying to scrape an HTML form using robobrowser with Python 3.4. I use the default HTML parser:

self._browser = RoboBrowser(history=True, parser="html.parser")

This works for well-formed web pages, but now I have to parse a badly written page. Here is the HTML fragment:

<form method="post"  action="decide.php?act=submit_advance">
    <table  class="td_advanced">
    <tr class="td_advance">
    <td colspan="4" class="td_advance"></strong><br></td>
    <td colspan="3" class="td_left">Case sensitive:<br><br></td>
    <td><input type="checkbox" name="case_sensitive" /><br><br></td>
[...]
</form>

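The closing strong tag is incorrect: it has no matching opening tag. This error prevents the parser from reading any of the inputs that come after the stray tag: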

form = self._browser.get_form()
print(form)
>>> <RoboForm>

Any suggestions?
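
One way to check that the checkbox really is present in the raw markup, independent of how a tree-building parser recovers from the stray tag, is to tokenize the fragment with the standard library's event-based HTMLParser. This is only a diagnostic sketch: the class name InputCollector is made up for illustration, and the fragment is shortened (the elided rows are left out and the table is closed by hand).

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Hypothetical helper: records the name of every <input> start tag."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input ... /> are also routed here
        # by the default handle_startendtag implementation.
        if tag == "input":
            self.names.append(dict(attrs).get("name"))


# Shortened copy of the fragment from the question (assumption: the
# elided rows are omitted and the table is closed manually).
fragment = """<form method="post" action="decide.php?act=submit_advance">
<table class="td_advanced">
<tr class="td_advance">
<td colspan="4" class="td_advance"></strong><br></td>
<td colspan="3" class="td_left">Case sensitive:<br><br></td>
<td><input type="checkbox" name="case_sensitive" /><br><br></td>
</tr></table>
</form>"""

collector = InputCollector()
collector.feed(fragment)
collector.close()
print(collector.names)  # ['case_sensitive']
```

If the name shows up here but not in the form returned by get_form(), the field is being lost while the document tree is built, not because it is missing from the page.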

score 0 · Accepted Answer

I found the solution myself. The comment about beautifulsoup was helpful and pointed my search in the right direction.

The solution is to use a different HTML parser. I tried lxml, and it works for me.

self._browser = RoboBrowser(history=True, parser="lxml")

Since PyPI currently has no lxml installer that works with my Python version, I downloaded it from here: http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
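
Because lxml may not be installed in every environment, the parser name could also be chosen at runtime. The following is a minimal sketch under stated assumptions: pick_parser is a hypothetical helper and not part of robobrowser, and the preference order (lxml, then html5lib, then the built-in html.parser) is just one reasonable choice.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Hypothetical helper: return the first installed parser name,
    falling back to the standard library's 'html.parser'."""
    for name in preferred:
        # find_spec returns None when the top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


parser_name = pick_parser()
print(parser_name)

# Assumed usage with robobrowser, mirroring the call in the answer:
# from robobrowser import RoboBrowser
# browser = RoboBrowser(history=True, parser=parser_name)
```

Note that the fallback to html.parser brings back the original behaviour on this page, so it may be worth logging which parser was selected.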
