Questions tagged "html5lib" - Stack Overflow Chinese site

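0 votes

2 answers

121 views

python - Why is the text of an HTML node empty when using HTMLParser?

In the following example, I expect to get the text Foo from the <h2>:

Unfortunately, I get ''. Why?

Strangely, foo is in the text:

So where is Foo?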

2019-08-06T12:12:28.250

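0 votes

1 answer

127 views

python - How can I check which line in the HTML triggers an error?

I have the following code that removes duplicate paragraphs from an HTML file.

It almost works, but for some elements I get this error

Is there a way to print the line number in the HTML file where the error occurs, so that I can check how that part is formatted?

The elements that the code handles without problems are structured like this

But because the file is fairly large, I have not identified the structure of the element where the code gets stuck.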
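The question's own code is not shown, so the following is only a minimal sketch of one way to map elements back to line numbers, using the standard library's `html.parser.HTMLParser` and its `getpos()` method. The class name `ParagraphLocator`, the function `locate_paragraphs`, and the sample markup are hypothetical and not taken from the question.

```python
from html.parser import HTMLParser


class ParagraphLocator(HTMLParser):
    """Record the source line of every <p> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraph_lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # getpos() returns (line number, column offset) of the current tag.
            lineno, _offset = self.getpos()
            self.paragraph_lines.append(lineno)


def locate_paragraphs(html_text):
    """Return the line numbers on which <p> tags start."""
    parser = ParagraphLocator()
    parser.feed(html_text)
    parser.close()
    return parser.paragraph_lines


# Invented sample document, one tag per line.
sample = "<html>\n<body>\n<p>first</p>\n<div>\n<p>second</p>\n</div>\n</body>\n</html>"
print(locate_paragraphs(sample))
```

If the processing happens in Beautiful Soup instead, recent releases document a `sourceline` attribute on tags for the `html.parser` and `html5lib` parsers; printing it inside a `try`/`except` around the failing step would likely serve the same purpose.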

python python-3.x debugging beautifulsoup html5lib

2020-03-04T19:54:29.027

0 votes

0 answers

851 views

python - Error when running beautifulsoup (module 'html5lib.treebuilders' has no attribute '_base')

I am new to programming and Python. I am trying to install BeautifulSoup on Python 3 to learn web scraping for a MOOC (using Jupyter Notebooks as the IDE). When I run from bs4 import BeautifulSoup, I get the following error

AttributeError Traceback (most recent call last)
in <module>
      3 import ssl
      4 print("done2")
----> 5 from bs4 import BeautifulSoup
      6 print("done3")

~\Desktop\py4e\code3\code3\bs4\__init__.py in <module>
     28 import warnings
     29
---> 30 from .builder import builder_registry, ParserRejectedMarkup
     31 from .dammit import UnicodeDammit
     32 from .element import (

~\Desktop\py4e\code3\code3\bs4\builder\__init__.py in <module>
    312 register_treebuilders_from(_htmlparser)
    313 try:
--> 314     from . import _html5lib
    315     register_treebuilders_from(_html5lib)
    316 except ImportError:

~\Desktop\py4e\code3\code3\bs4\builder\_html5lib.py in <module>
     68
     69
---> 70 class TreeBuilderForHtml5lib(html5lib.treebuilders._base.TreeBuilder):
     71
     72     def __init__(self, soup, namespaceHTMLElements):

AttributeError: module 'html5lib.treebuilders' has no attribute '_base'

I tried the following solutions:

1) pip install html5lib==0.9999999
2) pip install --upgrade html5lib==1.0b8
3) pip install --upgrade bleach==1.4.2
4) updating BeautifulSoup (pip install)
5) reinstalling the latest version of html5lib after downgrading; this did not work either

Thank you very much for your help!
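The traceback shows `bs4` being loaded from `~\Desktop\py4e\code3\code3\bs4`, which looks like a copy bundled with the course files and not the pip-installed package. Later html5lib releases renamed `treebuilders._base` to `treebuilders.base`, so an older bundled `bs4` would likely fail in exactly this way regardless of which `bs4` pip installs. The sketch below, using only the standard library (Python 3.8+), reports which versions are installed and which file `bs4` would actually be imported from. The helper names `installed_version` and `module_origin` are hypothetical.

```python
import importlib.util
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def module_origin(module_name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec is not None else None


for name in ("beautifulsoup4", "html5lib"):
    print(name, installed_version(name))

# If this path points into the course folder, the local copy shadows
# the pip-installed package.
print("bs4 imported from:", module_origin("bs4"))
```

If the reported origin is inside the working directory, running the notebook from a different folder or renaming the local `bs4` directory would be one thing to try.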

python python-3.x beautifulsoup jupyter-notebook html5lib

2020-04-24T10:46:14.037

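0 votes

2 answers

100 views

python - Scraping a table from a website with Python and trying to get the hyperlink of the content along with its text

I am learning Python and trying to scrape a table from https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-1-company.html. In this table you can see 4 columns: "CIN", "Company Name", "Roc", and "Status". As you can see, "Company Name" is a hyperlink, and I need 5 columns: "CIN", "Company Name", "Company Link", "Roc", and "Status". I wrote code for this, but I only get 4 columns, and instead of "Company Link" I get a different result. I am sharing a screenshot of my output csv file.

Please help me scrape this table into the 5 columns "CIN", "Company Name", "Company Link", "Roc", and "Status". Here is my code; please find the image of my output csv file.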
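Since the asker's code is not shown, here is a minimal sketch of the general idea: while walking each table row, keep the cell text and also capture the `href` of the link inside the row, then emit it as an extra column. It uses only the standard library, not BeautifulSoup, and it runs on an invented sample table. The class `TableLinkParser`, the function `table_to_rows`, the base URL, and all sample values are hypothetical, and the real page's markup may differ.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin


class TableLinkParser(HTMLParser):
    """Collect each table row as cell texts plus the first link in the row."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.rows = []       # finished rows as (cells, link)
        self._cells = None   # cells of the current row, None outside <tr>
        self._text = None    # text pieces of the current cell, None outside <td>
        self._link = ""

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._link = [], ""
        elif tag == "td" and self._cells is not None:
            self._text = []
        elif tag == "a" and self._text is not None and not self._link:
            href = dict(attrs).get("href")
            if href:
                self._link = urljoin(self.base_url, href)

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            self._cells.append("".join(self._text).strip())
            self._text = None
        elif tag == "tr" and self._cells is not None:
            if self._cells:  # rows with only <th> cells are skipped
                self.rows.append((self._cells, self._link))
            self._cells, self._text = None, None


def table_to_rows(html_text, base_url=""):
    """Return rows with the link inserted right after the second cell."""
    parser = TableLinkParser(base_url)
    parser.feed(html_text)
    parser.close()
    return [cells[:2] + [link] + cells[2:] for cells, link in parser.rows]


# Invented sample markup; the values are placeholders, not real data.
sample = (
    "<table><tr><th>CIN</th><th>Company Name</th><th>Roc</th><th>Status</th></tr>"
    '<tr><td>CIN001</td><td><a href="/company/acme">ACME LTD</a></td>'
    "<td>RoC-Delhi</td><td>Active</td></tr></table>"
)
rows = table_to_rows(sample, "https://example.com/list/")

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["CIN", "Company Name", "Company Link", "Roc", "Status"])
writer.writerows(rows)
print(buffer.getvalue())
```

The same approach carries over to BeautifulSoup: read the text and the `href` of the anchor separately for the name cell, instead of converting the whole cell to a single string.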

python web-scraping beautifulsoup python-requests html5lib

2020-07-09T19:40:42.990

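0 votes

1 answer

768 views

python - Error "html5lib not found" when using the pandas.read_html() function in a conda env

Current code:

I want to use the pandas.read_html() function with the 'flavor' argument set to 'bs4' or 'html5lib' to extract HTML from a page. I get the error: ImportError: html5lib not found, please install it.

But I definitely have bs4 and html5lib installed in the environment. After running the conda list command:

I don't know why the pandas function does not recognize these packages. Several other posts deal with the same problem, but none of the solutions worked for me.

For example, posts like this one: Python: ImportError: lxml not found, please install it and

The answers above suggest installing the packages with pip3. When I run those commands, I get the following.

I would appreciate any help or references to similar problems!

Thanks!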
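One common cause of this symptom is that the interpreter running the code is not the one that `conda list` describes, for example when a notebook kernel points at a different environment. That is an assumption about this case, not something the question confirms. A minimal standard-library sketch to check it from inside the failing session follows; the helper name `find_missing` is hypothetical.

```python
import importlib.util
import sys


def find_missing(modules):
    """Return the names from `modules` that this interpreter cannot locate."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


# The path printed here should sit inside the conda environment
# where bs4 and html5lib were installed.
print("interpreter:", sys.executable)
print("missing:", find_missing(["pandas", "bs4", "html5lib", "lxml"]))
```

If the interpreter path is outside the expected environment, installing with `python -m pip install html5lib` from that same interpreter, or switching the kernel, would be the next thing to try.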

python pandas beautifulsoup html5lib

2020-07-14T20:00:55.120

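0 votes

1 answer

574 views

python - I am trying to click the expand button and then scrape the table

I am scraping the table on https://csr.gov.in/companyprofile.php?year=FY+2015-16&CIN=L00000CH1990PLC010573 but I am not getting the exact result I am looking for. I want 11 columns from this link. The first 7 are "Company Name", "Class", "State", "Company Type", "RoC", "Sub Category", and "Listing Status". After those you can see an expand button, "CSR Details of FY 2017-18"; when you click it you get 4 more columns: "Average Net Profit", "CSR Prescribed Expenditure", "CSR Spent", and "Local Spent". I want all of these columns in a csv file. I wrote code for this, but it does not work properly. I am attaching a picture of the result for reference. Here is my code.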