python - Finding specific comments in HTML code using Python

Question

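I can't find a specific comment in Python, for example <!--why-->. My main goal is to find all the links between two specific comments, something like a parser. I tried this with BeautifulSoup: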

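from urllib import urlopen
from bs4 import BeautifulSoup

over = urlopen("http://www.gamespot.com").read()
soup = BeautifulSoup(over)
print soup.find("<!--why-->")  # find() matches tag names, so this prints None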

But it doesn't work. I think I may have to use regex instead of BeautifulSoup.

Please help.

Example: we have HTML code like this

<!--why-->
www.godaddy.com
<p> nice one</p>
www.wwf.com
<!-- why not-->

Edit: between the two comments there may be other things, such as tags.

I need to store all the links.

score 6 · Accepted Answer

If you want all the comments, you can pass a callable to findAll:

>>> from bs4 import BeautifulSoup, Comment
>>> 
>>> s = """
... <p>header</p>
... <!-- why -->
... www.test1.com
... www.test2.org
... <!-- why not -->
... <p>tail</p>
... """
>>> 
>>> soup = BeautifulSoup(s)
>>> comments = soup.findAll(text = lambda text: isinstance(text, Comment))
>>> 
>>> comments
[u' why ', u' why not ']

Once you have them, you can use the usual tricks to move around:

>>> comments[0].next
u'\nwww.test1.com\nwww.test2.org\n'
>>> comments[0].next.split()
[u'www.test1.com', u'www.test2.org']

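Depending on what the page actually looks like, you may need to tweak this a bit, and you will have to pick out the comments you want, but this should get you started.

Edit:

If you really only want the ones that match some specific text, you can do something like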
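For the case from the question's edit, where tags can sit between the two comments, one possible tweak is to walk next_elements from the opening comment until the closing one shows up. This is only a rough sketch, written for Python 3 (the session above is Python 2). The helper name text_between_comments is made up for illustration, and treating every token that starts with "www." as a link is an assumption about the page:

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Sample markup copied from the question.
HTML = """
<!--why-->
www.godaddy.com
<p> nice one</p>
www.wwf.com
<!-- why not-->
"""


def text_between_comments(html, start, end):
    """Return whitespace-separated tokens found between two comments.

    Hypothetical helper: `start` and `end` are the comment texts,
    compared after stripping surrounding whitespace.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Locate the opening comment anywhere in the document.
    first = next(
        (node for node in soup.descendants
         if isinstance(node, Comment) and node.strip() == start),
        None,
    )
    if first is None:
        return []

    tokens = []
    for node in first.next_elements:
        if isinstance(node, Comment):
            if node.strip() == end:
                break  # reached the closing comment
            continue  # skip unrelated comments
        if isinstance(node, NavigableString):
            # Text nodes inside nested tags are visited too.
            tokens.extend(node.split())
    return tokens


tokens = text_between_comments(HTML, "why", "why not")
# Assumption: anything starting with "www." counts as a link.
links = [t for t in tokens if t.startswith("www.")]
print(links)
```

Tag objects themselves are skipped; only their text children are collected, which is why the words inside the p element land in tokens and have to be filtered out afterwards.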

>>> comments = soup.findAll(text = lambda text: isinstance(text, Comment) and text.strip() == 'why')
>>> comments
[u' why ']

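Or you can filter the full list of comments (the one from the first findAll call) after the fact with a list comprehension: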

>>> [c for c in comments if c.strip().startswith("why")]
[u' why ', u' why not ']
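If BeautifulSoup is not available, roughly the same idea can be sketched with the standard library's html.parser instead of regex. This is a different technique from the one in the answer, written for Python 3, and the class name BetweenComments is made up for illustration:

```python
from html.parser import HTMLParser

# Sample markup copied from the question.
SAMPLE = """
<!--why-->
www.godaddy.com
<p> nice one</p>
www.wwf.com
<!-- why not-->
"""


class BetweenComments(HTMLParser):
    """Collect text tokens that appear between two given comments."""

    def __init__(self, start, end):
        super().__init__()
        self.start = start
        self.end = end
        self.inside = False  # True while between the two comments
        self.tokens = []

    def handle_comment(self, data):
        text = data.strip()
        if not self.inside and text == self.start:
            self.inside = True
        elif self.inside and text == self.end:
            self.inside = False

    def handle_data(self, data):
        # Called for text nodes, including text nested inside tags.
        if self.inside:
            self.tokens.extend(data.split())


parser = BetweenComments("why", "why not")
parser.feed(SAMPLE)
parser.close()
print(parser.tokens)
```

As with the BeautifulSoup version, the tokens still include ordinary words from tags in between, so you would filter them down to whatever counts as a link on your page.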
