python - Finding tags within a defined scope using BeautifulSoup

Question

I am using BeautifulSoup to extract data.

I have an HTML file like this:

<div class=a>
<a href='google.com'>a</a>
</div>
<div class=b>
<a href='google.com'>c</a>
<a href='google.com'>d</a>
</div>

I want to extract the data 'c' and 'd'; I do not need 'a'.

So I do this:

google_list = soup.findAll('a', href='google.com')
for item in google_list:
    print(item.string)

This prints a, c, d. So my question is: how can I print 'c' and 'd' without 'a'?

score 4 · Accepted Answer

You can select the div whose class is b, then run your original query on that tag so that it searches only that div's descendants:

div = soup.find_all('div', {"class":"b"})[0]
items = div.find_all('a', href="google.com")
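For reference, here is a minimal self-contained sketch of the same idea, run against the HTML from the question. It assumes Python 3 with the bs4 package and uses the built-in "html.parser" backend; the variable names scoped and selected are only illustrative. It also shows a CSS-selector form via select(), which should give the same result in one step.

```python
from bs4 import BeautifulSoup

html = """<div class=a>
<a href='google.com'>a</a>
</div>
<div class=b>
<a href='google.com'>c</a>
<a href='google.com'>d</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Narrow the scope to the first div whose class is "b" ...
div = soup.find("div", class_="b")
# ... then run the original query on that tag only.
scoped = [a.get_text() for a in div.find_all("a", href="google.com")]

# Alternative: a CSS selector that expresses the same scope in one query.
selected = [a.get_text() for a in soup.select('div.b a[href="google.com"]')]

print(scoped)
print(selected)
```

Note that find() returns None when no matching div exists, so real code would probably want to check for that before calling find_all on the result.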

score 1 · Accepted Answer

I stopped using Beautiful Soup a few years ago and now prefer the lxml library. Its HTML parser is lenient, and it also supports XPath queries.

import lxml.html

html = """<div class=a>
<a href='google.com'>a</a>
</div>
<div class=b>
<a href='google.com'>c</a>
<a href='google.com'>d</a>
</div>
"""
# Parse the fragment and get the enclosing tree so the absolute XPath query can run
root = lxml.html.fromstring(html).getroottree()
print(root.xpath("//div[@class='b']/a[@href='google.com']/text()"))
# ['c', 'd']

This returns the text of every anchor whose href is "google.com" and that is a direct child of any div whose class attribute is "b".
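The same query can also be split into two steps, which mirrors the "find the scope first, then search inside it" approach from the BeautifulSoup answer. This is only a sketch under the assumption that lxml is installed; the names tree, divs and texts are illustrative. The leading "." in the second expression matters: without it, an XPath starting with "//" searches the whole document even when called on a single element.

```python
import lxml.html

html = """<div class=a>
<a href='google.com'>a</a>
</div>
<div class=b>
<a href='google.com'>c</a>
<a href='google.com'>d</a>
</div>
"""

tree = lxml.html.fromstring(html).getroottree()

# Step 1: select the div elements that define the scope.
divs = tree.xpath("//div[@class='b']")

# Step 2: query relative to each div; ".//" keeps the search inside that div.
texts = [
    text
    for div in divs
    for text in div.xpath(".//a[@href='google.com']/text()")
]

print(texts)
```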
