python - Scraping different elements with BeautifulSoup: avoiding duplicates in nested elements

Question

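I want to use BeautifulSoup4 to fetch different content (classes) from a locally saved website (the Python documentation), so I use the code below to do that. (index.html is a saved copy of this page: https://docs.python.org/3/library/stdtypes.html)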

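from bs4 import BeautifulSoup

# Name the parser explicitly and let `with` close the input file.
with open("index.html") as source:
    soup = BeautifulSoup(source, "html.parser")

classes = soup.find_all('dl', attrs={'class': ['class', 'method', 'function', 'describe', 'attribute', 'data', 'classmethod', 'staticmethod']})

# Mode 'w' already truncates the file, so no truncate() call is needed.
with open('test.html', 'w') as f:
    print(classes, file=f)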

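The file handling is only there to output the results and has no bearing on the problem itself.

My problem is that the results are nested. For example, the method "__eq__ (exporter)" is found twice: 1. inside the class and 2. as a standalone method.

So I want to remove every result that sits inside another result, so that all results are on the same level. How can I do this? Or is it even possible to "ignore" that content in the first step? I hope you understand what I mean.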

score 1 · Accepted Answer

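You cannot tell find_all to ignore nested dl elements; all you can do is skip any match that appears in the .descendants of a match you have already kept: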

wanted = ['class', 'method', 'function', 'describe', 'attribute', 'data', 'classmethod', 'staticmethod']
matches = []
for dl in soup.find_all('dl', attrs={'class': wanted}):
    if any(dl in m.descendants for m in matches):
        # child of an already found element
        continue
    matches.append(dl)
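The same "outermost only" filter can be tried on a tiny document. The sketch below is a variation rather than the code above: it walks each match's .parents and compares by identity (`id()`), which avoids relying on bs4's `==` for tags (as far as I know, that compares name, attributes and contents, so two identical-looking elements would count as equal). The sample HTML, the `id` attributes and the helper name `top_level_matches` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the documentation markup: a class that
# contains a method, followed by one standalone function.
HTML = """
<dl class="class" id="outer">
  <dt>class Exporter</dt>
  <dd>
    <dl class="method" id="inner"><dt>__eq__(exporter)</dt><dd>Compare.</dd></dl>
  </dd>
</dl>
<dl class="function" id="alone"><dt>len(s)</dt><dd>Length.</dd></dl>
"""

WANTED = ["class", "method", "function"]


def top_level_matches(soup, classes):
    """Return matching <dl> elements that are not inside another match."""
    found = soup.find_all("dl", class_=classes)
    found_ids = {id(tag) for tag in found}
    # Keep a tag only if none of its ancestors is itself a match.
    return [
        tag for tag in found
        if not any(id(parent) in found_ids for parent in tag.parents)
    ]


soup = BeautifulSoup(HTML, "html.parser")
top = top_level_matches(soup, WANTED)
print([tag["id"] for tag in top])  # ['outer', 'alone']
```

Building the set of ids once keeps the check to a single walk up the tree per match, instead of iterating every kept match's descendants for each candidate.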

If you want the nested elements and not their parents, use:

wanted = ['class', 'method', 'function', 'describe', 'attribute', 'data', 'classmethod', 'staticmethod']
matches = []
for dl in soup.find_all('dl', attrs={'class': wanted}):
    # drop any earlier match that contains this element
    matches = [m for m in matches if dl not in m.descendants]
    matches.append(dl)
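For comparison, here is a minimal sketch of the "innermost only" variant on a toy document. It is an identity-based variation, not the code above: it collects every ancestor of every match and then drops any match that turns up in that set. The sample HTML, the `id` attributes and the name `innermost_matches` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the documentation markup.
HTML = """
<dl class="class" id="outer">
  <dt>class Exporter</dt>
  <dd>
    <dl class="method" id="inner"><dt>__eq__(exporter)</dt><dd>Compare.</dd></dl>
  </dd>
</dl>
<dl class="function" id="alone"><dt>len(s)</dt><dd>Length.</dd></dl>
"""

WANTED = ["class", "method", "function"]


def innermost_matches(soup, classes):
    """Return matching <dl> elements that contain no other match."""
    found = soup.find_all("dl", class_=classes)
    # ids of every ancestor of every match; a match whose id is in
    # this set encloses another match and is therefore a parent.
    ancestor_ids = {id(parent) for tag in found for parent in tag.parents}
    return [tag for tag in found if id(tag) not in ancestor_ids]


soup = BeautifulSoup(HTML, "html.parser")
leaves = innermost_matches(soup, WANTED)
print([tag["id"] for tag in leaves])  # ['inner', 'alone']
```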

If you want to pull the tree apart and remove the elements from it, use:

wanted = ['class', 'method', 'function', 'describe', 'attribute', 'data', 'classmethod', 'staticmethod']
matches = soup.find_all('dl', attrs={'class': wanted})
for element in matches:
    # detach from the tree, and so from any enclosing `dl` match too
    element.extract()
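To see what extract() does to nested matches, the sketch below runs the same loop on a toy document and then reads the text of each detached element. The sample HTML is invented for illustration. Because find_all returns matches in document order, the outer element is detached first and the inner one is then pulled out of it, so the outer element's text no longer includes the nested method.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the documentation markup.
HTML = """
<dl class="class" id="outer">
  <dt>class Exporter</dt>
  <dd>
    <dl class="method" id="inner"><dt>__eq__(exporter)</dt><dd>Compare.</dd></dl>
  </dd>
</dl>
<dl class="function" id="alone"><dt>len(s)</dt><dd>Length.</dd></dl>
"""

soup = BeautifulSoup(HTML, "html.parser")
matches = soup.find_all("dl", class_=["class", "method", "function"])

for element in matches:
    # Detach the element from whatever tree currently holds it.
    element.extract()

# Each match is now a standalone fragment without its nested matches.
texts = [m.get_text(" ", strip=True) for m in matches]
print(texts)
# ['class Exporter', '__eq__(exporter) Compare.', 'len(s) Length.']

# Nothing is left in the original document.
print(soup.find("dl"))  # None
```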

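After this, a parent's text no longer includes its extracted children, so you may want to adjust your text extraction accordingly.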
